mineti.dev

The Feynman experience · Part 4

Simple linear regression

evergreen · Mar 16, 2020

Introduction

We want to model a linear relationship between a feature xx and a scalar target yy. For a friendly visual intuition before the math, Luis Serrano’s explanation is excellent.1

The model

We model the target as

Y=β0+β1x+ϵY = \beta_0 + \beta_1 x + \epsilon

where ϵNormal(0,σ2)\epsilon \sim \text{Normal}(0, \sigma^2). Because YY is defined through ϵ\epsilon, it is itself a random variable, with expected value

E(Yx)=E(β0+β1x+ϵ)=(β0+β1x)+E(ϵ)=β0+β1x\begin{aligned} E(Y \mid x) &= E(\beta_0 + \beta_1 x + \epsilon) \\ &= (\beta_0 + \beta_1 x) + E(\epsilon) \\ &= \beta_0 + \beta_1 x \end{aligned}

and variance

V(Yx)=V(β0+β1x+ϵ)=V(ϵ)=σ2\begin{aligned} V(Y \mid x) &= V(\beta_0 + \beta_1 x + \epsilon) \\ &= V(\epsilon) = \sigma^2 \end{aligned}

Note the variance of YY is the same for every xx — a property called homogeneity of variance, or homoscedasticity.2

Let’s generate some data to work with. As throughout this series, we work from scratch in plain Python with NumPy and SciPy:

import numpy as np

np.random.seed(51)

real_beta_0, real_beta_1, real_sigma2 = 2, 3, 1
n = 100

x = np.linspace(0, 5, n)
y = real_beta_0 + real_beta_1 * x + real_sigma2 * np.random.randn(n)
Scatter plot of 100 generated points with a clear upward linear trend.
The generated data: 100 points following y = 2 + 3x plus normal noise.

Parameter estimation

Given a sample {(x1,y1),,(xn,yn)}\{(x_1, y_1), \dots, (x_n, y_n)\}, we estimate β0\beta_0 and β1\beta_1 by least squares. The error metric is the residual sum of squares:3

RSS(β0,β1)=i=1n(yiβ0β1xi)2RSS(\beta_0, \beta_1) = \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2

and the estimates minimize it:

β^0,β^1=arg minβ0,β1i=1n(yiβ0β1xi)2\hat{\beta}_0, \hat{\beta}_1 = \operatorname*{arg\,min}_{\beta_0, \beta_1} \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2

Differentiating with respect to each parameter and setting the result to zero — first for β0\beta_0,

RSSβ0=2i=1n(yiβ^0β^1xi)=0i=1nyinβ^0β^1i=1nxi=0nβ^0+β^1nxˉ=nyˉ\begin{aligned} \frac{\partial RSS}{\partial \beta_0} = -2\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) &= 0 \\ \sum_{i=1}^{n} y_i - n\hat{\beta}_0 - \hat{\beta}_1 \sum_{i=1}^{n} x_i &= 0 \\ n\hat{\beta}_0 + \hat{\beta}_1 n\bar{x} &= n\bar{y} \end{aligned}

and then for β1\beta_1,

RSSβ1=2i=1n(yiβ^0β^1xi)xi=0β^0nxˉ+β^1i=1nxi2=i=1nyixi\begin{aligned} \frac{\partial RSS}{\partial \beta_1} = -2\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)x_i &= 0 \\ \hat{\beta}_0 n\bar{x} + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 &= \sum_{i=1}^{n} y_i x_i \end{aligned}

where xˉ=1nixi\bar{x} = \frac{1}{n}\sum_i x_i and yˉ=1niyi\bar{y} = \frac{1}{n}\sum_i y_i. These two normal equations form a linear system,

nβ^0+nxˉβ^1=nyˉnxˉβ^0+(i=1nxi2)β^1=i=1nyixi\begin{aligned} n\hat{\beta}_0 + n\bar{x}\,\hat{\beta}_1 &= n\bar{y} \\ n\bar{x}\,\hat{\beta}_0 + \left(\textstyle\sum_{i=1}^{n} x_i^2\right)\hat{\beta}_1 &= \sum_{i=1}^{n} y_i x_i \end{aligned}

whose solution is

β^0=yˉβ^1xˉβ^1=i=1nyi(xixˉ)i=1n(xixˉ)2\begin{aligned} \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \\ \hat{\beta}_1 &= \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \end{aligned}

Plugging in our data:

beta_1 = np.sum(y * (x - np.mean(x))) / np.sum((x - np.mean(x)) ** 2)
beta_0 = np.mean(y) - beta_1 * np.mean(x)

np.round([beta_0, beta_1], 3)
# → array([1.925, 3.028])

Close to the true (2,3)(2, 3). We could also reach the same place with a different algorithm — gradient descent,4 say — or hand the objective to SciPy’s optimizer:

from scipy.optimize import minimize

def rss(beta, x, y):
    return np.sum((y - beta[0] - beta[1] * x) ** 2)

beta_0, beta_1 = minimize(rss, [0, 0], args=(x, y)).x

np.round([beta_0, beta_1], 3)
# → array([1.925, 3.028])
The generated points with the fitted regression line running through them.
The least-squares fit, ŷ = 1.925 + 3.028x.

Properties of the estimators

Since β^0\hat{\beta}_0 and β^1\hat{\beta}_1 are functions of the random sample, they are themselves random variables, with their own expected values and variances.56

Slope β^1\hat{\beta}_1

Writing β^1=iciYi\hat{\beta}_1 = \sum_i c_i Y_i with ci=xixˉj(xjxˉ)2c_i = \frac{x_i - \bar{x}}{\sum_j (x_j - \bar{x})^2},

E(β^1)=i=1nciE(Yi)=i=1nci(β0+β1xi)=β0i=1nci+β1i=1ncixi\begin{aligned} E(\hat{\beta}_1) &= \sum_{i=1}^{n} c_i\, E(Y_i) = \sum_{i=1}^{n} c_i (\beta_0 + \beta_1 x_i) \\ &= \beta_0 \sum_{i=1}^{n} c_i + \beta_1 \sum_{i=1}^{n} c_i x_i \end{aligned}

Now ici=0\sum_i c_i = 0 (the numerator i(xixˉ)\sum_i (x_i - \bar{x}) vanishes), and

i=1ncixi=i=1nxi(xixˉ)i=1n(xixˉ)2=i=1nxi2nxˉ2i=1nxi2nxˉ2=1\begin{aligned} \sum_{i=1}^{n} c_i x_i &= \frac{\sum_{i=1}^{n} x_i(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = 1 \end{aligned}

so E(β^1)=β1E(\hat{\beta}_1) = \beta_1 — the estimator is unbiased. Its variance, using V(Yi)=σ2V(Y_i) = \sigma^2, is

V(β^1)=i=1nci2V(Yi)=σ2i=1n(xixˉ)2V(\hat{\beta}_1) = \sum_{i=1}^{n} c_i^2\, V(Y_i) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}

In practice we rarely know σ2\sigma^2, so we substitute the estimator σ^2=i(yiy^)2n2\hat{\sigma}^2 = \frac{\sum_i (y_i - \hat{y})^2}{n - 2}. Computing the variance for our data:

b1_var = real_sigma2 / np.sum((x - np.mean(x)) ** 2)
np.round(b1_var, 5)
# → 0.0047

To check it, we can estimate β1\beta_1 across a hundred fresh samples and look at the spread:

np.random.seed(51)
y_100 = real_beta_0 + real_beta_1 * x + real_sigma2 * np.random.randn(100, n)

b1 = [minimize(rss, [0, 0], args=(x, yi)).x[1] for yi in y_100]
Histogram of 100 slope estimates centered on 3, matching the expected normal sampling distribution.
The hundred slope estimates centre on the true slope β₁ = 3 with variance ≈ 0.0047, matching theory.

Intercept β^0\hat{\beta}_0

The same approach applies to the intercept. From β^0=yˉβ^1xˉ\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},

E(β^0)=E(yˉ)xˉE(β^1)=(β0+β1xˉ)β1xˉ=β0\begin{aligned} E(\hat{\beta}_0) &= E(\bar{y}) - \bar{x}\, E(\hat{\beta}_1) \\ &= (\beta_0 + \beta_1 \bar{x}) - \beta_1 \bar{x} = \beta_0 \end{aligned}

and, since Cov(Yˉ,β^1)=0\mathrm{Cov}(\bar{Y}, \hat{\beta}_1) = 0,

V(β^0)=V(Yˉ)+xˉ2V(β^1)=σ2(1n+xˉ2i=1n(xixˉ)2)\begin{aligned} V(\hat{\beta}_0) &= V(\bar{Y}) + \bar{x}^2 V(\hat{\beta}_1) \\ &= \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right) \end{aligned}
b0_var = (1 / n + np.mean(x) ** 2 / np.sum((x - np.mean(x)) ** 2)) * real_sigma2
np.round(b0_var, 5)
# → 0.03941
Histogram of 100 intercept estimates centered on 2, matching the expected normal sampling distribution.
The hundred intercept estimates centre on the true intercept β₀ = 2 with variance ≈ 0.039.

A confidence band for the mean response

The fitted mean at a point x0x_0, μ^Yx0=β^0+β^1x0\hat{\mu}_{Y|x_0} = \hat{\beta}_0 + \hat{\beta}_1 x_0, is also a random variable. Its expected value is

E(μ^Yx0)=E(β^0)+E(β^1)x0=β0+β1x0\begin{aligned} E(\hat{\mu}_{Y|x_0}) &= E(\hat{\beta}_0) + E(\hat{\beta}_1) x_0 = \beta_0 + \beta_1 x_0 \end{aligned}

and, writing it as yˉ+β^1(x0xˉ)\bar{y} + \hat{\beta}_1(x_0 - \bar{x}) and using Cov(yˉ,β^1)=0\mathrm{Cov}(\bar{y}, \hat{\beta}_1) = 0, its variance is

V(μ^Yx0)=σ2(1n+(x0xˉ)2i=1n(xixˉ)2)V(\hat{\mu}_{Y|x_0}) = \sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)

Turning that into a band around the fit:

y_hat = beta_0 + beta_1 * x

y_hat_var = (np.sum((y - y_hat) ** 2) / (n - 2)) * \
    (1 / n + (x - np.mean(x)) ** 2 / np.sum((x - np.mean(x)) ** 2))

y_hat_upper = y_hat + 1.96 * np.sqrt(y_hat_var)
y_hat_lower = y_hat - 1.96 * np.sqrt(y_hat_var)
The fitted line with a dashed 95% confidence band that is narrowest near the centre of the data and widens toward the edges.
The fit with a 95% confidence band for the mean response. The band is narrowest near the mean of x and flares at the edges.

The band is not parallel to the line — and the variance above tells us why: it grows with (x0xˉ)2(x_0 - \bar{x})^2, so our estimate of the mean is most precise near the centre of the data and least precise out at the extremes.

Coefficient of determination

The coefficient of determination, R2R^2, is the share of the target’s variance explained by the feature. It lies between 0 and 1:

R2=1i=1n(yiy^)2i=1n(yiyˉ)2R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)
np.round(r2, 2)
# → 0.95

So the model explains about 95% of the variance in yy — unsurprising, since we built the data from a line.

Interpretation

The coefficients read directly: a one-unit change in the predictor shifts the mean target by β1\beta_1, and when the predictor is 0 the mean target is β0\beta_0.

Conclusion

Linear regression is the standard first candidate for a regression problem, and a good choice when you want to understand how each predictor affects the target. The natural next step is more than one feature — see multiple linear regression, the next part of this series.

References

  1. Serrano, L. A Friendly Introduction to Linear Regression. https://www.youtube.com/watch?v=wYPUhge9w5c

  2. Homoscedasticity. Wikipedia. https://en.wikipedia.org/wiki/Homoscedasticity

  3. Residual sum of squares. Wikipedia. https://en.wikipedia.org/wiki/Residual_sum_of_squares

  4. Gradient descent. Wikipedia. https://en.wikipedia.org/wiki/Gradient_descent

  5. Montgomery, D. C., Peck, E. A. & Vining, G. G. Introduction to Linear Regression Analysis (5th ed.). (Wiley, 2012).

  6. Costa, M. A. Tópicos em Ciência dos Dados: Introdução aos Modelos Paramétricos. (Bonecker, 2019).

← All articles