The Feynman experience · Part 4
Simple linear regression evergreen · Mar 16, 2020
Introduction
TL;DR
We fit a straight line to data from scratch: derive the least-squares estimates of the slope and intercept by hand, recover them numerically, study how those estimates vary from sample to sample, and add a confidence band and R² .
We want to model a linear relationship between a feature x x x and a scalar target
y y y . For a friendly visual intuition before the math, Luis Serrano’s
explanation is excellent.1
The model
We model the target as
Y = β 0 + β 1 x + ϵ Y = \beta_0 + \beta_1 x + \epsilon Y = β 0 + β 1 x + ϵ
where ϵ ∼ Normal ( 0 , σ 2 ) \epsilon \sim \text{Normal}(0, \sigma^2) ϵ ∼ Normal ( 0 , σ 2 ) . Because Y Y Y is defined through
ϵ \epsilon ϵ , it is itself a random variable, with expected value
E ( Y ∣ x ) = E ( β 0 + β 1 x + ϵ ) = ( β 0 + β 1 x ) + E ( ϵ ) = β 0 + β 1 x \begin{aligned}
E(Y \mid x) &= E(\beta_0 + \beta_1 x + \epsilon) \\
&= (\beta_0 + \beta_1 x) + E(\epsilon) \\
&= \beta_0 + \beta_1 x
\end{aligned} E ( Y ∣ x ) = E ( β 0 + β 1 x + ϵ ) = ( β 0 + β 1 x ) + E ( ϵ ) = β 0 + β 1 x
and variance
V ( Y ∣ x ) = V ( β 0 + β 1 x + ϵ ) = V ( ϵ ) = σ 2 \begin{aligned}
V(Y \mid x) &= V(\beta_0 + \beta_1 x + \epsilon) \\
&= V(\epsilon) = \sigma^2
\end{aligned} V ( Y ∣ x ) = V ( β 0 + β 1 x + ϵ ) = V ( ϵ ) = σ 2
Note the variance of Y Y Y is the same for every x x x — a property called
homogeneity of variance, or
homoscedasticity .2
Let’s generate some data to work with. As throughout this series, we work from
scratch in plain Python with NumPy and SciPy:
import numpy as np
np . random . seed ( 51 )
real_beta_0 , real_beta_1 , real_sigma2 = 2 , 3 , 1
n = 100
x = np . linspace ( 0 , 5 , n )
y = real_beta_0 + real_beta_1 * x + real_sigma2 * np . random . randn ( n )
The generated data: 100 points following y = 2 + 3x plus normal noise.
Parameter estimation
Given a sample { ( x 1 , y 1 ) , … , ( x n , y n ) } \{(x_1, y_1), \dots, (x_n, y_n)\} {( x 1 , y 1 ) , … , ( x n , y n )} , we estimate β 0 \beta_0 β 0 and
β 1 \beta_1 β 1 by least squares. The error metric is the
residual sum of squares :
R S S ( β 0 , β 1 ) = ∑ i = 1 n ( y i − β 0 − β 1 x i ) 2 RSS(\beta_0, \beta_1) = \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2 R S S ( β 0 , β 1 ) = i = 1 ∑ n ( y i − β 0 − β 1 x i ) 2
and the estimates minimize it:
β ^ 0 , β ^ 1 = arg min β 0 , β 1 ∑ i = 1 n ( y i − β 0 − β 1 x i ) 2 \hat{\beta}_0, \hat{\beta}_1 = \operatorname*{arg\,min}_{\beta_0, \beta_1} \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2 β ^ 0 , β ^ 1 = β 0 , β 1 arg min i = 1 ∑ n ( y i − β 0 − β 1 x i ) 2
Differentiating with respect to each parameter and setting the result to zero —
first for β 0 \beta_0 β 0 ,
∂ R S S ∂ β 0 = − 2 ∑ i = 1 n ( y i − β ^ 0 − β ^ 1 x i ) = 0 ∑ i = 1 n y i − n β ^ 0 − β ^ 1 ∑ i = 1 n x i = 0 n β ^ 0 + β ^ 1 n x ˉ = n y ˉ \begin{aligned}
\frac{\partial RSS}{\partial \beta_0} = -2\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) &= 0 \\
\sum_{i=1}^{n} y_i - n\hat{\beta}_0 - \hat{\beta}_1 \sum_{i=1}^{n} x_i &= 0 \\
n\hat{\beta}_0 + \hat{\beta}_1 n\bar{x} &= n\bar{y}
\end{aligned} ∂ β 0 ∂ R S S = − 2 i = 1 ∑ n ( y i − β ^ 0 − β ^ 1 x i ) i = 1 ∑ n y i − n β ^ 0 − β ^ 1 i = 1 ∑ n x i n β ^ 0 + β ^ 1 n x ˉ = 0 = 0 = n y ˉ
and then for β 1 \beta_1 β 1 ,
∂ R S S ∂ β 1 = − 2 ∑ i = 1 n ( y i − β ^ 0 − β ^ 1 x i ) x i = 0 β ^ 0 n x ˉ + β ^ 1 ∑ i = 1 n x i 2 = ∑ i = 1 n y i x i \begin{aligned}
\frac{\partial RSS}{\partial \beta_1} = -2\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)x_i &= 0 \\
\hat{\beta}_0 n\bar{x} + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 &= \sum_{i=1}^{n} y_i x_i
\end{aligned} ∂ β 1 ∂ R S S = − 2 i = 1 ∑ n ( y i − β ^ 0 − β ^ 1 x i ) x i β ^ 0 n x ˉ + β ^ 1 i = 1 ∑ n x i 2 = 0 = i = 1 ∑ n y i x i
where x ˉ = 1 n ∑ i x i \bar{x} = \frac{1}{n}\sum_i x_i x ˉ = n 1 ∑ i x i and y ˉ = 1 n ∑ i y i \bar{y} = \frac{1}{n}\sum_i y_i y ˉ = n 1 ∑ i y i .
These two normal equations form a linear system,
n β ^ 0 + n x ˉ β ^ 1 = n y ˉ n x ˉ β ^ 0 + ( ∑ i = 1 n x i 2 ) β ^ 1 = ∑ i = 1 n y i x i \begin{aligned}
n\hat{\beta}_0 + n\bar{x}\,\hat{\beta}_1 &= n\bar{y} \\
n\bar{x}\,\hat{\beta}_0 + \left(\textstyle\sum_{i=1}^{n} x_i^2\right)\hat{\beta}_1 &= \sum_{i=1}^{n} y_i x_i
\end{aligned} n β ^ 0 + n x ˉ β ^ 1 n x ˉ β ^ 0 + ( ∑ i = 1 n x i 2 ) β ^ 1 = n y ˉ = i = 1 ∑ n y i x i
whose solution is
β ^ 0 = y ˉ − β ^ 1 x ˉ β ^ 1 = ∑ i = 1 n y i ( x i − x ˉ ) ∑ i = 1 n ( x i − x ˉ ) 2 \begin{aligned}
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \\
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
\end{aligned} β ^ 0 β ^ 1 = y ˉ − β ^ 1 x ˉ = ∑ i = 1 n ( x i − x ˉ ) 2 ∑ i = 1 n y i ( x i − x ˉ )
Plugging in our data:
beta_1 = np . sum ( y * ( x - np . mean ( x ))) / np . sum (( x - np . mean ( x )) ** 2 )
beta_0 = np . mean ( y ) - beta_1 * np . mean ( x )
np . round ([ beta_0 , beta_1 ], 3 )
# → array([1.925, 3.028])
Close to the true ( 2 , 3 ) (2, 3) ( 2 , 3 ) . We could also reach the same place with a different
algorithm — gradient descent ,4
say — or hand the objective to SciPy’s optimizer:
from scipy . optimize import minimize
def rss ( beta , x , y ):
return np . sum (( y - beta [ 0 ] - beta [ 1 ] * x ) ** 2 )
beta_0 , beta_1 = minimize ( rss , [ 0 , 0 ], args =( x , y )). x
np . round ([ beta_0 , beta_1 ], 3 )
# → array([1.925, 3.028])
The least-squares fit, ŷ = 1.925 + 3.028x.
Properties of the estimators
Since β ^ 0 \hat{\beta}_0 β ^ 0 and β ^ 1 \hat{\beta}_1 β ^ 1 are functions of the random sample,
they are themselves random variables, with their own expected values and
variances.5 6
Slope β ^ 1 \hat{\beta}_1 β ^ 1
Writing β ^ 1 = ∑ i c i Y i \hat{\beta}_1 = \sum_i c_i Y_i β ^ 1 = ∑ i c i Y i with
c i = x i − x ˉ ∑ j ( x j − x ˉ ) 2 c_i = \frac{x_i - \bar{x}}{\sum_j (x_j - \bar{x})^2} c i = ∑ j ( x j − x ˉ ) 2 x i − x ˉ ,
E ( β ^ 1 ) = ∑ i = 1 n c i E ( Y i ) = ∑ i = 1 n c i ( β 0 + β 1 x i ) = β 0 ∑ i = 1 n c i + β 1 ∑ i = 1 n c i x i \begin{aligned}
E(\hat{\beta}_1) &= \sum_{i=1}^{n} c_i\, E(Y_i) = \sum_{i=1}^{n} c_i (\beta_0 + \beta_1 x_i) \\
&= \beta_0 \sum_{i=1}^{n} c_i + \beta_1 \sum_{i=1}^{n} c_i x_i
\end{aligned} E ( β ^ 1 ) = i = 1 ∑ n c i E ( Y i ) = i = 1 ∑ n c i ( β 0 + β 1 x i ) = β 0 i = 1 ∑ n c i + β 1 i = 1 ∑ n c i x i
Now ∑ i c i = 0 \sum_i c_i = 0 ∑ i c i = 0 (the numerator ∑ i ( x i − x ˉ ) \sum_i (x_i - \bar{x}) ∑ i ( x i − x ˉ ) vanishes), and
∑ i = 1 n c i x i = ∑ i = 1 n x i ( x i − x ˉ ) ∑ i = 1 n ( x i − x ˉ ) 2 = ∑ i = 1 n x i 2 − n x ˉ 2 ∑ i = 1 n x i 2 − n x ˉ 2 = 1 \begin{aligned}
\sum_{i=1}^{n} c_i x_i &= \frac{\sum_{i=1}^{n} x_i(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = 1
\end{aligned} i = 1 ∑ n c i x i = ∑ i = 1 n ( x i − x ˉ ) 2 ∑ i = 1 n x i ( x i − x ˉ ) = ∑ i = 1 n x i 2 − n x ˉ 2 ∑ i = 1 n x i 2 − n x ˉ 2 = 1
so E ( β ^ 1 ) = β 1 E(\hat{\beta}_1) = \beta_1 E ( β ^ 1 ) = β 1 — the estimator is unbiased. Its variance, using
V ( Y i ) = σ 2 V(Y_i) = \sigma^2 V ( Y i ) = σ 2 , is
V ( β ^ 1 ) = ∑ i = 1 n c i 2 V ( Y i ) = σ 2 ∑ i = 1 n ( x i − x ˉ ) 2 V(\hat{\beta}_1) = \sum_{i=1}^{n} c_i^2\, V(Y_i) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} V ( β ^ 1 ) = i = 1 ∑ n c i 2 V ( Y i ) = ∑ i = 1 n ( x i − x ˉ ) 2 σ 2
In practice we rarely know σ 2 \sigma^2 σ 2 , so we substitute the estimator
σ ^ 2 = ∑ i ( y i − y ^ ) 2 n − 2 \hat{\sigma}^2 = \frac{\sum_i (y_i - \hat{y})^2}{n - 2} σ ^ 2 = n − 2 ∑ i ( y i − y ^ ) 2 . Computing the variance
for our data:
b1_var = real_sigma2 / np . sum (( x - np . mean ( x )) ** 2 )
np . round ( b1_var , 5 )
# → 0.0047
To check it, we can estimate β 1 \beta_1 β 1 across a hundred fresh samples and look at
the spread:
np . random . seed ( 51 )
y_100 = real_beta_0 + real_beta_1 * x + real_sigma2 * np . random . randn ( 100 , n )
b1 = [ minimize ( rss , [ 0 , 0 ], args =( x , yi )). x [ 1 ] for yi in y_100 ]
The hundred slope estimates centre on the true slope β₁ = 3 with variance ≈ 0.0047, matching theory.
Intercept β ^ 0 \hat{\beta}_0 β ^ 0
The same approach applies to the intercept. From
β ^ 0 = y ˉ − β ^ 1 x ˉ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} β ^ 0 = y ˉ − β ^ 1 x ˉ ,
E ( β ^ 0 ) = E ( y ˉ ) − x ˉ E ( β ^ 1 ) = ( β 0 + β 1 x ˉ ) − β 1 x ˉ = β 0 \begin{aligned}
E(\hat{\beta}_0) &= E(\bar{y}) - \bar{x}\, E(\hat{\beta}_1) \\
&= (\beta_0 + \beta_1 \bar{x}) - \beta_1 \bar{x} = \beta_0
\end{aligned} E ( β ^ 0 ) = E ( y ˉ ) − x ˉ E ( β ^ 1 ) = ( β 0 + β 1 x ˉ ) − β 1 x ˉ = β 0
and, since C o v ( Y ˉ , β ^ 1 ) = 0 \mathrm{Cov}(\bar{Y}, \hat{\beta}_1) = 0 Cov ( Y ˉ , β ^ 1 ) = 0 ,
V ( β ^ 0 ) = V ( Y ˉ ) + x ˉ 2 V ( β ^ 1 ) = σ 2 ( 1 n + x ˉ 2 ∑ i = 1 n ( x i − x ˉ ) 2 ) \begin{aligned}
V(\hat{\beta}_0) &= V(\bar{Y}) + \bar{x}^2 V(\hat{\beta}_1) \\
&= \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)
\end{aligned} V ( β ^ 0 ) = V ( Y ˉ ) + x ˉ 2 V ( β ^ 1 ) = σ 2 ( n 1 + ∑ i = 1 n ( x i − x ˉ ) 2 x ˉ 2 )
b0_var = ( 1 / n + np . mean ( x ) ** 2 / np . sum (( x - np . mean ( x )) ** 2 )) * real_sigma2
np . round ( b0_var , 5 )
# → 0.03941
The hundred intercept estimates centre on the true intercept β₀ = 2 with variance ≈ 0.039.
A confidence band for the mean response
The fitted mean at a point x 0 x_0 x 0 , μ ^ Y ∣ x 0 = β ^ 0 + β ^ 1 x 0 \hat{\mu}_{Y|x_0} = \hat{\beta}_0 +
\hat{\beta}_1 x_0 μ ^ Y ∣ x 0 = β ^ 0 + β ^ 1 x 0 , is also a random variable. Its expected value is
E ( μ ^ Y ∣ x 0 ) = E ( β ^ 0 ) + E ( β ^ 1 ) x 0 = β 0 + β 1 x 0 \begin{aligned}
E(\hat{\mu}_{Y|x_0}) &= E(\hat{\beta}_0) + E(\hat{\beta}_1) x_0 = \beta_0 + \beta_1 x_0
\end{aligned} E ( μ ^ Y ∣ x 0 ) = E ( β ^ 0 ) + E ( β ^ 1 ) x 0 = β 0 + β 1 x 0
and, writing it as y ˉ + β ^ 1 ( x 0 − x ˉ ) \bar{y} + \hat{\beta}_1(x_0 - \bar{x}) y ˉ + β ^ 1 ( x 0 − x ˉ ) and using
C o v ( y ˉ , β ^ 1 ) = 0 \mathrm{Cov}(\bar{y}, \hat{\beta}_1) = 0 Cov ( y ˉ , β ^ 1 ) = 0 , its variance is
V ( μ ^ Y ∣ x 0 ) = σ 2 ( 1 n + ( x 0 − x ˉ ) 2 ∑ i = 1 n ( x i − x ˉ ) 2 ) V(\hat{\mu}_{Y|x_0}) = \sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right) V ( μ ^ Y ∣ x 0 ) = σ 2 ( n 1 + ∑ i = 1 n ( x i − x ˉ ) 2 ( x 0 − x ˉ ) 2 )
Turning that into a band around the fit:
y_hat = beta_0 + beta_1 * x
y_hat_var = ( np . sum (( y - y_hat ) ** 2 ) / ( n - 2 )) * \
( 1 / n + ( x - np . mean ( x )) ** 2 / np . sum (( x - np . mean ( x )) ** 2 ))
y_hat_upper = y_hat + 1.96 * np . sqrt ( y_hat_var )
y_hat_lower = y_hat - 1.96 * np . sqrt ( y_hat_var )
The fit with a 95% confidence band for the mean response. The band is narrowest near the mean of x and flares at the edges.
The band is not parallel to the line — and the variance above tells us why: it
grows with ( x 0 − x ˉ ) 2 (x_0 - \bar{x})^2 ( x 0 − x ˉ ) 2 , so our estimate of the mean is most precise near
the centre of the data and least precise out at the extremes.
Coefficient of determination
The coefficient of determination, R 2 R^2 R 2 , is the share of the target’s variance
explained by the feature. It lies between 0 and 1:
R 2 = 1 − ∑ i = 1 n ( y i − y ^ ) 2 ∑ i = 1 n ( y i − y ˉ ) 2 R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} R 2 = 1 − ∑ i = 1 n ( y i − y ˉ ) 2 ∑ i = 1 n ( y i − y ^ ) 2
r2 = 1 - np . sum (( y - y_hat ) ** 2 ) / np . sum (( y - np . mean ( y )) ** 2 )
np . round ( r2 , 2 )
# → 0.95
So the model explains about 95% of the variance in y y y — unsurprising, since we
built the data from a line.
Interpretation
The coefficients read directly: a one-unit change in the predictor shifts the
mean target by β 1 \beta_1 β 1 , and when the predictor is 0 the mean target is
β 0 \beta_0 β 0 .
Conclusion
Linear regression is the standard first candidate for a regression problem, and a
good choice when you want to understand how each predictor affects the target.
The natural next step is more than one feature — see
multiple linear regression , the next part
of this series.
Originally published on GitHub on March 16, 2020. Adapted and
reformatted for mineti.dev — the
original version
remains available.