The Feynman experience · Part 4

Simple linear regression

evergreen · Mar 16, 2020

Introduction

We want to model a linear relationship between a feature $x$ and a scalar target $y$ . For a friendly visual intuition before the math, Luis Serrano’s explanation is excellent.¹

The model

We model the target as

Y = \beta_0 + \beta_1 x + \epsilon

where $\epsilon \sim \text{Normal}(0, \sigma^2)$ . Because $Y$ is defined through $\epsilon$ , it is itself a random variable, with expected value

\begin{aligned} E(Y \mid x) &= E(\beta_0 + \beta_1 x + \epsilon) \\ &= (\beta_0 + \beta_1 x) + E(\epsilon) \\ &= \beta_0 + \beta_1 x \end{aligned}

and variance

\begin{aligned} V(Y \mid x) &= V(\beta_0 + \beta_1 x + \epsilon) \\ &= V(\epsilon) = \sigma^2 \end{aligned}

Note the variance of $Y$ is the same for every $x$ — a property called homogeneity of variance, or homoscedasticity.²

Let’s generate some data to work with. As throughout this series, we work from scratch in plain Python with NumPy and SciPy:

import numpy as np

np.random.seed(51)

real_beta_0, real_beta_1, real_sigma2 = 2, 3, 1
n = 100

x = np.linspace(0, 5, n)
y = real_beta_0 + real_beta_1 * x + real_sigma2 * np.random.randn(n)

Scatter plot of 100 generated points with a clear upward linear trend. — The generated data: 100 points following y = 2 + 3x plus normal noise.

Parameter estimation

Given a sample $\{(x_1, y_1), \dots, (x_n, y_n)\}$ , we estimate $\beta_0$ and $\beta_1$ by least squares. The error metric is the residual sum of squares:³

RSS(\beta_0, \beta_1) = \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2

and the estimates minimize it:

\hat{\beta}_0, \hat{\beta}_1 = \operatorname*{arg\,min}_{\beta_0, \beta_1} \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2

Differentiating with respect to each parameter and setting the result to zero — first for $\beta_0$ ,

\begin{aligned} \frac{\partial RSS}{\partial \beta_0} = -2\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) &= 0 \\ \sum_{i=1}^{n} y_i - n\hat{\beta}_0 - \hat{\beta}_1 \sum_{i=1}^{n} x_i &= 0 \\ n\hat{\beta}_0 + \hat{\beta}_1 n\bar{x} &= n\bar{y} \end{aligned}

and then for $\beta_1$ ,

\begin{aligned} \frac{\partial RSS}{\partial \beta_1} = -2\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)x_i &= 0 \\ \hat{\beta}_0 n\bar{x} + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 &= \sum_{i=1}^{n} y_i x_i \end{aligned}

where $\bar{x} = \frac{1}{n}\sum_i x_i$ and $\bar{y} = \frac{1}{n}\sum_i y_i$ . These two normal equations form a linear system,

\begin{aligned} n\hat{\beta}_0 + n\bar{x}\,\hat{\beta}_1 &= n\bar{y} \\ n\bar{x}\,\hat{\beta}_0 + \left(\textstyle\sum_{i=1}^{n} x_i^2\right)\hat{\beta}_1 &= \sum_{i=1}^{n} y_i x_i \end{aligned}

whose solution is

\begin{aligned} \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \\ \hat{\beta}_1 &= \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \end{aligned}

Plugging in our data:

beta_1 = np.sum(y * (x - np.mean(x))) / np.sum((x - np.mean(x)) ** 2)
beta_0 = np.mean(y) - beta_1 * np.mean(x)

np.round([beta_0, beta_1], 3)
# → array([1.925, 3.028])

Close to the true $(2, 3)$ . We could also reach the same place with a different algorithm — gradient descent,⁴ say — or hand the objective to SciPy’s optimizer:

from scipy.optimize import minimize

def rss(beta, x, y):
    return np.sum((y - beta[0] - beta[1] * x) ** 2)

beta_0, beta_1 = minimize(rss, [0, 0], args=(x, y)).x

np.round([beta_0, beta_1], 3)
# → array([1.925, 3.028])

The generated points with the fitted regression line running through them. — The least-squares fit, ŷ = 1.925 + 3.028x.

Properties of the estimators

Since $\hat{\beta}_0$ and $\hat{\beta}_1$ are functions of the random sample, they are themselves random variables, with their own expected values and variances.⁵⁶

Slope $\hat{\beta}_1$

Writing $\hat{\beta}_1 = \sum_i c_i Y_i$ with $c_i = \frac{x_i - \bar{x}}{\sum_j (x_j - \bar{x})^2}$ ,

\begin{aligned} E(\hat{\beta}_1) &= \sum_{i=1}^{n} c_i\, E(Y_i) = \sum_{i=1}^{n} c_i (\beta_0 + \beta_1 x_i) \\ &= \beta_0 \sum_{i=1}^{n} c_i + \beta_1 \sum_{i=1}^{n} c_i x_i \end{aligned}

Now $\sum_i c_i = 0$ (the numerator $\sum_i (x_i - \bar{x})$ vanishes), and

\begin{aligned} \sum_{i=1}^{n} c_i x_i &= \frac{\sum_{i=1}^{n} x_i(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = 1 \end{aligned}

so $E(\hat{\beta}_1) = \beta_1$ — the estimator is unbiased. Its variance, using $V(Y_i) = \sigma^2$ , is

V(\hat{\beta}_1) = \sum_{i=1}^{n} c_i^2\, V(Y_i) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}

In practice we rarely know $\sigma^2$ , so we substitute the estimator $\hat{\sigma}^2 = \frac{\sum_i (y_i - \hat{y})^2}{n - 2}$ . Computing the variance for our data:

b1_var = real_sigma2 / np.sum((x - np.mean(x)) ** 2)
np.round(b1_var, 5)
# → 0.0047

To check it, we can estimate $\beta_1$ across a hundred fresh samples and look at the spread:

np.random.seed(51)
y_100 = real_beta_0 + real_beta_1 * x + real_sigma2 * np.random.randn(100, n)

b1 = [minimize(rss, [0, 0], args=(x, yi)).x[1] for yi in y_100]

Histogram of 100 slope estimates centered on 3, matching the expected normal sampling distribution. — The hundred slope estimates centre on the true slope β₁ = 3 with variance ≈ 0.0047, matching theory.

Intercept $\hat{\beta}_0$

The same approach applies to the intercept. From $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ ,

\begin{aligned} E(\hat{\beta}_0) &= E(\bar{y}) - \bar{x}\, E(\hat{\beta}_1) \\ &= (\beta_0 + \beta_1 \bar{x}) - \beta_1 \bar{x} = \beta_0 \end{aligned}

and, since $\mathrm{Cov}(\bar{Y}, \hat{\beta}_1) = 0$ ,

\begin{aligned} V(\hat{\beta}_0) &= V(\bar{Y}) + \bar{x}^2 V(\hat{\beta}_1) \\ &= \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right) \end{aligned}

b0_var = (1 / n + np.mean(x) ** 2 / np.sum((x - np.mean(x)) ** 2)) * real_sigma2
np.round(b0_var, 5)
# → 0.03941

Histogram of 100 intercept estimates centered on 2, matching the expected normal sampling distribution. — The hundred intercept estimates centre on the true intercept β₀ = 2 with variance ≈ 0.039.

A confidence band for the mean response

The fitted mean at a point $x_0$ , $\hat{\mu}_{Y|x_0} = \hat{\beta}_0 + \hat{\beta}_1 x_0$ , is also a random variable. Its expected value is

\begin{aligned} E(\hat{\mu}_{Y|x_0}) &= E(\hat{\beta}_0) + E(\hat{\beta}_1) x_0 = \beta_0 + \beta_1 x_0 \end{aligned}

and, writing it as $\bar{y} + \hat{\beta}_1(x_0 - \bar{x})$ and using $\mathrm{Cov}(\bar{y}, \hat{\beta}_1) = 0$ , its variance is

V(\hat{\mu}_{Y|x_0}) = \sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)

Turning that into a band around the fit:

y_hat = beta_0 + beta_1 * x

y_hat_var = (np.sum((y - y_hat) ** 2) / (n - 2)) * \
    (1 / n + (x - np.mean(x)) ** 2 / np.sum((x - np.mean(x)) ** 2))

y_hat_upper = y_hat + 1.96 * np.sqrt(y_hat_var)
y_hat_lower = y_hat - 1.96 * np.sqrt(y_hat_var)

The fitted line with a dashed 95% confidence band that is narrowest near the centre of the data and widens toward the edges. — The fit with a 95% confidence band for the mean response. The band is narrowest near the mean of x and flares at the edges.

The band is not parallel to the line — and the variance above tells us why: it grows with $(x_0 - \bar{x})^2$ , so our estimate of the mean is most precise near the centre of the data and least precise out at the extremes.

Coefficient of determination

The coefficient of determination, $R^2$ , is the share of the target’s variance explained by the feature. It lies between 0 and 1:

R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}

r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)
np.round(r2, 2)
# → 0.95

So the model explains about 95% of the variance in $y$ — unsurprising, since we built the data from a line.

Interpretation

The coefficients read directly: a one-unit change in the predictor shifts the mean target by $\beta_1$ , and when the predictor is 0 the mean target is $\beta_0$ .

Conclusion

Linear regression is the standard first candidate for a regression problem, and a good choice when you want to understand how each predictor affects the target. The natural next step is more than one feature — see multiple linear regression, the next part of this series.

Serrano, L. A Friendly Introduction to Linear Regression. https://www.youtube.com/watch?v=wYPUhge9w5c ↩
Homoscedasticity. Wikipedia. https://en.wikipedia.org/wiki/Homoscedasticity ↩
Residual sum of squares. Wikipedia. https://en.wikipedia.org/wiki/Residual_sum_of_squares ↩
Gradient descent. Wikipedia. https://en.wikipedia.org/wiki/Gradient_descent ↩
Montgomery, D. C., Peck, E. A. & Vining, G. G. Introduction to Linear Regression Analysis (5th ed.). (Wiley, 2012). ↩
Costa, M. A. Tópicos em Ciência dos Dados: Introdução aos Modelos Paramétricos. (Bonecker, 2019). ↩