The Feynman experience · Part 5

Multiple linear regression

evergreen · Mar 16, 2020

Introduction

This is the multi-feature sequel to simple linear regression. For a visual intuition, Luis Serrano’s explanation is a good start;¹ the Homemade Machine Learning² and Machine Learning Basics³ repositories are also worth a look.

Let’s generate some correlated three-dimensional data to model. As throughout this series, we work from scratch in plain Python with NumPy and SciPy:

import numpy as np

np.random.seed(51)

real_mean = [1, 3, 5]
real_cov = np.array([[4, 1, 2], [1, 9, -4], [2, -4, 21]])
n = 100

data = np.random.multivariate_normal(real_mean, real_cov, n)

The model

With a vector of $k$ features $\mathbf{x} = [x_1, x_2, \dots, x_k]$ , the model of $Y$ given $\mathbf{x}$ is

y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \epsilon_i

which is far more compact in matrix notation. Collecting the targets into a vector $\mathbf{Y}$ and the features into a design matrix $\mathbf{X}$ with a leading column of ones,

\mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{bmatrix}

the model becomes

\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}

For a sample of size $n$ with $k$ features we have $\mathbf{Y}_{n \times 1}$ , a matrix $\mathbf{X}_{n \times p}$ with $p = k + 1$ , the parameter vector $\boldsymbol{\beta}_{p \times 1}$ , and the error vector $\boldsymbol{\epsilon}_{n \times 1}$ , with $E(\boldsymbol{\epsilon}) = \mathbf{0}$ and $\mathrm{Cov}(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{I}$ .

Parameter estimation

The residual sum of squares in matrix form is

RSS(\boldsymbol{\beta}) = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})

and the least-squares estimator minimizes it:

\hat{\boldsymbol{\beta}} = \operatorname*{arg\,min}_{\boldsymbol{\beta}} (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})

Differentiating with respect to $\boldsymbol{\beta}$ and setting the result to zero gives a closed form:

\begin{aligned} \frac{\partial RSS}{\partial \boldsymbol{\beta}} &= -2\mathbf{X}^T(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = 0 \\ \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} &= \mathbf{X}^T\mathbf{Y} \\ \hat{\boldsymbol{\beta}} &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \end{aligned}

Translating that directly into NumPy — features in $\mathbf{X}$ (with a column of ones), the third variable as the target $\mathbf{Y}$ :

X = np.c_[np.ones(n), data[:, :2]]
Y = data[:, 2]

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y
np.round(beta_hat, 3)
# → array([ 6.278,  0.743, -0.607])

We can reach the same coefficients by minimizing the cost directly with SciPy:

from scipy.optimize import minimize

def rss(beta, X, Y):
    r = Y - X @ beta
    return r @ r

beta_optim = minimize(rss, [0, 0, 0], args=(X, Y)).x
np.round(beta_optim, 3)
# → array([ 6.278,  0.743, -0.607])

A 3D scatter of the data with the fitted regression plane passing through the cloud of points. — The data and the fitted regression plane through the cloud of points.

Properties of the estimators

Since $\hat{\boldsymbol{\beta}}$ is a function of the random sample, it is a vector of random variables. Its expected value is

\begin{aligned} E(\hat{\boldsymbol{\beta}}) &= E\!\left[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon})\right] \\ &= \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E(\boldsymbol{\epsilon}) = \boldsymbol{\beta} \end{aligned}

so the estimator is unbiased. For the covariance, using $\mathrm{Cov}(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{I}$ ,

\begin{aligned} \mathrm{Cov}(\hat{\boldsymbol{\beta}}) &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\, \mathrm{Cov}(\boldsymbol{\epsilon})\, \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} \end{aligned}

In practice we don’t know $\sigma^2$ , so we estimate it with $\hat{\sigma}^2 = \frac{\sum_i (y_i - \hat{y})^2}{n - p}$ . The diagonal of the covariance matrix holds the variances of each coefficient:

cov_hat = real_cov[2, 2] * np.linalg.inv(X.T @ X)
np.round(np.diag(cov_hat), 4)
# → array([0.5102, 0.0633, 0.0281])

To check these, we can recover the true coefficients implied by the population mean and covariance,⁴

real_betas = np.linalg.solve(real_cov[:2, :2], real_cov[:2, 2])
real_beta_0 = real_mean[2] - np.array(real_mean[:2]) @ real_betas
np.round([real_beta_0, *real_betas], 5)
# → array([ 5.91429,  0.62857, -0.51429])

and then estimate $\hat{\boldsymbol{\beta}}$ across a hundred fresh samples to see how the coefficients are distributed:

np.random.seed(51)
data_100 = np.random.multivariate_normal(real_mean, real_cov, (100, n))

beta_100 = np.array([
    np.linalg.inv(Xi.T @ Xi) @ Xi.T @ d[:, 2]
    for d in data_100
    for Xi in [np.c_[np.ones(n), d[:, :2]]]
])

Three histograms showing the sampling distributions of the three coefficient estimates, each matching its expected normal curve. — Each coefficient's estimates across a hundred samples (bars) centre on the true value with the expected normal sampling distribution (orange).

The coefficients read just as in the single-feature case: a one-unit change in feature $k$ shifts the mean target by $\beta_k$ , and when every feature is 0 the mean target is $\beta_0$ .

Regularization

When features are many or correlated, it helps to constrain the coefficients. Three standard techniques are LASSO, ridge, and elastic net; the Elements of Statistical Learning covers them in depth.⁵ The picture below shows the idea: we shrink the least-squares solution toward the origin by confining $\boldsymbol{\beta}$ to a region — a diamond for LASSO ( $L_1$ ), a circle for ridge ( $L_2$ ) — and the solution lands where an RSS contour first touches it.

The constrained solution (red) sits where an RSS contour first meets the constraint region. The diamond's corners make LASSO prone to setting coefficients exactly to zero; the smooth circle does not.

LASSO

LASSO — least absolute shrinkage and selection operator — adds an $L_1$ penalty to the objective:

\hat{\boldsymbol{\beta}}^{\text{lasso}} = \operatorname*{arg\,min}_{\boldsymbol{\beta}} \left\{(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) + \lambda\sum_{j=1}^{p} |\beta_j|\right\}

The parameter $\lambda \geq 0$ controls the shrinkage; at $\lambda = 0$ we recover ordinary least squares. As a cost function:

def rss_lasso(beta, X, Y, lbd):
    r = Y - X @ beta
    return r @ r + lbd * np.sum(np.abs(beta))

beta_lasso = minimize(rss_lasso, [0, 0, 0], args=(X, Y, 5)).x
np.round(beta_lasso, 3)
# → array([ 6.217,  0.743, -0.594])

Ridge

Ridge swaps the $L_1$ penalty for an $L_2$ one:

\hat{\boldsymbol{\beta}}^{\text{ridge}} = \operatorname*{arg\,min}_{\boldsymbol{\beta}} \left\{(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) + \lambda\sum_{j=1}^{p} \beta_j^2\right\}

def rss_ridge(beta, X, Y, lbd):
    r = Y - X @ beta
    return r @ r + lbd * np.sum(beta ** 2)

beta_ridge = minimize(rss_ridge, [0, 0, 0], args=(X, Y, 5)).x
np.round(beta_ridge, 3)
# → array([ 5.602,  0.824, -0.505])

Elastic net

Both are special cases of a penalty $\lambda\sum_j |\beta_j|^q$ . Choosing $q \in (1, 2)$ compromises between the two,³ but loses LASSO’s knack for driving coefficients exactly to zero. The elastic net keeps that feature selection while still shrinking correlated coefficients like ridge, by blending the two penalties:

\hat{\boldsymbol{\beta}}^{\text{elastic}} = \operatorname*{arg\,min}_{\boldsymbol{\beta}} \left\{(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) + \lambda\sum_{j=1}^{p} \left(\alpha\beta_j^2 + (1 - \alpha)|\beta_j|\right)\right\}

with $\alpha \in [0, 1]$ : at $\alpha = 0$ it is LASSO, at $\alpha = 1$ it is ridge.

def rss_elastic(beta, X, Y, lbd, alpha):
    r = Y - X @ beta
    return r @ r + lbd * np.sum(alpha * beta ** 2 + (1 - alpha) * np.abs(beta))

beta_elastic = minimize(rss_elastic, [0, 0, 0], args=(X, Y, 5, 0.5)).x
np.round(beta_elastic, 3)
# → array([ 5.891,  0.786, -0.547])

Conclusion

From a single feature to many, the same least-squares idea carries over — only now it is a tidy matrix expression — and regularization gives us a dial between fitting the data and keeping the coefficients in check. That rounds out the regression thread, and this series: a handful of foundational ideas in probability and statistics, each built from scratch with nothing but NumPy and SciPy.

Serrano, L. A Friendly Introduction to Linear Regression. https://www.youtube.com/watch?v=wYPUhge9w5c ↩
Trekhleb, O. Homemade Machine Learning. https://github.com/trekhleb/homemade-machine-learning ↩
Peters, A.-L. Machine Learning Basics. https://github.com/zotroneneis/machine_learning_basics ↩ ↩²
Using the covariance matrix to find coefficients for multiple regression. Cross Validated. https://stats.stackexchange.com/questions/107597 ↩
Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning (2nd ed.). (Springer, 2009). ↩