mineti.dev

The Feynman experience · Part 5

Multiple linear regression

evergreen · Mar 16, 2020

Introduction

This is the multi-feature sequel to simple linear regression. For a visual intuition, Luis Serrano’s explanation is a good start;1 the Homemade Machine Learning2 and Machine Learning Basics3 repositories are also worth a look.

Let’s generate some correlated three-dimensional data to model. As throughout this series, we work from scratch in plain Python with NumPy and SciPy:

import numpy as np

np.random.seed(51)

real_mean = [1, 3, 5]
real_cov = np.array([[4, 1, 2], [1, 9, -4], [2, -4, 21]])
n = 100

data = np.random.multivariate_normal(real_mean, real_cov, n)

The model

With a vector of kk features x=[x1,x2,,xk]\mathbf{x} = [x_1, x_2, \dots, x_k], the model of YY given x\mathbf{x} is

yi=β0+β1x1i+β2x2i++βkxki+ϵiy_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \epsilon_i

which is far more compact in matrix notation. Collecting the targets into a vector Y\mathbf{Y} and the features into a design matrix X\mathbf{X} with a leading column of ones,

X=[1x11x1k1xn1xnk]\mathbf{X} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{bmatrix}

the model becomes

Y=Xβ+ϵ\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}

For a sample of size nn with kk features we have Yn×1\mathbf{Y}_{n \times 1}, a matrix Xn×p\mathbf{X}_{n \times p} with p=k+1p = k + 1, the parameter vector βp×1\boldsymbol{\beta}_{p \times 1}, and the error vector ϵn×1\boldsymbol{\epsilon}_{n \times 1}, with E(ϵ)=0E(\boldsymbol{\epsilon}) = \mathbf{0} and Cov(ϵ)=σ2I\mathrm{Cov}(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{I}.

Parameter estimation

The residual sum of squares in matrix form is

RSS(β)=(YXβ)T(YXβ)RSS(\boldsymbol{\beta}) = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})

and the least-squares estimator minimizes it:

β^=arg minβ(YXβ)T(YXβ)\hat{\boldsymbol{\beta}} = \operatorname*{arg\,min}_{\boldsymbol{\beta}} (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})

Differentiating with respect to β\boldsymbol{\beta} and setting the result to zero gives a closed form:

RSSβ=2XT(YXβ^)=0XTXβ^=XTYβ^=(XTX)1XTY\begin{aligned} \frac{\partial RSS}{\partial \boldsymbol{\beta}} &= -2\mathbf{X}^T(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = 0 \\ \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} &= \mathbf{X}^T\mathbf{Y} \\ \hat{\boldsymbol{\beta}} &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \end{aligned}

Translating that directly into NumPy — features in X\mathbf{X} (with a column of ones), the third variable as the target Y\mathbf{Y}:

X = np.c_[np.ones(n), data[:, :2]]
Y = data[:, 2]

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y
np.round(beta_hat, 3)
# → array([ 6.278,  0.743, -0.607])

We can reach the same coefficients by minimizing the cost directly with SciPy:

from scipy.optimize import minimize

def rss(beta, X, Y):
    r = Y - X @ beta
    return r @ r

beta_optim = minimize(rss, [0, 0, 0], args=(X, Y)).x
np.round(beta_optim, 3)
# → array([ 6.278,  0.743, -0.607])
A 3D scatter of the data with the fitted regression plane passing through the cloud of points.
The data and the fitted regression plane through the cloud of points.

Properties of the estimators

Since β^\hat{\boldsymbol{\beta}} is a function of the random sample, it is a vector of random variables. Its expected value is

E(β^)=E ⁣[(XTX)1XT(Xβ+ϵ)]=β+(XTX)1XTE(ϵ)=β\begin{aligned} E(\hat{\boldsymbol{\beta}}) &= E\!\left[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon})\right] \\ &= \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E(\boldsymbol{\epsilon}) = \boldsymbol{\beta} \end{aligned}

so the estimator is unbiased. For the covariance, using Cov(ϵ)=σ2I\mathrm{Cov}(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{I},

Cov(β^)=(XTX)1XTCov(ϵ)X(XTX)1=σ2(XTX)1\begin{aligned} \mathrm{Cov}(\hat{\boldsymbol{\beta}}) &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\, \mathrm{Cov}(\boldsymbol{\epsilon})\, \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} \end{aligned}

In practice we don’t know σ2\sigma^2, so we estimate it with σ^2=i(yiy^)2np\hat{\sigma}^2 = \frac{\sum_i (y_i - \hat{y})^2}{n - p}. The diagonal of the covariance matrix holds the variances of each coefficient:

cov_hat = real_cov[2, 2] * np.linalg.inv(X.T @ X)
np.round(np.diag(cov_hat), 4)
# → array([0.5102, 0.0633, 0.0281])

To check these, we can recover the true coefficients implied by the population mean and covariance,4

real_betas = np.linalg.solve(real_cov[:2, :2], real_cov[:2, 2])
real_beta_0 = real_mean[2] - np.array(real_mean[:2]) @ real_betas
np.round([real_beta_0, *real_betas], 5)
# → array([ 5.91429,  0.62857, -0.51429])

and then estimate β^\hat{\boldsymbol{\beta}} across a hundred fresh samples to see how the coefficients are distributed:

np.random.seed(51)
data_100 = np.random.multivariate_normal(real_mean, real_cov, (100, n))

beta_100 = np.array([
    np.linalg.inv(Xi.T @ Xi) @ Xi.T @ d[:, 2]
    for d in data_100
    for Xi in [np.c_[np.ones(n), d[:, :2]]]
])
Three histograms showing the sampling distributions of the three coefficient estimates, each matching its expected normal curve.
Each coefficient's estimates across a hundred samples (bars) centre on the true value with the expected normal sampling distribution (orange).

The coefficients read just as in the single-feature case: a one-unit change in feature kk shifts the mean target by βk\beta_k, and when every feature is 0 the mean target is β0\beta_0.

Regularization

When features are many or correlated, it helps to constrain the coefficients. Three standard techniques are LASSO, ridge, and elastic net; the Elements of Statistical Learning covers them in depth.5 The picture below shows the idea: we shrink the least-squares solution toward the origin by confining β\boldsymbol{\beta} to a region — a diamond for LASSO (L1L_1), a circle for ridge (L2L_2) — and the solution lands where an RSS contour first touches it.

Two panels comparing the LASSO diamond constraint and the ridge circular constraint, with RSS contours touching each at the constrained solution.
The constrained solution (red) sits where an RSS contour first meets the constraint region. The diamond's corners make LASSO prone to setting coefficients exactly to zero; the smooth circle does not.

LASSO

LASSO — least absolute shrinkage and selection operator — adds an L1L_1 penalty to the objective:

β^lasso=arg minβ{(YXβ)T(YXβ)+λj=1pβj}\hat{\boldsymbol{\beta}}^{\text{lasso}} = \operatorname*{arg\,min}_{\boldsymbol{\beta}} \left\{(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) + \lambda\sum_{j=1}^{p} |\beta_j|\right\}

The parameter λ0\lambda \geq 0 controls the shrinkage; at λ=0\lambda = 0 we recover ordinary least squares. As a cost function:

def rss_lasso(beta, X, Y, lbd):
    r = Y - X @ beta
    return r @ r + lbd * np.sum(np.abs(beta))

beta_lasso = minimize(rss_lasso, [0, 0, 0], args=(X, Y, 5)).x
np.round(beta_lasso, 3)
# → array([ 6.217,  0.743, -0.594])

Ridge

Ridge swaps the L1L_1 penalty for an L2L_2 one:

β^ridge=arg minβ{(YXβ)T(YXβ)+λj=1pβj2}\hat{\boldsymbol{\beta}}^{\text{ridge}} = \operatorname*{arg\,min}_{\boldsymbol{\beta}} \left\{(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) + \lambda\sum_{j=1}^{p} \beta_j^2\right\}
def rss_ridge(beta, X, Y, lbd):
    r = Y - X @ beta
    return r @ r + lbd * np.sum(beta ** 2)

beta_ridge = minimize(rss_ridge, [0, 0, 0], args=(X, Y, 5)).x
np.round(beta_ridge, 3)
# → array([ 5.602,  0.824, -0.505])

Elastic net

Both are special cases of a penalty λjβjq\lambda\sum_j |\beta_j|^q. Choosing q(1,2)q \in (1, 2) compromises between the two,3 but loses LASSO’s knack for driving coefficients exactly to zero. The elastic net keeps that feature selection while still shrinking correlated coefficients like ridge, by blending the two penalties:

β^elastic=arg minβ{(YXβ)T(YXβ)+λj=1p(αβj2+(1α)βj)}\hat{\boldsymbol{\beta}}^{\text{elastic}} = \operatorname*{arg\,min}_{\boldsymbol{\beta}} \left\{(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) + \lambda\sum_{j=1}^{p} \left(\alpha\beta_j^2 + (1 - \alpha)|\beta_j|\right)\right\}

with α[0,1]\alpha \in [0, 1]: at α=0\alpha = 0 it is LASSO, at α=1\alpha = 1 it is ridge.

def rss_elastic(beta, X, Y, lbd, alpha):
    r = Y - X @ beta
    return r @ r + lbd * np.sum(alpha * beta ** 2 + (1 - alpha) * np.abs(beta))

beta_elastic = minimize(rss_elastic, [0, 0, 0], args=(X, Y, 5, 0.5)).x
np.round(beta_elastic, 3)
# → array([ 5.891,  0.786, -0.547])

Conclusion

From a single feature to many, the same least-squares idea carries over — only now it is a tidy matrix expression — and regularization gives us a dial between fitting the data and keeping the coefficients in check. That rounds out the regression thread, and this series: a handful of foundational ideas in probability and statistics, each built from scratch with nothing but NumPy and SciPy.

References

  1. Serrano, L. A Friendly Introduction to Linear Regression. https://www.youtube.com/watch?v=wYPUhge9w5c

  2. Trekhleb, O. Homemade Machine Learning. https://github.com/trekhleb/homemade-machine-learning

  3. Peters, A.-L. Machine Learning Basics. https://github.com/zotroneneis/machine_learning_basics 2

  4. Using the covariance matrix to find coefficients for multiple regression. Cross Validated. https://stats.stackexchange.com/questions/107597

  5. Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning (2nd ed.). (Springer, 2009).

← All articles