Ridge regression

Author

Prof. Maria Tackett

Published

Mar 19, 2026

Announcements

HW 03 due TODAY at 11:59pm
Project exploratory data analysis due TODAY at 11:59pm (on GitHub)
Project presentations - in lab March 27
Statistics experience due April 2

Submit questions by 5pm today. Form at the end of today’s notes.

AE 05

Topics

Ridge regression

Computing set up

# load packages
library(tidyverse)  
library(tidymodels)  
library(knitr)       
library(patchwork)
library(rms)

# set default theme in ggplot2
ggplot2::theme_set(ggplot2::theme_bw())

Data: Age of abalones

The data for this analysis contains measurements for abalones, a type of marine snail. These measurements were collected and analyzed by researchers in Warwick et al. (1994). Click here for the publication.

The 4177 abalones in this study can be reasonably treated as a random sample.

The goal of this analysis is to use characteristics of the abalones to predict their age.

Variables

Sex: Male (M), Female (F), Infant (I)
Length: Longest shell measurement (in millimeters)
Diameter: Measured perpendicular to length (in millimeters)
Height : Measured with meat in shell (in millimeters)
Whole_Weight: Total weight of abalone (in grams)
Age: Age (in years)

Fit a model

age_fit <- lm(Age ~ Sex + Length + Diameter + Height
              + Whole_Weight, data = abalones)

tidy(age_fit) |> kable(digits = 3)

term	estimate	std.error	statistic	p.value
(Intercept)	5.643	0.336	16.795	0.000
SexI	-1.084	0.119	-9.151	0.000
SexM	-0.145	0.097	-1.492	0.136
Length	-0.054	0.010	-5.136	0.000
Diameter	0.114	0.013	8.903	0.000
Height	0.094	0.009	10.590	0.000
Whole_Weight	-0.001	0.001	-0.642	0.521

Potential issue

vif(age_fit)

        SexI         SexM       Length     Diameter       Height Whole_Weight 
    1.950083     1.386901    39.867076    41.170644     3.487132     7.709595

What is the potential issue with this model?
How does this impact the model results?

Limitations of the least-squares estimator

When \((\mathbf{X}^\mathsf{T}\mathbf{X})\) is nearly singular, \(Var(\hat{\boldsymbol{\beta}})\) gets large, and thus we have unstable coefficient estimates . This occurs when
- There are collinear predictors
- \(p\) is large relative to \(n\)
Additionally, if \(p > n\) , then we cannot find \(\hat{\boldsymbol{\beta}}\)
We can use a different estimator that addresses these limitations

Finding the ridge estimator

We find the ridge estimator, \(\hat{\boldsymbol{\beta}}_{ridge}\) by minimizing

\[ \underbrace{(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\mathsf{T}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})}_{\text{SSR}} + \underbrace{\lambda \boldsymbol{\beta}^\mathsf{T}\boldsymbol{\beta}}_{\text{penalty}} \]

where \(\lambda \geq 0\) is the tuning parameter determined by the statistician (more on this later)

. . .

Like least-squares, ridge regression seeks to minimize the sum of squares
The penalty, \(\lambda\boldsymbol{\beta}^\mathsf{T}\boldsymbol{\beta} = \lambda\sum_{j = 1}^p\beta^2_j\) preferences smaller values of \(\beta_j\), so ridge will shrink the coefficients to 0 (this does not apply to the intercept)

Finding the ridge estimator

We find the ridge estimator, \(\hat{\boldsymbol{\beta}}_{ridge}\) by minimizing

What will happen to \(\hat{\boldsymbol{\beta}}_{ridge}\) as \(\lambda \rightarrow 0\)? What will happen as\(\lambda \rightarrow \infty\) ?

Finding the ridge estimator

The ridge optimization problem can also be written

\[\text{minimize }(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\mathsf{T}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \hspace{2mm} \text{subject to} \hspace{2mm} ||\boldsymbol{\beta}||^2_2 \leq c^2\] for some constant \(c\) . Note: \(||\boldsymbol{\beta}||_2 = \sqrt{\sum_{j=1}^p\beta_j^2}\)

. . .

Image source: Introduction to Statistical Learning

Red contours are combinations of \(\beta_1\) and \(\beta_2\) that result in same SSR
Blue circle is set of possible values of \(\beta_1\) and \(\beta_2\) given the constraint
Ridge solution at tangent

See Section 1.5 of Lecture notes on ridge regression for details.

Ridge estimator \(\hat{\boldsymbol{\beta}}_{ridge}\)

The ridge estimator is

\[ \hat{\boldsymbol{\beta}}_{ridge} = (\mathbf{X}^\mathsf{T}\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\mathsf{T}\mathbf{y} \]

(see Exam 01 for derivation)

We have addressed the singularity of \((\mathbf{X}^\mathsf{T}\mathbf{X})\) by adding the positive constant to the diagonal

What is \(E(\hat{\boldsymbol{\beta}}_{ridge})\)?
Is \(\hat{\boldsymbol{\beta}}_{ridge}\) unbiased?

Bias-variance trade-off

The penalty \(\lambda\boldsymbol{\beta}^\mathsf{T}\boldsymbol{\beta}\) makes the estimated coefficients in ridge regression less variable than least-squares
Therefore, by adding some bias, we are able to reduce variance. This is called the bias-variance trade-off
Reducing the variance in coefficients is particularly important when we wish to use a model for prediction on new data

See Section 1.4.2 of Lecture notes on ridge regression for details.

Fitting the model

We use standardized values of the predictors when fitting the ridge regression model
Why do we use standardized values? Let’s consider Length (in millimeters) in the abalone data
- Let \(\beta_j\) be the coefficient of Length. When finding the ridge estimates, the penalty for the coefficient of Length is \(\lambda \beta_j^2\) .
- Suppose instead of Length in millimeters, we use Length in meters. Then the new coefficient is \(1000\beta_j\). Thus, the penalty on the term would be \(\lambda 1000^2\beta_j^2\)
Standardizing the quantitative predictors ensures the penalty is fairly applied across all predictors (and not dependent on units)
Thus, shrinkage in the \(\beta_j\) reflects the predictor’s importance on the response