SLR: Matrix representation
Announcements
Lab 01 due TODAY at 11:59pm
- Push work to GitHub repo
- Submit final PDF on Gradescope + mark pages for each question
HW 01 released today - due Thursday, January 29 at 11:59pm
- Note: AI Disclosure
Topics
- Matrix representation of least-squares regression
- Model form
- Least-squares estimate
- Predicted (fitted) values
- Residuals
- Geometry of least-squares regression
Matrix representation of least-squares regression
SLR: Statistical model (population)
When we have a quantitative response, \(Y\), and a single quantitative predictor, \(X\), we can use a simple linear regression model to describe the relationship between \(Y\) and \(X\).
\[Y = \beta_0 + \beta_1 X + \epsilon\]
- \(\beta_1\): Population (true) slope of the relationship between \(X\) and \(Y\)
- \(\beta_0\): Population (true) intercept of the relationship between \(X\) and \(Y\)
- \(\epsilon\): Error terms centered at 0 with variance \(\sigma^2_{\epsilon}\)
SLR in matrix form
The simple linear regression model can be represented using vectors and matrices as
\[ \large{\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}} \]
\(\mathbf{y}\) : Vector of responses
\(\mathbf{X}\): Design matrix (columns for predictors + intercept)
\(\boldsymbol{\beta}\): Vector of model coefficients
\(\boldsymbol{\epsilon}\): Vector of error terms centered at \(\mathbf{0}\) with variance \(\sigma^2_{\epsilon}\mathbf{I}\)
SLR in matrix form
\[ \underbrace{ \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} }_ {\mathbf{y}} \hspace{3mm} = \hspace{3mm} \underbrace{ \begin{bmatrix} 1 &x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} }_{\mathbf{X}} \hspace{2mm} \underbrace{ \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} }_{\boldsymbol{\beta}} \hspace{3mm} + \hspace{3mm} \underbrace{ \begin{bmatrix} \epsilon_1 \\ \vdots\\ \epsilon_n \end{bmatrix} }_\boldsymbol{\epsilon} \]
What are the dimensions of \(\mathbf{y}\), \(\mathbf{X}\), \(\boldsymbol{\beta}\), and \(\boldsymbol{\epsilon}\)?
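As a quick numerical check of these dimensions, here is a minimal sketch with made-up data (\(n = 5\) observations); the variable names are illustrative, not from the lab or homework:

```python
import numpy as np

n = 5  # number of observations (made-up data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix: a column of ones (intercept) plus the predictor column
X = np.column_stack([np.ones(n), x])

print(X.shape)  # (5, 2): X is n x 2
print(y.shape)  # (5,):  y is n x 1 (stored as a flat array)
```

Here \(\boldsymbol{\beta}\) would be \(2 \times 1\) and \(\boldsymbol{\epsilon}\) would be \(n \times 1\).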
Find least-squares estimator for \(\boldsymbol{\beta}\)
Goal: Find estimator \(\hat{\boldsymbol{\beta}}= \begin{bmatrix}\hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix}\) that minimizes the sum of squares \[ \sum_{i=1}^n \epsilon_i^2 = \boldsymbol{\epsilon}^\mathsf{T}\boldsymbol{\epsilon} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\mathsf{T}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \]
Gradient
Let \(\mathbf{x} = \begin{bmatrix}x_1 \\ x_2 \\ \vdots \\x_k\end{bmatrix}\) be a \(k \times 1\) vector and \(f(\mathbf{x})\) be a function of \(\mathbf{x}\).
. . .
Then the gradient of \(f\) with respect to \(\mathbf{x}\) is
\[\frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix}\frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_k}\end{bmatrix} \]
Property 1: Derivative of inner product
Let \(\mathbf{x}\) be a \(k \times 1\) vector and \(\mathbf{z}\) be a \(k \times 1\) vector, such that \(\mathbf{z}\) is not a function of \(\mathbf{x}\) .
The derivative of \(\mathbf{x}^\mathsf{T}\mathbf{z}\) with respect to \(\mathbf{x}\) is
\[ \frac{\partial}{\partial \mathbf{x}}\hspace{1mm} \mathbf{x}^\mathsf{T}\mathbf{z} = \frac{\partial}{\partial \mathbf{x}} \hspace{1mm} \mathbf{z}^\mathsf{T}\mathbf{x} = \mathbf{z} \]
Property 1: Derivative of inner product
\[\begin{aligned} \mathbf{x}^\mathsf{T}\mathbf{z} &= \class{fragment}{\begin{bmatrix}x_1 & x_2 & \dots &x_k\end{bmatrix} \begin{bmatrix}z_1 \\ z_2 \\ \vdots \\z_k\end{bmatrix}} \\[10pt] &\class{fragment}{= x_1z_1 + x_2z_2 + \dots + x_kz_k} \\ &\class{fragment}{= \sum_{i=1}^k x_iz_i} \end{aligned}\](This is equivalent to \(\mathbf{z}^\mathsf{T}\mathbf{x}\))
Property 1: Derivative of inner product
\[ \frac{\partial}{\partial \mathbf{x}}\hspace{1mm}\mathbf{x}^\mathsf{T}\mathbf{z} = \class{fragment}{\begin{bmatrix}\frac{\partial \mathbf{x}^\mathsf{T}\mathbf{z}}{\partial x_1} \\ \frac{\partial \mathbf{x}^\mathsf{T}\mathbf{z}}{\partial x_2} \\ \vdots \\ \frac{\partial \mathbf{x}^\mathsf{T}\mathbf{z}}{\partial x_k}\end{bmatrix}} = \class{fragment}{\begin{bmatrix}\frac{\partial}{\partial x_1} (x_1z_1 + x_2z_2 + \dots + x_kz_k) \\ \frac{\partial}{\partial x_2} (x_1z_1 + x_2z_2 + \dots + x_kz_k)\\ \vdots \\ \frac{\partial}{\partial x_k} (x_1z_1 + x_2z_2 + \dots + x_kz_k)\end{bmatrix}} = \class{fragment}{\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_k\end{bmatrix} = \mathbf{z}} \]
Property 2: Derivative of quadratic form
Let \(\mathbf{x}\) be a \(k \times 1\) vector and \(\mathbf{A}\) be a \(k \times k\) matrix, such that \(\mathbf{A}\) is not a function of \(\mathbf{x}\) .
Then the derivative of \(\mathbf{x}^\mathsf{T}\mathbf{A}\mathbf{x}\) with respect to \(\mathbf{x}\) is
\[ \frac{\partial}{\partial \mathbf{x}} \hspace{1mm} \mathbf{x}^\mathsf{T}\mathbf{A}\mathbf{x} = (\mathbf{A}\mathbf{x} + \mathbf{A}^\mathsf{T} \mathbf{x}) = (\mathbf{A} + \mathbf{A}^\mathsf{T})\mathbf{x} \]
If \(\mathbf{A}\) is symmetric, then
\[ (\mathbf{A} + \mathbf{A}^\mathsf{T})\mathbf{x} = 2\mathbf{A}\mathbf{x} \]
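Both derivative properties can be sanity-checked numerically with finite differences. This is a sketch using random vectors and a (deliberately non-symmetric) random matrix, not part of the course materials:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3
x = rng.normal(size=k)
z = rng.normal(size=k)
A = rng.normal(size=(k, k))  # not necessarily symmetric

def num_grad(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# Property 1: d/dx (x^T z) = z
g1 = num_grad(lambda v: v @ z, x)
print(np.allclose(g1, z, atol=1e-4))  # True

# Property 2: d/dx (x^T A x) = (A + A^T) x
g2 = num_grad(lambda v: v @ A @ v, x)
print(np.allclose(g2, (A + A.T) @ x, atol=1e-4))  # True
```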
See The Matrix Cookbook for more on matrix operations.
Find the least-squares estimator
Find \(\hat{\boldsymbol{\beta}}\) that minimizes
\[ \begin{aligned} \boldsymbol{\epsilon}^\mathsf{T}\boldsymbol{\epsilon} &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\mathsf{T}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\[10pt] &= (\mathbf{y}^\mathsf{T} - \boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T})(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\\[10pt] &=\mathbf{y}^\mathsf{T}\mathbf{y} - \mathbf{y}^\mathsf{T}\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{y} + \boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta}\\[10pt] &=\mathbf{y}^\mathsf{T}\mathbf{y} - 2\boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{y} + \boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta} \end{aligned} \]
Find the least-squares estimator
\[\begin{aligned}
\frac{\partial}{\partial\boldsymbol{\beta}} \hspace{1mm} \boldsymbol{\epsilon}^\mathsf{T}\boldsymbol{\epsilon} &= \frac{\partial}{\partial\boldsymbol{\beta}}( \mathbf{y}^\mathsf{T}\mathbf{y} - 2\boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{y} + \boldsymbol{\beta}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta}) \\[10pt]
& = -2\mathbf{X}^\mathsf{T}\mathbf{y} + 2\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta}
\end{aligned}
\]
Find \(\hat{\boldsymbol{\beta}}\) that satisfies
\[ -2\mathbf{X}^\mathsf{T}\mathbf{y} + 2\mathbf{X}^\mathsf{T}\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0} \]
Assuming \(\mathbf{X}^\mathsf{T}\mathbf{X}\) is invertible,
\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\mathbf{y}\]
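As a numerical sanity check with made-up data (not part of the homework), the closed-form estimate can be compared against NumPy's built-in least-squares solver:

```python
import numpy as np

# Made-up data: y is roughly 1 + 2x plus noise
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)

X = np.column_stack([np.ones_like(x), x])  # design matrix

# Normal-equations solution: beta_hat = (X^T X)^{-1} X^T y
# (solving the linear system is preferred to explicitly inverting X^T X)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
```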
Did we find a minimum?
Hessian matrix
The Hessian matrix is a square matrix of second-order partial derivatives
\[ \frac{\partial^2 f}{\partial \mathbf{x}^2} = \begin{bmatrix} \frac{\partial^2f}{\partial x_1^2} & \frac{\partial^2f}{\partial x_1 \partial x_2} & \dots & \frac{\partial^2f}{\partial x_1\partial x_k} \\ \frac{\partial^2f}{\partial x_2 \partial x_1} & \frac{\partial^2f}{\partial x_2^2} & \dots & \frac{\partial^2f}{\partial x_2 \partial x_k} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2f}{\partial x_k\partial x_1} & \frac{\partial^2f}{\partial x_k\partial x_2} & \dots & \frac{\partial^2f}{\partial x_k^2} \end{bmatrix} \]
Using the Hessian matrix
If the Hessian matrix is…
positive-definite, then we have found a minimum.
negative-definite, then we have found a maximum.
neither positive-definite nor negative-definite, then we have found a saddle point.
Did we find a minimum?
\[ \begin{aligned} \frac{\partial^2}{\partial \boldsymbol{\beta}^2} \hspace{1mm} \boldsymbol{\epsilon}^\mathsf{T}\boldsymbol{\epsilon} &= \frac{\partial}{\partial \boldsymbol{\beta}} (-2\mathbf{X}^\mathsf{T}\mathbf{y} + 2\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta}) \\[10pt] &{=-2\frac{\partial}{\partial \boldsymbol{\beta}} (\mathbf{X}^\mathsf{T}\mathbf{y}) + 2\frac{\partial}{\partial \boldsymbol{\beta}} (\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta})} \\[10pt] &{= 2 \mathbf{X}^\mathsf{T}\mathbf{X}} \end{aligned} \]
Show that \(2\mathbf{X}^\mathsf{T}\mathbf{X}\) is positive definite in HW 01.
Positive (semi-)definite matrices
A matrix \(\mathbf{A}\) is positive definite if
\[ \mathbf{x}^\mathsf{T}\mathbf{A}\mathbf{x} > 0 \hspace{8mm} \text{for all } \mathbf{x} \neq \mathbf{0} \]
A matrix \(\mathbf{A}\) is positive semi-definite if
\[ \mathbf{x}^\mathsf{T}\mathbf{A}\mathbf{x} \geq 0 \hspace{8mm} \text{for all } \mathbf{x} \]
Equivalently:
- \(\mathbf{A}\) is positive definite if all eigenvalues are positive.
- \(\mathbf{A}\) is positive semi-definite if all eigenvalues are non-negative.
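The eigenvalue criterion is easy to illustrate numerically. This sketch uses a small symmetric matrix chosen for the example; it is an illustration only, not a substitute for the HW 01 proof:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # symmetric; eigenvalues are 1 and 3

eigvals = np.linalg.eigvalsh(A)  # eigvalsh is for symmetric matrices
print(np.all(eigvals > 0))  # True -> A is positive definite

# The quadratic form is positive for a (random) nonzero x
rng = np.random.default_rng(2)
v = rng.normal(size=2)
print(v @ A @ v > 0)  # True
```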
Geometry of least-squares regression
Geometry of least-squares regression
Let \(\text{Col}(\mathbf{X})\) be the column space of \(\mathbf{X}\): the set of all possible linear combinations (span) of the columns of \(\mathbf{X}\)
The vector of responses \(\mathbf{y}\) is generally not in \(\text{Col}(\mathbf{X})\).
Goal: Find another vector \(\mathbf{z} = \mathbf{X}\boldsymbol{\beta}\) that is in \(\text{Col}(\mathbf{X})\) and is as close as possible to \(\mathbf{y}\).
- \(\mathbf{z}\) is a projection of \(\mathbf{y}\) onto \(\text{Col}(\mathbf{X})\) .
Geometry of least-squares regression
For any \(\mathbf{z} = \mathbf{X}\boldsymbol{\beta}\) in \(\text{Col}(\mathbf{X})\), the vector \(\boldsymbol{\epsilon} = \mathbf{y} - \mathbf{X}\boldsymbol{\beta}\) is the difference between \(\mathbf{y}\) and \(\mathbf{X}\boldsymbol{\beta}\).
- We want to find \(\boldsymbol{\beta}\) such that \(\mathbf{z} = \mathbf{X}\boldsymbol{\beta}\) is as close as possible to \(\mathbf{y}\), i.e., we want to minimize the difference \(\boldsymbol{\epsilon} = \mathbf{y} - \mathbf{X}\boldsymbol{\beta}\)
This distance is minimized when \(\boldsymbol{\epsilon}\) is orthogonal to \(\text{Col}(\mathbf{X})\)
Geometry of least-squares regression
Note: If every column of an \(n \times k\) matrix \(\mathbf{A}\) is orthogonal to an \(n \times 1\) vector \(\mathbf{c}\), then \(\mathbf{A}^\mathsf{T}\mathbf{c} = \mathbf{0}\)
Therefore, we have \(\mathbf{X}^\mathsf{T}\boldsymbol{\epsilon} = \mathbf{0}\) , and thus
\[ \mathbf{X}^\mathsf{T}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{0} \]
Find \(\boldsymbol{\beta}\) that satisfies this equation.
Predicted (fitted) values
Now that we have \(\hat{\boldsymbol{\beta}}\), let’s predict values of \(\mathbf{y}\) using the model
\[ \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \underbrace{\mathbf{X}(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}}_{\mathbf{H}}\mathbf{y} = \mathbf{H}\mathbf{y} \]
. . .
Hat matrix: \(\mathbf{H} = \mathbf{X}(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\)
Hat matrix
\(\hat{\mathbf{y}} = \mathbf{Hy}\), so \(\mathbf{H}\) projects \(\mathbf{y}\) onto \(\text{Col}(\mathbf{X})\)
Properties of \(\mathbf{H}\), a projection matrix
\(\mathbf{H}\) is symmetric (\(\mathbf{H}^\mathsf{T} = \mathbf{H}\))
\(\mathbf{H}\) is idempotent (\(\mathbf{H}^2 = \mathbf{H}\))
If \(\mathbf{v}\) is in \(\text{Col}(\mathbf{X})\), then \(\mathbf{Hv} = \mathbf{v}\)
If \(\mathbf{v}\) is orthogonal to \(\text{Col}(\mathbf{X})\), then \(\mathbf{Hv} = \mathbf{0}\)
Show these properties in HW 01 and HW 02.
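These properties can be checked numerically before proving them. A sketch with made-up data (a numerical illustration only, not a substitute for the HW proofs):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# Hat matrix: H = X (X^T X)^{-1} X^T
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H, H.T))    # True: H is symmetric
print(np.allclose(H @ H, H))  # True: H is idempotent

# A vector in Col(X) is mapped to itself...
v_in = X @ np.array([1.0, -2.0])
print(np.allclose(H @ v_in, v_in))  # True

# ...and a vector orthogonal to Col(X) is mapped to 0
r = rng.normal(size=n)
v_perp = r - H @ r  # residual of projecting r, so orthogonal to Col(X)
print(np.allclose(H @ v_perp, np.zeros(n)))  # True
```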
Residuals
Recall that the residuals are the difference between the observed and predicted values
\[ \begin{aligned} \mathbf{e} &= \mathbf{y} - \hat{\mathbf{y}}\\[10pt] &\class{fragment}{ = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}} \\[10pt] &\class{fragment}{ = \mathbf{y} - \mathbf{H}\mathbf{y}} \\[20pt] \class{fragment}{\color{#993399}{\mathbf{e}}} &\class{fragment}{\color{#993399}{=(\mathbf{I} - \mathbf{H})\mathbf{y}}} \\[10pt] \end{aligned} \]
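The identity \(\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{y}\) can be verified on made-up data; the check below also confirms that the residuals are orthogonal to \(\text{Col}(\mathbf{X})\), matching the geometric picture above (an illustrative sketch, not course code):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=n)
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = (np.eye(n) - H) @ y  # residuals via (I - H) y

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(e, y - X @ beta_hat))  # True: matches y - y_hat
print(np.allclose(X.T @ e, 0))           # True: e is orthogonal to Col(X)
```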
Recap
Introduced matrix representation for simple linear regression
- Model form
- Least-squares estimate
- Predicted (fitted) values
- Residuals
Introduced the geometric interpretation of least-squares regression
For next class
Multiple linear regression
Complete the Lecture 05 prepare assignment