Inference for regression

Distribution of coefficients

Prof. Maria Tackett

February 05, 2026

Announcements

  • HW 02 due Thursday, February 12 at 11:59pm

    • Released after class
  • Exam 01 practice problems + lecture recordings posted on the menu of the course website

Topics

  • Derive distribution of \(\hat{\boldsymbol{\beta}}_j\)

Computing setup

# load packages
library(tidyverse)  
library(tidymodels)  
library(knitr)       
library(kableExtra)  
library(patchwork)   

# set default theme in ggplot2
ggplot2::theme_set(ggplot2::theme_bw())

Data: NCAA Football expenditures

Today’s data come from Equity in Athletics Data Analysis and include information about sports expenditures and revenues for colleges and universities in the United States. This data set was featured in a March 2022 Tidy Tuesday.

We will focus on the 2019 - 2020 season expenditures on football for institutions in the NCAA Division I FBS. The following variables are used in this analysis:

  • total_exp_m: Total expenditures on football in the 2019 - 2020 academic year (in millions USD)

  • enrollment_th: Total student enrollment in the 2019 - 2020 academic year (in thousands)

  • type: institution type (Public or Private)

football <- read_csv("data/ncaa-football-exp.csv")

Bivariate EDA

Regression model

exp_fit <- lm(total_exp_m ~ enrollment_th + type, data = football)
tidy(exp_fit) |>
  kable(digits = 3)
term           estimate   std.error   statistic   p.value
(Intercept)      19.332       2.984       6.478         0
enrollment_th     0.780       0.110       7.074         0
typePublic      -13.226       3.153      -4.195         0
  • 0.780 is the exact slope of the relationship between football expenditures and enrollment for these 127 institutions in the 2019 - 2020 season.

  • What if we want to say something about the relationship between these variables for all colleges and universities with football programs and across different years?

Statistical inference

  • Statistical inference provides methods and tools so we can use the single observed sample to make valid statements (inferences) about the population it comes from

  • For our inferences to be valid, the sample should be representative of the population we’re interested in (ideally a random sample)

Image source: Eugene Morgan © Penn State

Inference framework

Our objective is to infer properties of a population using data collected from an observational study (or experiment)

  • Pre data collection: Before collecting the data, the data are unknown and random. \(\hat{\boldsymbol{\beta}}\), which is a function of the data, is also unknown and random.

  • Post data collection: After collecting the data, \(\hat{\boldsymbol{\beta}}\) is fixed and known.

  • In all cases: The true population parameter, \(\boldsymbol{\beta}\) is fixed but unknown.

Question pre data collection: Is the probability distribution of \(\hat{\boldsymbol{\beta}}\) a meaningful representation of the population? (We will answer this question gradually over the next few classes)

Linear regression model

\[\begin{aligned} \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \hspace{8mm} \boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2_{\epsilon}\mathbf{I}) \end{aligned} \]

such that the errors are independent and normally distributed.

  • Independent: Knowing the error term for one observation doesn’t tell us about the error term for another observation
  • Normally distributed: The errors follow a normal distribution, which is unimodal and symmetric
    • \(E(\boldsymbol{\epsilon}) = \mathbf{0}\)
    • \(Var(\boldsymbol{\epsilon}) = \sigma^2_{\epsilon}\mathbf{I}\)

Visualizing distribution of \(\mathbf{y}|\mathbf{X}\)

\[ \mathbf{y}|\mathbf{X} \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma_\epsilon^2\mathbf{I}) \]

Image source: Introduction to the Practice of Statistics (5th ed)

Last time we showed \(E(\mathbf{y}|\mathbf{X}) = \mathbf{X}\boldsymbol{\beta}\) and \(Var(\mathbf{y}|\mathbf{X}) = \sigma^2_{\epsilon}\mathbf{I}\)

Show \(\mathbf{y}\) is normally distributed

Linear transformation of normal random variables

Suppose \(\mathbf{z}\) is a multivariate normal random variable. Then, \(\mathbf{Az} + \mathbf{b}\) is (multivariate) normal for \(\mathbf{A}\) a constant matrix and \(\mathbf{b}\) a constant vector.


Show that the distribution of \(\mathbf{y}|\mathbf{X}\) is normal.
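A sketch of one way to apply this property (conditioning on \(\mathbf{X}\), so \(\mathbf{X}\boldsymbol{\beta}\) is a constant vector):

\[\begin{aligned} \mathbf{y}|\mathbf{X} = \mathbf{I}\boldsymbol{\epsilon} + \mathbf{X}\boldsymbol{\beta}, \hspace{8mm} \boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2_{\epsilon}\mathbf{I}) \end{aligned}\]

With \(\mathbf{A} = \mathbf{I}\) and \(\mathbf{b} = \mathbf{X}\boldsymbol{\beta}\), \(\mathbf{y}|\mathbf{X}\) is a linear transformation of a multivariate normal random variable and is therefore multivariate normal. Combining this with \(E(\mathbf{y}|\mathbf{X}) = \mathbf{X}\boldsymbol{\beta}\) and \(Var(\mathbf{y}|\mathbf{X}) = \sigma^2_{\epsilon}\mathbf{I}\) gives \(\mathbf{y}|\mathbf{X} \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2_{\epsilon}\mathbf{I})\).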

Assumptions for regression

\[ \mathbf{y}|\mathbf{X} \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma_\epsilon^2\mathbf{I}) \]

Image source: Introduction to the Practice of Statistics (5th ed)
  1. Linearity: There is a linear relationship between the response and predictor variables.
  2. Equal variance: The variability about the least squares line is equal for all combinations of predictors.
  3. Normality: The distribution of the residuals is normal.
  4. Independence: The residuals are independent from one another.

Note

We will assume these hold for now and show how to check the assumptions after Exam 01.

Estimating \(\sigma^2_{\epsilon}\)

  • Once we fit the model, we can use the residuals to estimate \(\sigma_{\epsilon}^2\)

  • The estimated value \(\hat{\sigma}^2_{\epsilon}\) is needed for inference on the coefficients

\[ \hat{\sigma}^2_\epsilon = \frac{SSR}{n - p - 1} = \frac{\mathbf{e}^\mathsf{T}\mathbf{e}}{n-p-1} \]

  • The regression standard error \(\hat{\sigma}_{\epsilon}\) is a measure of the average distance between the observations and the regression line

\[ \hat{\sigma}_\epsilon = \sqrt{\frac{SSR}{n - p - 1}} = \sqrt{\frac{\mathbf{e}^\mathsf{T}\mathbf{e}}{n - p - 1}} \]
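A minimal check of this formula in R, using simulated data rather than the course data (the names x, y, and fit below are hypothetical):

```r
# Verify that sigma_hat from the SSR formula matches R's built-in sigma()
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n, sd = 1.5)   # true sigma_epsilon = 1.5
fit <- lm(y ~ x)

e <- residuals(fit)                   # e = y - y_hat
p <- length(coef(fit)) - 1            # number of predictors (here, 1)
sigma_hat <- sqrt(sum(e^2) / (n - p - 1))

all.equal(sigma_hat, sigma(fit))      # TRUE: same quantity
```

Dividing by \(n - p - 1\) rather than \(n\) accounts for the \(p + 1\) estimated coefficients, which is why `sigma()` uses the residual degrees of freedom.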

Inference for a single coefficient

Inference for \(\beta_j\)

We often want to conduct inference on individual model coefficients

  • Hypothesis test: Is there evidence of a linear relationship between the response and \(x_j\)? \((\beta_j \neq 0 ? )\)

  • Confidence interval: What is a plausible range of values \(\beta_j\) can take?

But first we need to understand the distribution of \(\hat{\beta}_j\)

Sampling distribution of \(\hat{\boldsymbol{\beta}}\)

  • A sampling distribution is the probability distribution of a statistic computed from many repeated random samples of size \(n\) drawn from a population.

  • The sampling distribution of \(\hat{\boldsymbol{\beta}}\) is the probability distribution of the estimated coefficients, formed by repeatedly taking samples of size \(n\) and fitting the regression model to compute \(\hat{\boldsymbol{\beta}}\)

\[ \hat{\boldsymbol{\beta}} \sim N(\boldsymbol{\beta}, \sigma^2_\epsilon(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}) \]

The estimated coefficients \(\hat{\boldsymbol{\beta}}\) are normally distributed with

\[ E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta} \hspace{10mm} Var(\hat{\boldsymbol{\beta}}) = \sigma^2_{\epsilon}(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1} \]

Expected value and variance of \(\hat{\boldsymbol{\beta}}\)

Show

  • \(E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}\)

  • \(Var(\hat{\boldsymbol{\beta}}) = \sigma^2_{\epsilon}(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\)


Will show that \(\hat{\boldsymbol{\beta}}\) is normally distributed in the homework.
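A sketch of the two results, starting from \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\mathbf{y}\) and using \(Var(\mathbf{A}\mathbf{y}) = \mathbf{A}\,Var(\mathbf{y})\,\mathbf{A}^\mathsf{T}\) for a constant matrix \(\mathbf{A}\):

\[\begin{aligned} E(\hat{\boldsymbol{\beta}}) &= (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}E(\mathbf{y}|\mathbf{X}) = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta} \\ Var(\hat{\boldsymbol{\beta}}) &= (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\,Var(\mathbf{y}|\mathbf{X})\,\mathbf{X}(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1} = \sigma^2_{\epsilon}(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1} \end{aligned}\]

where the last step uses \(Var(\mathbf{y}|\mathbf{X}) = \sigma^2_{\epsilon}\mathbf{I}\) and the symmetry of \((\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\).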

Sampling distribution of \(\hat{\beta}_j\)

\[ \hat{\boldsymbol{\beta}} \sim N(\boldsymbol{\beta}, \sigma^2_\epsilon(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}) \]

Let \(\mathbf{C} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\). Then, for each coefficient \(\hat{\beta}_j\),

  • \(E(\hat{\beta}_j) = \beta_j\), the \(j^{th}\) element of \(\boldsymbol{\beta}\)

  • \(Var(\hat{\beta}_j) = \sigma^2_{\epsilon}C_{jj}\), where \(C_{jj}\) is the \(j^{th}\) diagonal element of \(\mathbf{C}\)

  • \(Cov(\hat{\beta}_i, \hat{\beta}_j) = \sigma^2_{\epsilon}C_{ij}\), where \(C_{ij}\) is the \((i, j)\) element of \(\mathbf{C}\)
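In R, the estimated matrix \(\hat{\sigma}^2_{\epsilon}\mathbf{C} = \hat{\sigma}^2_{\epsilon}(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\) is returned by `vcov()`, and the square roots of its diagonal are the standard errors reported by `tidy()`. A minimal sketch with simulated data (the names x, y, and fit below are hypothetical, not the course data):

```r
# vcov() returns the estimated Var(beta_hat) = sigma_hat^2 * (X'X)^{-1}
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
fit <- lm(y ~ x)

X <- model.matrix(fit)          # design matrix (includes intercept column)
C <- solve(t(X) %*% X)          # (X'X)^{-1}
V <- sigma(fit)^2 * C           # sigma_hat^2 * (X'X)^{-1}

all.equal(V, vcov(fit))         # TRUE: matches R's built-in computation
sqrt(diag(V))                   # the std.error column reported by tidy()
```

The off-diagonal entries of `V` are the estimated covariances \(\hat{\sigma}^2_{\epsilon}C_{ij}\) between pairs of coefficient estimates.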

Application exercise

Questions

🔗 https://forms.office.com/r/h4BdM85Bjf

Recap

  • Derived the distribution of the coefficients \(\hat{\boldsymbol{\beta}}\)

Next class

  • Hypothesis tests and confidence intervals for a coefficient \(\beta_j\)
  • Complete Lecture 09 prepare