# load packages
library(tidyverse)
library(tidymodels)
library(knitr)
library(kableExtra)
library(patchwork)
# set default theme in ggplot2
ggplot2::theme_set(ggplot2::theme_bw())
Inference for regression
Cont’d
Announcements
Research topics due TODAY at 11:59pm on GitHub
HW 02 due Thursday, February 12 at 11:59pm
Exam 01
- Exam 01 practice
- Math rules (will be provided on exam)
- Lecture recordings
- Prepare readings (see course schedule)
- Lecture notes
- AEs
- Lab and HW assignments
Topics
- Inference for a single coefficient
Computing setup
Data: NCAA Football expenditures
Today’s data come from Equity in Athletics Data Analysis and include information about sports expenditures and revenues for colleges and universities in the United States. This data set was featured in a March 2022 Tidy Tuesday.
We will focus on the 2019-2020 season expenditures on football for institutions in the NCAA Division 1 FBS. The variables are:
- total_exp_m: Total expenditures on football in the 2019-2020 academic year (in millions USD)
- enrollment_th: Total student enrollment in the 2019-2020 academic year (in thousands)
- type: Institution type (Public or Private)
football <- read_csv("data/ncaa-football-exp.csv")
Regression model
exp_fit <- lm(total_exp_m ~ enrollment_th + type, data = football)
tidy(exp_fit) |>
  kable(digits = 3)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 19.332 | 2.984 | 6.478 | 0 |
| enrollment_th | 0.780 | 0.110 | 7.074 | 0 |
| typePublic | -13.226 | 3.153 | -4.195 | 0 |
Inference for a single coefficient
Inference for \(\beta_j\)
We often want to conduct inference on individual model coefficients
Hypothesis test: Is there a linear relationship between the response and \(x_j\)?
Confidence interval: What is a plausible range of values \(\beta_j\) can take?
Sampling distribution of \(\hat{\beta}\)
A sampling distribution is the probability distribution of a statistic for a large number of random samples of size \(n\) from a population
The sampling distribution of \(\hat{\boldsymbol{\beta}}\) is the probability distribution of the estimated coefficients if we repeatedly took samples of size \(n\) and fit the regression model
\[ \hat{\boldsymbol{\beta}} \sim N(\boldsymbol{\beta}, \sigma^2_\epsilon(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}) \]
The estimated coefficients \(\hat{\boldsymbol{\beta}}\) are normally distributed with
\[ E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta} \hspace{13mm} Var(\hat{\boldsymbol{\beta}}) = \sigma^2_{\epsilon}(\boldsymbol{X}^\mathsf{T}\boldsymbol{X})^{-1} \]
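The idea of "repeatedly taking samples and refitting" can be sketched with a small simulation. This is only an illustration under made-up assumptions: the sample size, true coefficients, error standard deviation, and design matrix below are hypothetical and are not the football data.

```r
# Simulate the sampling distribution of the least-squares coefficients.
# All values (n, beta, sigma, the design) are hypothetical.
set.seed(210)

n     <- 100                 # sample size
beta  <- c(2, 0.5, -1)       # assumed true coefficients (intercept, x1, x2)
sigma <- 1.5                 # assumed error standard deviation

# Fixed design matrix with an intercept column
X <- cbind(1, rnorm(n, mean = 10, sd = 2), rbinom(n, size = 1, prob = 0.5))

# Draw new errors, refit by least squares, and return the estimates
one_fit <- function() {
  y <- X %*% beta + rnorm(n, sd = sigma)
  as.vector(solve(crossprod(X), crossprod(X, y)))
}
beta_hats <- t(replicate(5000, one_fit()))   # 5000 x 3 matrix of estimates

# Compare empirical mean and covariance to the theoretical values
emp_mean <- colMeans(beta_hats)
emp_cov  <- cov(beta_hats)
theo_cov <- sigma^2 * solve(crossprod(X))

round(rbind(true = beta, empirical = emp_mean), 3)
round(emp_cov, 4)
round(theo_cov, 4)
```

With enough replications, the empirical mean should sit close to the assumed \(\boldsymbol{\beta}\) and the empirical covariance close to \(\sigma^2_\epsilon(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\).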
Sampling distribution of \(\hat{\beta}_j\)
\[ \hat{\boldsymbol{\beta}} \sim N(\boldsymbol{\beta}, \sigma^2_\epsilon(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}) \]
Let \(\mathbf{C} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\). Then, for each coefficient \(\hat{\beta}_j\),
\(E(\hat{\beta}_j) = \beta_j\), the \(j^{th}\) element of \(\boldsymbol{\beta}\)
\(Var(\hat{\beta}_j) = \sigma^2_{\epsilon}C_{jj}\)
\(Cov(\hat{\beta}_i, \hat{\beta}_j) = \sigma^2_{\epsilon}C_{ij}\)
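As a rough check of these formulas, the sketch below builds \(\mathbf{C} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\) by hand and compares the result with what `vcov()` and `summary()` report. The data are simulated (hypothetical), and because \(\sigma^2_\epsilon\) is unknown in practice, the code plugs in the usual estimate \(\hat{\sigma}^2_\epsilon = SSR / (n - p - 1)\), which is an assumption added here for the illustration.

```r
# Variance and standard errors of the coefficients "by hand".
# Simulated data; the variable names and true values are hypothetical.
set.seed(1)
n  <- 60
x1 <- rnorm(n)
x2 <- rbinom(n, size = 1, prob = 0.5)
y  <- 1 + 2 * x1 - 1 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

X <- model.matrix(fit)                 # design matrix, including the intercept
C <- solve(t(X) %*% X)                 # C = (X'X)^{-1}

# Plug-in estimate of the error variance: SSR / (n - p - 1)
sigma2_hat <- sum(resid(fit)^2) / df.residual(fit)

V         <- sigma2_hat * C            # estimated Var(beta-hat)
se_manual <- sqrt(diag(V))             # sqrt of sigma2_hat * C_jj

# Both comparisons should return TRUE
all.equal(V, vcov(fit), check.attributes = FALSE)
all.equal(se_manual, summary(fit)$coefficients[, "Std. Error"],
          check.attributes = FALSE)
```

The diagonal of `V` gives the estimated variances of the coefficients, and the off-diagonal entries give the estimated covariances between pairs of coefficients.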
Hypothesis test for \(\beta_j\)
Hypothesis test for \(\beta_j\): Hypotheses
Null: There is no linear relationship between institution type and football expenditure, after adjusting for enrollment, \(H_0: \beta_j = 0\)
Alternative: There is a linear relationship between institution type and football expenditure, after adjusting for enrollment, \(H_a: \beta_j \neq 0\)
Hypothesis test as US court trial
Null hypothesis, \(H_0\) : Defendant is innocent
Alternative hypothesis, \(H_a\) : Defendant is guilty
Present the evidence: Collect data
Judge the evidence: “Could these data plausibly have happened by chance if the null hypothesis were true?”
Yes: Fail to reject \(H_0\)
No: Reject \(H_0\)
Steps for a hypothesis test
- State the null and alternative hypotheses.
- Calculate a test statistic.
- Calculate the p-value.
- State the conclusion.
Let’s walk through the steps to test \(\beta_j\), the coefficient for typePublic.
Hypothesis test for \(\beta_j\): Test statistic
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 19.332 | 2.984 | 6.478 | 0 |
| enrollment_th | 0.780 | 0.110 | 7.074 | 0 |
| typePublic | -13.226 | 3.153 | -4.195 | 0 |
Test statistic: Number of standard errors the estimate is away from the null
\[ \text{Test Statistic} = \frac{\text{Estimate} - \text{Null}}{\text{Standard error}} = \frac{-13.226 - 0}{3.153} = -4.195 \]
This means the estimated coefficient of -13.226 is 4.195 standard errors below the hypothesized value of 0.
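The arithmetic can be reproduced directly from the rounded values in the table. This is a small sketch; the variable names are made up for illustration.

```r
# Test statistic for typePublic, using the rounded table values
estimate   <- -13.226   # estimated coefficient for typePublic
std_error  <- 3.153     # its standard error
null_value <- 0         # value of beta_j under H0

t_stat <- (estimate - null_value) / std_error
round(t_stat, 3)
```

Because the inputs are rounded to three decimals, the result may differ slightly in later digits from the statistic computed on the unrounded fit.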
Hypothesis test for \(\beta_j\): p-value
- Under \(H_0\), the test statistic follows a \(t\) distribution with \(n - p - 1 = 124\) degrees of freedom.
\[ \text{p-value} = P(|T| > |-4.195|) \]
2 * pt(4.195, df = nrow(football) - 2 - 1, lower.tail = FALSE)
[1] 0.00005153923
If \(H_0\) is true (\(\beta_j = 0\)), the probability of observing an estimated coefficient at least as extreme as -13.226 is \(\approx 0.00005\).
Hypothesis test for \(\beta_j\): Conclusion
The p-value is \(\approx 0\), so we reject \(H_0\).
The data provide sufficient evidence that \(\beta_j \neq 0\). That is, there is evidence of a linear relationship between institution type and football expenditure, after adjusting for enrollment.
Confidence interval for \(\beta_j\)
A plausible range of values for a population parameter is called a confidence interval
Using only a single point estimate is like fishing in a murky lake with a spear, and using a confidence interval is like fishing with a net
We can throw a spear where we saw a fish, but we will probably miss. If we toss a net in that area, we have a good chance of catching the fish
Similarly, if we report a point estimate, we probably will not hit the exact population parameter, but if we report a range of plausible values we have a good shot at capturing the parameter
What “confidence” means
We will construct \(C\%\) confidence intervals.
- The confidence level impacts the width of the interval
“Confident” means if we were to take repeated samples of the same size as our data, fit regression lines using the same predictors, and calculate \(C\%\) confidence intervals for the coefficient of \(x_j\), then \(C\%\) of those intervals would contain the true value of the coefficient \(\beta_j\).
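This repeated-sampling interpretation can be checked with a small simulation. The sketch below is only illustrative: the true slope, sample size, and error standard deviation are hypothetical, and the model has a single predictor to keep the code short.

```r
# Coverage of 95% confidence intervals under repeated sampling.
# All parameter values are hypothetical.
set.seed(2026)
n_sims <- 2000     # number of repeated samples
n      <- 50       # size of each sample
beta1  <- 0.8      # assumed true slope
level  <- 0.95     # confidence level

covered <- replicate(n_sims, {
  x  <- runif(n, min = 0, max = 10)
  y  <- 3 + beta1 * x + rnorm(n, sd = 2)
  ci <- confint(lm(y ~ x), parm = "x", level = level)
  ci[1] <= beta1 && beta1 <= ci[2]     # does this interval capture beta1?
})

# Proportion of intervals that contain the true slope
coverage <- mean(covered)
coverage
```

The proportion should land near 0.95, with some simulation noise. Any single interval either contains \(\beta_j\) or it does not; the 95% describes the procedure across many samples.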
Balance precision and accuracy when selecting a confidence level
Pre-data collection, \(Pr(\beta_j \text{ in } 95\% \text{ CI })\)?
Post-data collection, \(Pr(\beta_j \text{ in } 95\% \text{ CI })\)?
Confidence interval for \(\beta_j\)
\[ \text{Estimate} \pm \text{ (critical value) } \times \text{SE} \]
\[ \hat{\beta}_j \pm t^* \times SE(\hat{\beta}_j) \]
where \(t^*\) is calculated from a \(t\) distribution with \(n-p-1\) degrees of freedom, and \(p\) is the number of predictor terms in the model (here \(p = 2\)).
Confidence interval: Critical value
# confidence level: 95%
qt(0.975, df = nrow(football) - 2 - 1)
[1] 1.97928
# confidence level: 90%
qt(0.95, df = nrow(football) - 2 - 1)
[1] 1.657235
# confidence level: 99%
qt(0.995, df = nrow(football) - 2 - 1)
[1] 2.61606
95% CI for \(\beta_j\): Calculation
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 19.332 | 2.984 | 6.478 | 0 |
| enrollment_th | 0.780 | 0.110 | 7.074 | 0 |
| typePublic | -13.226 | 3.153 | -4.195 | 0 |
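A sketch of the calculation for typePublic, using the rounded estimate and standard error from the table and the 124 degrees of freedom stated for the hypothesis test. The variable names are made up for illustration.

```r
# 95% CI for the typePublic coefficient from the rounded table values
estimate  <- -13.226    # estimated coefficient
std_error <- 3.153      # standard error
df_resid  <- 124        # n - p - 1, as stated for the t distribution

t_star <- qt(0.975, df = df_resid)               # critical value for 95%
ci     <- estimate + c(-1, 1) * t_star * std_error
round(ci, 3)
```

Because the inputs are rounded, the endpoints can differ in the third decimal place from an interval computed on the unrounded model fit.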
95% CI for \(\beta_j\) in R
tidy(exp_fit, conf.int = TRUE, conf.level = 0.95) |>
  kable(digits = 3)

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 19.332 | 2.984 | 6.478 | 0 | 13.426 | 25.239 |
| enrollment_th | 0.780 | 0.110 | 7.074 | 0 | 0.562 | 0.999 |
| typePublic | -13.226 | 3.153 | -4.195 | 0 | -19.466 | -6.986 |
Interpretation: We are 95% confident that for each additional 1,000 students enrolled, the institution’s expenditures on football will be greater by $562,000 to $999,000, on average, holding institution type constant.
Application exercise
Recap
Conducted hypothesis tests for a single coefficient \(\beta_j\)
Computed and interpreted confidence intervals for a single coefficient \(\beta_j\)
Next class
- Exam 01 review