Model selection
Announcements
HW 03 due March 19 at 11:59pm
Project exploratory data analysis due March 19
Statistics experience due April 2
DataFest March 20 - 22: https://dukestatsci.github.io/datafest/
Mid-semester feedback
What is helping your learning:
Attending lectures
Working on assignments with others
Getting help during labs and office hours
What you want more of to help with learning:
More practice - both theory and applied exercises
Review exercises in lab
Topics
- Recap: Properties of \(\hat{\boldsymbol{\beta}}\)
- Model selection
Properties of \(\hat{\boldsymbol{\beta}}\)
Finite sample properties
The least-squares estimator has two useful finite-sample properties:
Unbiased estimator
\[ E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta} \]
Best Linear Unbiased Estimator (BLUE)
“Best” meaning the lowest variance among the class of linear unbiased estimators
Proved the Gauss-Markov Theorem
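Unbiasedness can be checked empirically. A minimal Python sketch (the design matrix, true coefficients, error variance, and seed are all illustrative assumptions, not from the slides): refit the model on many fresh error draws and average the estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 5000
beta = np.array([2.0, -1.5])          # true coefficients (intercept, slope)
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])  # design matrix, held fixed across replicates

estimates = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta + rng.normal(0, 1, n)                   # new error draw each replicate
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares fit

print(estimates.mean(axis=0))  # average of beta-hat over replicates, close to (2.0, -1.5)
```

Averaging over many replicates approximates \(E(\hat{\boldsymbol{\beta}})\), which should match the true \(\boldsymbol{\beta}\).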
Asymptotic properties
The least-squares estimator inherits asymptotic properties of maximum likelihood estimators when \(\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2_{\epsilon}\mathbf{I})\)
Consistency As \(n \rightarrow \infty\), \(\hat{\boldsymbol{\beta}}\) will be arbitrarily close to \(\boldsymbol{\beta}\) with high probability
\[ \lim_{n\to\infty} P(|\hat{\boldsymbol{\beta}}_n - \boldsymbol{\beta}| \geq c) = 0 \quad \text{for every } c > 0 \]
Proved using the result that an estimator is consistent if both
\(\lim_{n \to \infty} Var(\hat{\boldsymbol{\beta}}) = 0\)
\(\lim_{n \to \infty} Bias(\hat{\boldsymbol{\beta}}) = 0\)
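The variance condition can be seen in a quick simulation: the spread of the slope estimates shrinks as \(n\) grows. All data, coefficients, and sample sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = np.array([2.0, -1.5])  # true (intercept, slope)
reps = 2000

def slope_sd(n):
    """Std. dev. of the least-squares slope estimate across replicates of size n."""
    slopes = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, 10, n)
        X = np.column_stack([np.ones(n), x])
        y = X @ beta + rng.normal(0, 1, n)
        slopes[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]
    return slopes.std()

sds = [slope_sd(n) for n in (20, 200, 2000)]
print(sds)  # each entry smaller than the last: Var(beta-hat) -> 0 as n grows
```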
Asymptotic properties
The least-squares estimator inherits asymptotic properties of maximum likelihood estimators when \(\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2_{\epsilon}\mathbf{I})\)
Efficiency: Lowest variance among a class of estimators
- Lowest variance among all unbiased estimators when \(n\) is large
Asymptotic normality: When \(n\) is large, distribution of \(\hat{\boldsymbol{\beta}}\) is normal regardless of the distribution of the underlying data
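A simulation sketch of asymptotic normality, using deliberately skewed (centered exponential) errors rather than normal ones; the sample size, coefficients, and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 500, 4000
beta = np.array([2.0, -1.5])

slopes = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 10, n)
    X = np.column_stack([np.ones(n), x])
    eps = rng.exponential(1.0, n) - 1.0  # skewed, mean-zero errors (not normal)
    y = X @ beta + eps
    slopes[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]

# standardize the slope estimates and check agreement with a N(0, 1) tail probability
z = (slopes - slopes.mean()) / slopes.std()
print(np.mean(np.abs(z) < 1.96))  # close to 0.95 if the sampling distribution is approximately normal
```

Even though the errors are skewed, the sampling distribution of the slope estimate is approximately normal at this \(n\).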
Model selection
Model selection goals
The principle of parsimony is attributed to William of Occam (early 14th-century English nominalist philosopher), who insisted that, given a set of equally good explanations for a given phenomenon, the correct explanation is the simplest explanation
Called Occam’s razor because he “shaved” his explanations down to the bare minimum
Parsimony in modeling:
models should have as few parameters as possible
linear models should be preferred to non-linear models
experiments relying on few assumptions should be preferred to those relying on many
models should be pared down until they are minimal adequate
simple explanations should be preferred to complex explanations
In pursuit of Occam’s razor
Occam’s razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected
Model selection follows this principle
We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model
In other words, we prefer the simplest best model, i.e. parsimonious model
Alternate views
Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.
Radford Neal - Bayesian Learning for Neural Networks
Model selection statistics
You can use multiple statistics to help inform model selection. If the statistics suggest different models, use the primary analysis objective to determine how to proceed.
When primary analysis objective is explanation, prioritize
Adj. \(R^2\)
Akaike’s Information Criterion (AIC)
When primary analysis objective is prediction, prioritize
Root Mean Square Error (RMSE)
Bayesian Information Criterion (BIC)
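The slide does not show formulas for adjusted \(R^2\) or RMSE, so here is a minimal Python sketch using their standard definitions, comparing two candidate least-squares fits; the data, coefficients, and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)  # candidate extra predictor, unrelated to y
y = 1.0 + 2.0 * x1 + rng.normal(0, 1, n)

def adj_r2_rmse(X, y):
    """Adjusted R^2 and RMSE for a least-squares fit (standard definitions)."""
    n, p_plus_1 = X.shape  # p + 1 = number of model terms, including intercept
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    adj_r2 = 1 - (rss / (n - p_plus_1)) / (tss / (n - 1))
    rmse = np.sqrt(rss / n)
    return adj_r2, rmse

X1 = np.column_stack([np.ones(n), x1])      # intercept + x1
X2 = np.column_stack([np.ones(n), x1, x2])  # intercept + x1 + x2
print(adj_r2_rmse(X1, y))
print(adj_r2_rmse(X2, y))
```

Note that RMSE on the training data can never increase when a predictor is added, which is why the penalized statistics are needed for selection.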
AIC & BIC
Akaike’s Information Criterion (AIC): \[AIC = -2 \log L + 2 \times (p + 1) \]
Bayesian Information Criterion (BIC): \[BIC = -2 \log L + \log(n) \times (p + 1)\]
where \(\log L\) is the log-likelihood evaluated at the maximum likelihood estimate, \((p + 1)\) is the number of terms in the model, and \(n\) is the sample size
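The formulas above can be computed directly. This Python sketch uses the Gaussian log-likelihood evaluated at the maximum likelihood estimate (\(\hat{\sigma}^2 = RSS/n\)) and counts \(p + 1\) model terms, as defined on the slide; the simulated data, coefficients, and seed are illustrative assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)  # noise variable, unrelated to y
y = 1.0 + 2.0 * x1 + rng.normal(0, 1, n)

def aic_bic(X, y):
    """AIC and BIC from the slide formulas, with the Gaussian log-likelihood
    evaluated at the MLE (sigma^2-hat = RSS / n)."""
    n, p_plus_1 = X.shape  # p + 1 = number of terms in the model
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta_hat) ** 2)
    log_l = -n / 2 * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    aic = -2 * log_l + 2 * p_plus_1
    bic = -2 * log_l + math.log(n) * p_plus_1
    return aic, bic

X_small = np.column_stack([np.ones(n), x1])      # intercept + x1
X_big = np.column_stack([np.ones(n), x1, x2])    # intercept + x1 + x2
print(aic_bic(X_small, y))  # smaller values indicate the preferred model
print(aic_bic(X_big, y))
```

Conventions differ on whether \(\sigma^2\) is also counted as an estimated parameter; this sketch follows the slide's count of \((p + 1)\) terms.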
AIC & BIC
\[\begin{aligned} & AIC = -2 \log L + 2 \times (p + 1) \\ & BIC =-2 \log L + \log(n) \times (p + 1)\end{aligned}\]
. . .
First term: Generally decreases as \(p\) increases
Second term (the penalty): Increases as \(p\) increases
Do we prefer smaller or larger AIC? BIC?
Using AIC & BIC
\[\begin{aligned} & AIC = -2 \log L + 2 \times (p + 1) \\
& BIC =-2 \log L + \log(n) \times (p + 1)\end{aligned}\]
Choose the model with the smaller value of AIC or BIC
- See Table 10.3 in Introduction to Regression Analysis for comparison guidelines
Penalty for BIC increases with the sample size
Penalty for BIC is greater than the penalty for AIC when \(n >\) ____ ?
Application exercise
Next class
Ridge regression
No prepare assignment