Model selection

Author

Prof. Maria Tackett

Published

Mar 17, 2026

Announcements

Mid-semester feedback

  • What is helping your learning:

    • Attending lectures

    • Working on assignments with others

    • Getting help during labs and office hours

  • What you want more of to help with learning:

    • More practice - both theory and applied exercises

    • Review exercises in lab

Topics

  • Recap: Properties of \(\hat{\boldsymbol{\beta}}\)
  • Model selection

Properties of \(\hat{\boldsymbol{\beta}}\)

Finite sample properties

The least-squares estimator has two useful finite-sample properties

  • Unbiased estimator

    \[ E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta} \]

  • Best Linear Unbiased Estimator (BLUE)

    • “Best” meaning the lowest variance among the class of linear unbiased estimators

    • Proved via the Gauss-Markov Theorem
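Unbiasedness can also be checked empirically. Below is a small Monte Carlo sketch in Python (the true coefficients, sample size, and error distribution are illustrative choices, not from the slides): fitting the same simple linear model to many simulated samples and averaging the estimates recovers \(\boldsymbol{\beta}\).

```python
import numpy as np

rng = np.random.default_rng(42)
beta = np.array([2.0, -1.5])          # true coefficients (intercept, slope)
n, reps = 50, 5000

estimates = np.empty((reps, 2))
for r in range(reps):
    x = rng.uniform(0, 10, n)
    X = np.column_stack([np.ones(n), x])    # design matrix with intercept
    y = X @ beta + rng.normal(0, 1, n)      # linear model with N(0, 1) errors
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares fit

# Averaging the estimates over many samples should come close to beta
print(estimates.mean(axis=0))
```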

Asymptotic properties

The least-squares estimator inherits asymptotic properties of maximum likelihood estimators when \(\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2_{\epsilon}\mathbf{I})\)

Consistency As \(n \rightarrow \infty\), \(\hat{\boldsymbol{\beta}}\) will be arbitrarily close to \(\boldsymbol{\beta}\) with high probability

\[ \displaystyle \lim_{n\to\infty} P(|\hat{\boldsymbol{\beta}}_n - \boldsymbol{\beta}| \geq c) = 0 \]

  • Proved by showing the mean squared error vanishes (convergence in MSE implies convergence in probability), i.e.,

    • \(\lim_{n \to \infty} \text{Var}(\hat{\boldsymbol{\beta}}) = 0\)

    • \(\lim_{n \to \infty} \text{Bias}(\hat{\boldsymbol{\beta}}) = 0\)
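Consistency can be seen in simulation (the model, sample sizes, and repetition counts below are illustrative assumptions): the sampling standard deviation of the slope estimate shrinks toward zero as \(n\) grows, so \(\hat{\boldsymbol{\beta}}\) concentrates around \(\boldsymbol{\beta}\).

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([1.0, 0.5])           # true coefficients (intercept, slope)

def slope_sd(n, reps=2000):
    """Sampling standard deviation of the slope estimate at sample size n."""
    est = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, 10, n)
        X = np.column_stack([np.ones(n), x])
        y = X @ beta + rng.normal(0, 1, n)
        est[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]
    return est.std()

# The spread of the estimates shrinks toward 0 as n grows
print([round(slope_sd(n), 4) for n in (20, 200, 2000)])
```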

Asymptotic properties

The least-squares estimator inherits asymptotic properties of maximum likelihood estimators when \(\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2_{\epsilon}\mathbf{I})\)

  • Efficiency: Lowest variance among a class of estimators

    • Lowest variance among all unbiased estimators when the errors are normally distributed

  • Asymptotic normality: When \(n\) is large, the distribution of \(\hat{\boldsymbol{\beta}}\) is approximately normal regardless of the distribution of the underlying data
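Asymptotic normality can be illustrated with a simulation sketch (the heavily skewed exponential errors and other settings are illustrative choices): even though the errors are far from normal, the slope estimate's sampling distribution is close to normal, so about 95% of standardized estimates fall within \(\pm 1.96\).

```python
import numpy as np

rng = np.random.default_rng(1)
beta = np.array([1.0, 0.5])           # true coefficients (intercept, slope)
n, reps = 500, 4000

slopes = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 10, n)
    X = np.column_stack([np.ones(n), x])
    eps = rng.exponential(1.0, n) - 1.0    # heavily skewed errors with mean 0
    y = X @ beta + eps
    slopes[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]

# Despite the skewed errors, the slope's sampling distribution is
# approximately normal: about 95% of standardized estimates are within 1.96
z = (slopes - slopes.mean()) / slopes.std()
print(np.mean(np.abs(z) < 1.96))
```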

Model selection

Model selection goals

  • The principle of parsimony is attributed to William of Occam (early 14th-century English nominalist philosopher), who insisted that, given a set of equally good explanations for a given phenomenon, the correct explanation is the simplest explanation1

  • Called Occam’s razor because he “shaved” his explanations down to the bare minimum

  • Parsimony in modeling:

    • models should have as few parameters as possible

    • linear models should be preferred to non-linear models

    • experiments relying on few assumptions should be preferred to those relying on many

    • models should be pared down until they are minimal adequate

    • simple explanations should be preferred to complex explanations

In pursuit of Occam’s razor

  • Occam’s razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected

  • Model selection follows this principle

  • We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power to the model

  • In other words, we prefer the simplest model that fits well, i.e., a parsimonious model

Alternate views

Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.


Radford Neal - Bayesian Learning for Neural Networks2

Model selection statistics

You can use multiple statistics to help inform model selection. If the statistics suggest different models, use the primary analysis objective to determine how to proceed.

  • When the primary analysis objective is explanation, prioritize

    • Adj. \(R^2\)

    • Bayesian Information Criterion (BIC)

  • When the primary analysis objective is prediction, prioritize

    • Root Mean Square Error (RMSE)

    • Akaike’s Information Criterion (AIC)

(BIC’s stronger penalty favors recovering the simpler data-generating model; AIC is asymptotically equivalent to leave-one-out cross-validation, which targets predictive accuracy.)

AIC & BIC

Akaike’s Information Criterion (AIC): \[AIC = -2 \log L + 2 \times (p + 1) \]

Bayesian Information Criterion (BIC): \[BIC = -2 \log L + \log(n) \times (p + 1)\]

where \(\log L\) is the log-likelihood evaluated at the maximum likelihood estimate, \((p + 1)\) is the number of terms in the model, and \(n\) is the sample size
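These formulas can be computed directly from a least-squares fit. The sketch below is my own (the helper name `aic_bic` and the simulated data are hypothetical); it follows the slides’ convention of penalizing the \(p + 1\) model terms (some software also counts \(\sigma^2_{\epsilon}\) as a parameter) and uses the Gaussian profile log-likelihood \(\log L = -\frac{n}{2}\left(\log(2\pi) + \log(RSS/n) + 1\right)\).

```python
import numpy as np

def aic_bic(X, y):
    """AIC and BIC for a Gaussian linear model fit by least squares.

    Penalty counts the (p + 1) model terms (intercept plus p predictors),
    matching the slides' convention.
    """
    n, terms = X.shape                # terms = p + 1 columns incl. intercept
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)
    # Profile log-likelihood at the MLE sigma^2 = RSS / n
    log_lik = -n / 2 * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    aic = -2 * log_lik + 2 * terms
    bic = -2 * log_lik + np.log(n) * terms
    return aic, bic

# Hypothetical example: compare a model with and without an extra predictor
rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)               # irrelevant predictor
y = 1 + 2 * x1 + rng.normal(size=n)
X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, x2])
print(aic_bic(X_small, y), aic_bic(X_big, y))
```

Note that for a fixed fit, \(BIC - AIC = (\log n - 2)(p + 1)\), so the two criteria differ only through the penalty term.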

AIC & BIC

\[\begin{aligned} & AIC = -2 \log L + 2 \times (p + 1) \\ & BIC =-2 \log L + \log(n) \times (p + 1)\end{aligned}\]

  • First term \((-2 \log L)\): Generally decreases as \(p\) increases

  • Second term (the penalty): Increases as \(p\) increases

Do we prefer smaller or larger AIC? BIC?

Using AIC & BIC

\[\begin{aligned} & AIC = -2 \log L + 2 \times (p + 1) \\ & BIC =-2 \log L + \log(n) \times (p + 1)\end{aligned}\]

  • Choose model with the smaller value of AIC or BIC

  • Penalty for BIC increases with the sample size

    Penalty for BIC is greater than the penalty for AIC when \(n >\) ____ ?

Application exercise

Next class

  • Ridge regression

  • No prepare assignment

Footnotes

  1. Source: The R Book by Michael J. Crawley↩︎

  2. Suggested blog post: Occam by Andrew Gelman↩︎