AE 05: Model selection

Published

March 17, 2026

Important

Go to the course GitHub organization and locate your ae-05 repo to get started.

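library(tidyverse)   # data wrangling and visualization
library(knitr)       # formatted tables
library(tidymodels)  # modeling tools, including glance() from broom

nba_salaries <- read_csv("data/nba-salaries-2022-23.csv")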

Introduction

Today’s dataset contains player information and statistics for NBA players in the 2022-23 season. It is a subset of the NBA Player Salaries (2022-23 Season) dataset on Kaggle. The data were originally collected from the websites Hoopshype and Basketball Reference.

The goal of the analysis is to use player features and statistics to understand variability in salary.

Response

  • Salary: salary in millions of US dollars

Predictors

  • Position: player position
  • Age: age of player
  • GP: total number of games played in the season
  • MP: average minutes played per game
  • GS: number of games the player started
  • FG: average number of field goals (shots) made per game
  • FG%: percentage of shots made
  • 2PA: average number of 2-point shots attempted per game
  • 2P%: percentage of 2-point shots made
  • AST: average number of assists per game
  • AST%: percentage of teammate field goals a player assisted on while they were on the floor

Exercise 1

We are interested in comparing the salaries of point guards (PG) with the salaries of players in all other positions. Create a new indicator variable called PG that is TRUE if the player is a point guard and FALSE otherwise.

Use PG instead of Position for the remainder of the AE.
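One possible way to build such an indicator is sketched below on a few made-up rows. The tibble toy_players and its values are hypothetical stand-ins for nba_salaries, and the sketch assumes point guards are coded exactly as "PG" in Position; check the actual values in the data before relying on that.

```r
library(tidyverse)

# Hypothetical toy rows standing in for nba_salaries (values are made up)
toy_players <- tribble(
  ~Position, ~Age, ~Salary,
  "PG",      25,   10.5,
  "SG",      30,    4.2,
  "C",       22,    1.8,
  "PG",      33,   20.0
)

# Assumption: point guards are coded exactly as "PG".
# The comparison returns TRUE for point guards and FALSE for all other positions.
toy_players <- toy_players |>
  mutate(PG = Position == "PG")

# Quick check of how many players fall in each group
toy_players |>
  count(PG)
```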

Exercise 2

We will consider an interaction effect between Age and PG. Make a visualization that can be used to explore this potential effect. Based on the visualization, does the relationship between Age and Salary appear to differ between point guards and other players?
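A common way to explore an interaction between a quantitative predictor and an indicator is a scatterplot with a separate fitted line for each group. The sketch below uses simulated data (toy_players is hypothetical, not the NBA data) and is only one of several reasonable plots. Lines with clearly different slopes suggest the effect of Age may depend on PG; roughly parallel lines suggest it may not.

```r
library(tidyverse)

# Simulated toy data (hypothetical; not the NBA salary data)
set.seed(221)
toy_players <- tibble(
  Age = sample(20:36, 60, replace = TRUE),
  PG  = rep(c(TRUE, FALSE), each = 30)
) |>
  # The slope on Age is deliberately different for the two groups
  mutate(Salary = 2 + ifelse(PG, 0.9, 0.3) * (Age - 20) + rnorm(60, sd = 2))

# Scatterplot of Salary versus Age with one least-squares line per PG group
age_pg_plot <- ggplot(toy_players, aes(x = Age, y = Salary, color = PG)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Age",
    y = "Salary (millions of US dollars)",
    color = "Point guard"
  )

age_pg_plot
```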

Exercise 3

Fit the main effects model using all predictors in the data to explain variability in Salary.

Tip

As a shortcut, remove Player Name and Position from the data and use the syntax Salary ~ . in lm(). The . tells lm() to use every column in the data (except Salary) as a predictor.
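The shortcut in the tip might look like the following minimal sketch. The data frame toy is simulated and contains only a few hypothetical columns, so the fitted values mean nothing; the point is the select() step followed by Salary ~ .

```r
library(tidyverse)

# Simulated toy data (hypothetical; only a few columns, not the NBA data)
set.seed(221)
toy <- tibble(
  `Player Name` = paste("Player", 1:20),
  Position = rep(c("PG", "SG", "SF", "PF", "C"), times = 4),
  Age = sample(20:35, 20, replace = TRUE),
  MP = runif(20, min = 5, max = 36)
) |>
  mutate(
    PG = Position == "PG",
    Salary = 0.3 * MP + rnorm(20)
  )

# Drop the columns that should not enter the model
toy_model_data <- toy |>
  select(-`Player Name`, -Position)

# The . expands to every remaining column except the response Salary
main_fit <- lm(Salary ~ ., data = toy_model_data)

names(coef(main_fit))
```

Because PG is a logical column, lm() reports its coefficient under the name PGTRUE.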

Exercise 4

Use the glance() function to compute AIC and BIC for the model in the previous exercise.
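As an illustration, glance() returns a one-row tibble of model-level statistics, and AIC and BIC are two of its columns. The sketch below fits a model to simulated data (toy is hypothetical) and pulls out just those two columns.

```r
library(tidyverse)
library(broom)

# Simulated toy data (hypothetical; not the NBA salary data)
set.seed(221)
toy <- tibble(
  Age = sample(20:36, 40, replace = TRUE),
  MP = runif(40, min = 5, max = 36)
) |>
  mutate(Salary = 0.4 * MP + 0.2 * Age + rnorm(40, sd = 2))

main_fit <- lm(Salary ~ ., data = toy)

# glance() gives one row per model; keep only the two criteria of interest
fit_stats <- glance(main_fit) |>
  select(AIC, BIC)

fit_stats
```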

Exercise 5

Using the output from Exercise 3, fit a second model that includes only the predictors with p-values below 0.05, along with Age, PG, and the interaction between Age and PG.
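In R's formula syntax, Age * PG is shorthand for Age + PG + Age:PG, where Age:PG is the interaction term. The sketch below shows this on simulated data; toy is hypothetical, and MP is included only as an arbitrary stand-in for whichever predictors turn out to have small p-values in your own output.

```r
# Simulated toy data (hypothetical; not the NBA salary data)
set.seed(221)
toy <- data.frame(
  Age = sample(20:36, 60, replace = TRUE),
  MP = runif(60, min = 5, max = 36),
  PG = rep(c(TRUE, FALSE), each = 30)
)
toy$Salary <- 0.3 * toy$MP + ifelse(toy$PG, 0.9, 0.3) * (toy$Age - 20) +
  rnorm(60, sd = 2)

# Age * PG expands to the two main effects plus their interaction
reduced_fit <- lm(Salary ~ MP + Age * PG, data = toy)

# Writing the terms out explicitly gives the same model
reduced_fit_explicit <- lm(Salary ~ MP + Age + PG + Age:PG, data = toy)

names(coef(reduced_fit))
```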

Exercise 6

  • Compute AIC and BIC for the model from the previous exercise.

  • Which model do you choose based on AIC: the model from Exercise 3 or the model from Exercise 5?

  • Which model do you choose based on BIC?
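For both criteria, the model with the smaller value is preferred. One convenient way to compare is to stack the glance() output for the two models into a single table, as in the sketch below. The data and the two models are simulated and hypothetical, so which model wins here says nothing about the NBA data.

```r
library(tidyverse)
library(broom)

# Simulated toy data (hypothetical; not the NBA salary data)
set.seed(221)
toy <- tibble(
  Age = sample(20:36, 60, replace = TRUE),
  MP = runif(60, min = 5, max = 36),
  GP = sample(1:82, 60, replace = TRUE)
) |>
  mutate(Salary = 0.4 * MP + 0.2 * Age + rnorm(60, sd = 2))

full_fit <- lm(Salary ~ Age + MP + GP, data = toy)
reduced_fit <- lm(Salary ~ Age + MP, data = toy)

# One row per model; the names in the list become the model column
model_comparison <- bind_rows(
  list(full = glance(full_fit), reduced = glance(reduced_fit)),
  .id = "model"
) |>
  select(model, AIC, BIC)

model_comparison
```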

Exercise 7

“Manually” compute AIC and BIC for the model selected in the previous exercise.

Tip

\(\log L\) can be obtained from glance(model)$logLik
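A manual computation might look like the sketch below, which uses \(\text{AIC} = -2\log L + 2k\) and \(\text{BIC} = -2\log L + k\log(n)\), where \(n\) is the number of observations and \(k\) is the number of estimated parameters. The symbol \(k\) and the data frame toy are introduced here for illustration only. R counts the regression coefficients (including the intercept) plus the error variance in \(k\); if the definition used in class counts parameters differently, the manual values will differ from R's by a constant, which does not change which model is preferred.

```r
library(broom)

# Simulated toy data (hypothetical; not the NBA salary data)
set.seed(221)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- 1 + 2 * toy$x1 + rnorm(50)

fit <- lm(y ~ x1 + x2, data = toy)
fit_stats <- glance(fit)

log_lik <- fit_stats$logLik   # maximized log-likelihood
n <- fit_stats$nobs           # number of observations

# Parameter count as R uses it for lm():
# coefficients (including the intercept) plus the error variance
k <- length(coef(fit)) + 1

aic_manual <- -2 * log_lik + 2 * k
bic_manual <- -2 * log_lik + log(n) * k

# Compare the manual values with the ones reported by glance()
c(manual = aic_manual, glance = fit_stats$AIC)
c(manual = bic_manual, glance = fit_stats$BIC)
```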

Wrapping up

Important

Once you’ve completed the AE:

  • Render the document to produce the PDF with all of your work from today’s class.
  • Push all your work to your AE repo on GitHub. You’re done! 🎉

Acknowledgement

This AE is adapted from Model selection (by example) by Dr. Alex Fisher.