library(tidyverse)
library(knitr)
library(tidymodels)
nba_salaries <- read_csv("data/nba-salaries-2022-23.csv")
AE 05: Model selection
Go to the course GitHub organization and locate your ae-05 repo to get started.
Introduction
Today’s dataset contains player information and statistics for NBA players in the 2022-23 season. It is a subset of the NBA Player Salaries (2022-23 Season) dataset on Kaggle. The data originally come from the websites HoopsHype and Basketball Reference.
The goal of the analysis is to use player features and statistics to understand variability in salary.
Response
Salary: salary in millions of US dollars
Predictors
Position: player position
Age: age of player
GP: total number of games played in the season
MP: average minutes played per game
GS: number of games the player is put in at the start of the game
FG: average number of field goals (shots) attempted per game
FG%: percentage of shots made
2PA: average number of 2-point shots attempted per game
2P%: percentage of 2-point shots made
AST: average number of assists per game
AST%: percentage of teammate field goals a player assisted on while they were on the floor
Exercise 1
We are interested in comparing the salaries of players in the point guard position (PG) versus all other positions. Create a new indicator called PG that is TRUE if the player is a point guard and FALSE otherwise.
Use PG instead of Position for the remainder of the AE.
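One way to build a logical indicator is to compare a column against a value inside mutate(). The sketch below uses a small made-up data frame (toy_players is hypothetical, not the course data) just to show the pattern.

```r
library(dplyr)

# Made-up rows for illustration only; not taken from the course dataset.
toy_players <- data.frame(
  Position = c("PG", "SG", "C", "PG", "PF"),
  Age      = c(24, 31, 27, 29, 22)
)

# A comparison already returns TRUE/FALSE, so no if_else() is needed.
toy_players <- toy_players |>
  mutate(PG = Position == "PG")

# Quick check of how many rows fall in each group.
toy_players |> count(PG)
```

Exact matching only flags rows whose Position is exactly "PG", so it is worth checking the distinct values of Position in the real data before relying on it.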
Exercise 2
We will consider an interaction effect between Age and PG. Make a visualization that can be used to explore this potential effect. Based on the visualization, does the effect of Age appear to differ by PG?
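A common way to explore a possible interaction is a scatterplot with a separate fitted line for each group. The sketch below uses simulated data (the toy data frame and its values are assumptions made up for illustration), so the pattern it shows says nothing about the NBA data.

```r
library(ggplot2)

set.seed(1)
# Simulated data for illustration only; not the course dataset.
toy <- data.frame(
  Age = sample(19:38, 60, replace = TRUE),
  PG  = rep(c(TRUE, FALSE), each = 30)
)
toy$Salary <- 2 + 0.4 * toy$Age + rnorm(60, sd = 3)

# Mapping color to PG makes geom_smooth() fit one line per group.
age_pg_plot <- ggplot(toy, aes(x = Age, y = Salary, color = PG)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    x = "Age",
    y = "Salary (millions of US dollars)",
    color = "Point guard"
  )

age_pg_plot
```

Roughly parallel lines suggest the effect of Age is similar in both groups, while clearly different slopes suggest an interaction may be worth modeling.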
Exercise 3
Fit the main effects model using all predictors in the data to explain variability in Salary.
As a shortcut, remove Player Name and Position from the data and use the syntax Salary ~ . in lm(). The . tells lm() to use every remaining column (other than Salary) as a predictor.
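The shortcut can be sketched as follows on a simulated data frame. The toy data, its columns, and the object names model_data and main_fit are assumptions for illustration. Only the select() and Salary ~ . pattern carries over to the real data.

```r
library(dplyr)

set.seed(2)
# Simulated data for illustration only; not the course dataset.
toy <- data.frame(
  `Player Name` = paste("Player", 1:40),
  Position      = rep(c("PG", "SG", "C", "PF"), times = 10),
  Age           = sample(19:38, 40, replace = TRUE),
  MP            = runif(40, min = 5, max = 36),
  check.names   = FALSE  # keep the space in "Player Name"
)
toy$PG     <- toy$Position == "PG"
toy$Salary <- 1 + 0.3 * toy$Age + 0.5 * toy$MP + rnorm(40, sd = 2)

# Drop the columns that should not enter the model.
model_data <- toy |>
  select(-`Player Name`, -Position)

# "." expands to every column in model_data except the response.
main_fit <- lm(Salary ~ ., data = model_data)

names(coef(main_fit))
```

Column names that contain spaces need backticks inside select(), as with `Player Name` here.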
Exercise 4
Use the glance() function to compute AIC and BIC for the model in the previous exercise.
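As a minimal sketch, glance() from the broom package returns a one-row summary of a fitted model that includes AIC and BIC columns. The simulated data and the names fit and fit_stats below are assumptions for illustration.

```r
library(broom)

set.seed(3)
# Simulated data for illustration only; not the course dataset.
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- 1 + 2 * toy$x1 + rnorm(50)

fit <- lm(y ~ x1 + x2, data = toy)

# glance() returns one row of model-level statistics.
fit_stats <- glance(fit)

# Keep only the two information criteria.
fit_stats[, c("AIC", "BIC")]
```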
Exercise 5
Using the output from Exercise 3, fit a second model that includes only the predictors with p-values below 0.05, along with Age, PG, and the interaction between Age and PG.
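In an R model formula, Age * PG expands to Age + PG + Age:PG, so the main effects and the interaction are added together. The sketch below shows this on simulated data. The toy data frame, the choice of MP as the retained predictor, and the name int_fit are all assumptions for illustration.

```r
set.seed(4)
# Simulated data for illustration only; not the course dataset.
toy <- data.frame(
  Age = sample(19:38, 80, replace = TRUE),
  PG  = rep(c(TRUE, FALSE), times = 40),
  MP  = runif(80, min = 5, max = 36)
)
toy$Salary <- 1 + 0.2 * toy$Age + 0.4 * toy$MP +
  0.3 * toy$Age * toy$PG + rnorm(80, sd = 2)

# Age * PG is shorthand for Age + PG + Age:PG.
int_fit <- lm(Salary ~ MP + Age * PG, data = toy)

names(coef(int_fit))
```

The coefficient labeled Age:PGTRUE is the difference in the Age slope for point guards compared with all other players.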
Exercise 6
Compute AIC and BIC for the model from the previous exercise.
Which model do you choose based on AIC: the model from Exercise 3 or the model from Exercise 5?
Which model do you choose based on BIC?
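For both criteria, the model with the smaller value is preferred. One way to line the values up side by side is sketched below on simulated data. The models full_fit and reduced_fit and the comparison table are hypothetical stand-ins, not the models from the exercises.

```r
library(broom)

set.seed(5)
# Simulated data for illustration only; not the course dataset.
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
toy$y <- 1 + 2 * toy$x1 + rnorm(100)

full_fit    <- lm(y ~ x1 + x2 + x3, data = toy)
reduced_fit <- lm(y ~ x1, data = toy)

# One row per model; smaller AIC or BIC indicates the preferred model.
comparison <- data.frame(
  model = c("full", "reduced"),
  AIC   = c(glance(full_fit)$AIC, glance(reduced_fit)$AIC),
  BIC   = c(glance(full_fit)$BIC, glance(reduced_fit)$BIC)
)

comparison
```

The two criteria can disagree. BIC charges \(\log n\) per parameter instead of 2, so once \(n \geq 8\) it penalizes additional predictors more heavily than AIC does.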
Exercise 7
“Manually” compute AIC and BIC for the model selected in the previous exercise.
\(\log L\) can be obtained from glance(model)$logLik
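A manual calculation can be sketched with the usual definitions \(\text{AIC} = -2\log L + 2k\) and \(\text{BIC} = -2\log L + k\log n\), where \(k\) is the number of estimated parameters and \(n\) is the number of observations. The sketch assumes R's convention for lm(), which counts the error variance as a parameter, so \(k\) is the number of coefficients plus one. If the course counts only the regression coefficients, the manual values will differ from glance() by a constant. The data and object names are made up for illustration.

```r
library(broom)

set.seed(6)
# Simulated data for illustration only; not the course dataset.
toy <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(60)

fit <- lm(y ~ x1 + x2, data = toy)
fit_stats <- glance(fit)

log_lik <- fit_stats$logLik    # note the capital L in the column name
n <- nobs(fit)                 # number of observations used in the fit
k <- length(coef(fit)) + 1     # assumption: coefficients plus error variance

aic_manual <- -2 * log_lik + 2 * k
bic_manual <- -2 * log_lik + log(n) * k

# Compare the manual values with those reported by glance().
c(manual = aic_manual, glance = fit_stats$AIC)
c(manual = bic_manual, glance = fit_stats$BIC)
```

A constant shift in the penalty applies equally to every model fit to the same data, so it does not change which model has the smaller AIC or BIC.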
Wrapping up
Once you’ve completed the AE:
- Render the document to produce the PDF with all of your work from today’s class.
- Push all your work to your AE repo on GitHub. You’re done! 🎉
Acknowledgement
This AE is adapted from Model selection (by example) by Dr. Alex Fisher.