Lab 06: Maximum likelihood estimation

Due date

This lab is due on Tuesday, March 24 at 11:59pm. To be considered on time, the following must be done by the due date:

Final .qmd and .pdf files pushed to your team’s GitHub repo
Final .pdf file submitted on Gradescope

Introduction

In this lab you will compute maximum likelihood estimates for regression models looking at the relationship between features of penguins living in Palmer Archipelago in Antarctica. You will also explore properties of maximum likelihood estimators and how they are related to least-squares estimators.

Learning goals

By the end of the lab you will be able to…

compute estimates for \(\hat{\boldsymbol{\beta}}\) and \(\sigma^2_{\epsilon}\) using maximum likelihood estimation.
understand how assumptions of linear regression connect to maximum likelihood estimation.
describe the similarities and differences between maximum likelihood estimators and least-squares estimators.
evaluate which estimation procedure may be preferable in a given analysis scenario.

Getting started

Go to the sta221-sp26 organization on GitHub. Click on the repo with the prefix lab-06. It contains the starter documents you need to complete the lab.
Clone the repo and start a new project in RStudio. See the Lab 00 instructions for details on cloning a repo and starting a new project in R.
Each person on the team should clone the repository and open a new project in RStudio. Throughout the lab, each person should get a chance to make commits and push to the repo.

Packages

You will use the following packages in today’s lab. Add other packages as needed.

library(tidyverse)
library(tidymodels)
library(knitr)

Data

Today’s dataset include information about characteristics of three species of penguins living in Palmer Archipelago in Antarctica. The data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network (Gorman, Williams, and Fraser 2014).

The data are in the file palmer-penguins.csv in the data folder. This dataset is originally from the penguins data frame in the palmerpenguins R package with observations that have missing values removed. This analysis will focus on the following variables:

bill_depth_mm: bill depth in millimeters
bill_length_mm: bill length in millimeters
species: penguin species (Adélie, Chinstrap and Gentoo)

Click here to see the full data dictionary.

Exercises

Goal: The goal of this analysis is to use bill length to understand variability in bill depth, after accounting for species.

Exercise 1

We’ll start with exploratory data analysis focused on the relationship between the response and predictor variables.

Visualize the relationship between the response variable bill_depth_mm and predictor bill_length_mm .
Now, visualize the relationship between bill_depth_mm and bill_length_mm by species. Use geom_smooth(method = "lm", se = FALSE) to add lines and more clearly visualize the relationship for each species.
Based on these visualizations, why is it important to include species when in the model of the relationship between bill depth and length? Briefly explain.
Based on these visualizations, would you include an interaction term between the two predictors? Briefly explain?

Exercise 2

We will fit the main effects model using bill length and species to understand variability in the bill depth.

Write the form of the statistical (population-level) model in matrix form.
Write the dimensions for \(\mathbf{y}, \mathbf{X}, \boldsymbol{\beta}, \boldsymbol{\epsilon}\) specific for this problem.

Exercise 3

Consider the regression model described in Exercise 2.

Write the likelihood function \(L(\boldsymbol{\beta}, \sigma^2_{\epsilon} | \mathbf{y}, \mathbf{X})\) in matrix form.
Describe how each of the four model assumptions is necessary for the form of the likelihood function.

Exercise 4

Briefly explain how the process of finding the maximum likelihood estimators for the likelihood function in Exercise 3 is related to the process of finding the least-squares estimators for the model in Exercise 2.

Exercise 5

For the next few exercises, we will compare the results of the maximum likelihood and least-squares procedures.

Fit the least-squares regression model described in Exercise 2. Neatly display the results using three digits.
Describe the estimated effect of bill length on bill depth in the context of the data.
Describe the estimated effect of species on bill depth in the context of the data. Include discussion about whether there is statistical evidence of a difference between species.

Exercise 6

Use matrix/vector operations to compute the maximum likelihood estimators \(\tilde{\beta}\) for the model in Exercise 2.
How do these estimators compare to the least-squares estimators in the previous exercise?

Exercise 7

The maximum likelihood estimation procedure also produces an estimator for the variance about the regression line, \(\sigma^2_\epsilon\), which we can write as

\[ \tilde{\sigma}^2_{\epsilon} = \frac{1}{n} \mathbf{e}^\mathsf{T}\mathbf{e} \]

We know that the maximum likelihood estimator and least-squares estimator for \(\sigma^2_{\epsilon}\) are not equal. Additionally, the least-squares estimator \(\hat{\sigma}^2_{\epsilon}\) is unbiased. We want to find a scaling factor \(c\) such that the maximum likelihood estimator is unbiased.

Using the data and regression estimates for this analysis, compute both the maximum likelihood and least-squares estimators for \(\sigma_{\epsilon}^2\), and then find \(c\) by solving the equation \[ \hat\sigma^2_{\epsilon} = c \cdot \tilde\sigma^2_{\epsilon} \]You can do this last step either computationally or algebraically.

Exercise 8

Now we will look into the last property of the maximum likelihood estimator for \(\boldsymbol{\beta}\), and thus of least-squares estimator - asymptotic normality.

In words, this property says that, when the number of samples \(n\) is large compared to the number of predictors \(p\), the maximum likelihood estimator \(\tilde{\boldsymbol\beta}\) follows a (multivariate) normal distribution \(N\big(\boldsymbol\beta, \sigma_\epsilon^2 (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\big)\). Let’s use this to construct an approximate confidence interval for \(\beta_j\), the coefficient for speciesChinstrap.

Use \(\tilde{\sigma}^2_{\epsilon}\), the maximum likelihood estimator, to compute the approximate confidence interval for \(\beta_j\) . The approximate 95% confidence interval may be computed as

\[ \tilde{\beta}_j\pm 2 \times SE(\tilde{\beta}_j) \]

Then interpret this interval in the context of the data.

Exercise 9

Compute the exact (based on the \(t\)-distribution) confidence interval for \(\beta_j\), the coefficient of speciesChinstrap.
Compare the center and width of the this exact interval with the one you computed in Exercise 8. Do they differ? By how much? Which one is wider, indicating more uncertainty?

Exercise 10

To wrap up, we have seen that both the OLS and the maximum likelihood procedures for linear regression produce the same coefficient estimates, but lead to different estimators for the variance \(\sigma_\epsilon^2\) and allow for different types of uncertainty quantification.

Based on the work in this lab, do you think performing inference based on either method would have changed your conclusion about the the relationship between bill depth and Chinstrap species?

Submission

You will submit the PDF documents for labs, homework, and exams in to Gradescope as part of your final submission.

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

To submit your assignment:

Access Gradescope through the menu on the STA 221 Canvas site.
Click on the assignment, and you’ll be prompted to submit it.
Select the name of every team member.
Mark the pages for the “Exercises” question.
Select the first page of your .PDF submission to be associated with the “Team agreement” and “Workflow & formatting” questions.

Grading (10 pts)

This lab will be graded based on completion and workflow & formatting. The point breakdown is as follows:

**Completion:** Points will be awarded for completion based on the following:
Component	Points
Complete exercises	8
Workflow & formatting	2

8 pts: Complete all exercises
7 pts: Complete 8 - 9 exercises
6 pts: Complete 6 - 7 exercises
5 pts: Complete 4 - 5 exercises
4 pts: Complete 2 - 3 exercises
0 pts: Complete < 2 exercises

Workflow & formatting

The “Workflow & formatting” grade is to assess the reproducible workflow and collaboration. This includes having at least one meaningful commit from each team member and updating the team name and date in the YAML.

2 pts: Meet all criteria
1 pt: Meet some criteria
0 pt: Meet no criteria

References

Gorman, Kristen B., Tony D. Williams, and William R. Fraser. 2014. “Ecological Sexual Dimorphism and Environmental Variability Within a Community of Antarctic Penguins (Genus Pygoscelis).” Edited by André Chiaradia. PLoS ONE 9 (3): e90081. https://doi.org/10.1371/journal.pone.0090081.