HW 01: Simple linear regression

Lemurs

Due date

This assignment is due on Thursday, January 29 at 11:59pm. To be considered on time, the following must be done by the due date:

Final .qmd and .pdf files pushed to your GitHub repo
Final .pdf file submitted on Gradescope

Introduction

In this assignment, you will use simple linear regression to explore the growth rate of young lemurs in the Duke Lemur Center. You will also explore the mathematical properties of linear regression models.

Learning goals

In this assignment, you will…

use matrix operations to show results about simple linear regression.
conduct exploratory data analysis.
fit and interpret simple linear regression models.
evaluate model fit.
continue developing a workflow for reproducible data analysis.

Getting started

Go to the sta221-sp26 organization on GitHub. Click on the repo with the prefix hw-01. It contains the starter documents you need to complete the assignment.
Clone the repo and start a new project in RStudio. See the Lab 00 instructions for details on cloning a repo and starting a new project in R.

Packages

The following packages are used in this assignment:

library(tidyverse)
library(tidymodels)
library(knitr)

# load other packages as needed

Conceptual exercises

Instructions

The conceptual exercises are focused on explaining concepts and showing results mathematically. Show your work for each question.

You may write the answers and associated work for conceptual exercises by hand or type them in your Quarto document.

Exercise 1

In lecture, we introduced \(\mathbf{H}\), the hat matrix, as a projection matrix that projects \(\mathbf{y}\) onto \(\text{Col}(\mathbf{X})\). Here we will show some properties of \(\mathbf{H}\).

Show that \(\mathbf{H}\) is symmetric \((\mathbf{H}^\mathsf{T} = \mathbf{H})\).
Show that \(\mathbf{H}\) is idempotent \((\mathbf{H}^2 = \mathbf{H})\).
Show that all eigenvalues of \(\mathbf{H}\) are 0 or 1.
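A numerical check is not a proof, but it can help you see what the three properties look like before you show them algebraically. The sketch below assumes the usual definition \(\mathbf{H} = \mathbf{X}(\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\) and uses simulated toy data (the object names and values are hypothetical, not part of the assignment data).

```r
# Simulated toy data (hypothetical; not the lemurs data)
set.seed(221)
n <- 8
x <- runif(n, 0, 10)
X <- cbind(1, x)                      # design matrix with an intercept column

# Hat matrix, assuming H = X (X'X)^{-1} X'
H <- X %*% solve(t(X) %*% X) %*% t(X)

# Symmetry: t(H) should equal H up to floating-point error
all.equal(t(H), H)

# Idempotence: H %*% H should equal H up to floating-point error
all.equal(H %*% H, H)

# Eigenvalues: each should be numerically close to 0 or 1
round(eigen(H, symmetric = TRUE)$values, 8)
```

Your written answer should still hold for any \(\mathbf{X}\) with full column rank, not only for this simulated example.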

Exercise 2

This exercise is adapted from Casella and Berger (2024).

Suppose there are \(n\) observations \((x_1, y_1), \ldots, (x_n, y_n)\), such that the relationship between \(X\) and \(Y\) can be summarized as \[ y_i = \beta x_i^2 + \epsilon_i \hspace{8mm} \epsilon_i \sim N(0,\sigma^2_{\epsilon}) \]

Find \(\hat{\beta}\), the least-squares estimator of \(\beta\).
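Once you have derived a closed-form expression, one way to sanity-check it is to compare it with a numerical minimizer on simulated data. The sketch below is an illustration under stated assumptions: the true coefficient, sample size, and search interval are arbitrary choices, and `sse` is a hypothetical helper name.

```r
# Simulated data from the model y_i = beta * x_i^2 + epsilon_i
set.seed(221)
n <- 50
x <- runif(n, 0, 3)
y <- 2 * x^2 + rnorm(n, sd = 0.5)     # true beta = 2, chosen arbitrarily

# Sum of squared errors as a function of a candidate value b
sse <- function(b) sum((y - b * x^2)^2)

# Numerical minimizer over an assumed search interval
beta_num <- optimize(sse, interval = c(-10, 10))$minimum

# lm() fit of the same no-intercept model, for comparison
beta_lm <- unname(coef(lm(y ~ 0 + I(x^2))))

c(numerical = beta_num, lm = beta_lm)
```

If your derived formula is correct, evaluating it on `x` and `y` should agree with both values up to the tolerance of the numerical search.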

Exercise 3

In class we used the sum of squared errors, \(\boldsymbol{\epsilon}^\mathsf{T}\boldsymbol{\epsilon}\), to find the least-squares estimator, \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\mathbf{y}\). To confirm that this is the least-squares estimator, we now need to show that it minimizes the sum of squared errors.

If the Hessian matrix \(\frac{\partial^2}{\partial \boldsymbol{\beta}^2} \boldsymbol{\epsilon}^\mathsf{T}\boldsymbol{\epsilon}\) is positive definite, then we know we have found a minimum.

Show that \(\frac{\partial^2}{\partial \boldsymbol{\beta}^2}\boldsymbol{\epsilon}^\mathsf{T}\boldsymbol{\epsilon} = 2\mathbf{X}^\mathsf{T}\mathbf{X}\) is positive definite.
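To build intuition for what positive definiteness means here, the sketch below evaluates the quadratic form \(\mathbf{z}^\mathsf{T}(2\mathbf{X}^\mathsf{T}\mathbf{X})\mathbf{z}\) for a few nonzero vectors and inspects the eigenvalues on simulated data. It assumes \(\mathbf{X}\) has full column rank; the data and object names are hypothetical, and a numerical check does not replace the proof.

```r
# Simulated design matrix with an intercept and one predictor
set.seed(221)
n <- 20
X <- cbind(1, rnorm(n))               # assumed to have full column rank

hess <- 2 * t(X) %*% X                # Hessian of the sum of squared errors

# Eigenvalues of a positive definite matrix are all strictly positive
eig_vals <- eigen(hess, symmetric = TRUE)$values
eig_vals

# Quadratic form z' (2X'X) z for a few arbitrary nonzero vectors z
Z <- matrix(rnorm(10), nrow = 2)      # five random vectors, one per column
quad_forms <- apply(Z, 2, function(z) drop(t(z) %*% hess %*% z))
quad_forms
```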

Exercise 4

This exercise is adapted from Montgomery, Peck, and Vining (2021).

Prove that the maximum value of \(R^2\) must be less than 1 if the data set contains observations such that there are different observed values of the response for the same value of the predictor (e.g., the data set contains observations \((x_i, y_i)\) and \((x_j, y_j)\) such that \(x_i = x_j\) and \(y_i \neq y_j\)).
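A small made-up data set can illustrate the situation described above. In the sketch below, the values of `x` and `y` are hypothetical; the predictor values 1 and 3 each appear twice with different responses.

```r
# Made-up data: x = 1 and x = 3 are each observed twice with different y
x <- c(1, 1, 2, 3, 3, 4)
y <- c(2.0, 2.6, 3.9, 6.1, 5.5, 8.2)

fit <- lm(y ~ x)

# R-squared for the fitted line
r2 <- summary(fit)$r.squared
r2

# Residuals for the two observations that share x = 1
resid(fit)[x == 1]
```

Looking at the residuals for the repeated predictor values may suggest how to structure the general argument.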

Exercise 5

Show that the sum of squared residuals (SSR) can be written as the following:

\[ \mathbf{y}^\mathsf{T}\mathbf{y} - \hat{\boldsymbol{\beta}}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{y} \]
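Before working through the algebra, you can check numerically that the two expressions agree. The sketch below uses simulated data with hypothetical names and parameter values; it illustrates the identity on one data set and is not a substitute for the derivation.

```r
# Simulated toy data (hypothetical values)
set.seed(221)
n <- 30
x <- runif(n, 0, 24)
y <- 500 + 100 * x + rnorm(n, sd = 200)
X <- cbind(1, x)

# Least-squares estimate from the normal equations
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# SSR computed directly from the residuals
ssr_resid <- sum((y - X %*% beta_hat)^2)

# SSR computed from the expression y'y - beta_hat' X'y
ssr_formula <- drop(t(y) %*% y - t(beta_hat) %*% t(X) %*% y)

all.equal(ssr_resid, ssr_formula)
```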

Applied exercises

Instructions

The applied exercises focus on using the methods from the course to analyze data and answer questions about real-world phenomena.

All work for the applied exercises must be typed in your Quarto document. Remember to render, commit, and push your work to GitHub regularly.

Write all narrative using complete sentences and include informative axis labels / titles on visualizations.

Data: Lemurs

The data used in this analysis include measurements and other characteristics for lemurs from the Eulemur rufus (ERUF), Propithecus coquereli (PCOQ), and Varecia rubra (VRUB) taxa who have lived in the Duke Lemur Center. Though the lemurs are measured regularly while at the center, this data set includes one randomly selected measurement for each lemur taken when they were 24 months old or younger. These data were originally featured as part of the TidyTuesday weekly data visualization challenge in August 2021.

lemurs <- read_csv("https://intro-regression.github.io/data/lemurs-sample-young.csv")

The analysis will focus on the following variables:

age: Age of the lemur in months
weight: Weight of the lemur in grams

Analysis goal

The goal of this analysis is to use linear regression to explore the growth rate of young lemurs (age 24 months or younger). More specifically, we want to use age to explain variability in weight.

Exercise 6

Let’s start by exploring the response and predictor variables.

Visualize the distribution of weight. Describe the distribution.
Visualize the distribution of age. Describe the distribution.

Tip

The description should include shape, center, spread, and presence of outliers. Use specific values in your description. See Section 3.4.2 of Introduction to Regression Analysis for more information about describing univariate distributions. See the ggplot2 reference for example code and plots.
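To get specific values for center, spread, and potential outliers, summary statistics can complement the plot. The sketch below uses base R on a simulated right-skewed variable; `w` and its distribution are hypothetical stand-ins, and the 1.5 × IQR rule is one common convention for flagging outliers, not a requirement of this assignment.

```r
# Simulated right-skewed variable (hypothetical stand-in for a measurement)
set.seed(221)
w <- rgamma(200, shape = 4, scale = 300)

# Center and spread
c(mean = mean(w), median = median(w), sd = sd(w), IQR = IQR(w))
range(w)

# Flag potential outliers using the 1.5 * IQR rule
q <- unname(quantile(w, c(0.25, 0.75)))
fences <- c(lower = q[1] - 1.5 * IQR(w), upper = q[2] + 1.5 * IQR(w))
outliers <- w[w < fences["lower"] | w > fences["upper"]]
length(outliers)
```

When the distribution is skewed, the median and IQR are usually more informative than the mean and standard deviation.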

Exercise 7

Visualize the relationship between weight and age. Do you think a linear model is a reasonable choice to summarize the relationship between the two variables? Briefly explain.

Exercise 8

We will fit a model using age to explain variability in weight. The model takes the form

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]

a. State the dimensions of \(\mathbf{y}\), \(\mathbf{X}\), \(\boldsymbol{\beta}\), \(\boldsymbol{\epsilon}\) for this analysis. Your answer should have exact values given this data set.

b. Estimate the regression coefficients \(\hat{\boldsymbol{\beta}}\) in R using the matrix representation. Show the code used to get the answer.

c. Check your results from part (b) by using the lm function to fit the model. Neatly display your results using 3 digits.
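As a reference for the general pattern, the sketch below works through the matrix calculation on simulated toy data and compares it with `lm()`. The data frame `toy` and its columns `x` and `y` are hypothetical; you will need to adapt the idea to the lemurs data and present your own results.

```r
# Simulated toy data (hypothetical; not the lemurs data)
set.seed(221)
toy <- data.frame(x = runif(40, 0, 24))
toy$y <- 300 + 90 * toy$x + rnorm(40, sd = 150)

# Design matrix with an intercept column, and the response vector
X <- model.matrix(~ x, data = toy)
y <- toy$y
dim(X)                                # n rows, 2 columns

# Least-squares estimate (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# Same model fit with lm(), displayed side by side to 3 decimal places
fit <- lm(y ~ x, data = toy)
round(cbind(matrix = drop(beta_hat), lm = coef(fit)), 3)
```

For the final display in your document, a function such as `kable()` with its `digits` argument is one option for neat output.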

Exercise 9

Compute \(R^2\) for the model in the previous exercise and interpret it in the context of the data.
Compute \(RMSE\) for the model from the previous exercise and interpret it in the context of the data.
Comment on the model fit based on \(R^2\) and \(RMSE\).
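The two statistics can be computed directly from the residuals. The sketch below uses simulated toy data with hypothetical names, and it assumes RMSE is defined as the square root of the mean squared residual (dividing by \(n\)); check the course notes if a different denominator was used in lecture.

```r
# Simulated toy data (hypothetical; not the lemurs data)
set.seed(221)
toy <- data.frame(x = runif(40, 0, 24))
toy$y <- 300 + 90 * toy$x + rnorm(40, sd = 150)

fit <- lm(y ~ x, data = toy)
res <- resid(fit)

# R-squared: 1 - (sum of squared residuals) / (total sum of squares)
r2 <- 1 - sum(res^2) / sum((toy$y - mean(toy$y))^2)

# RMSE, assuming the definition sqrt(mean(residual^2))
rmse <- sqrt(mean(res^2))

c(r_squared = r2, rmse = rmse)
```

Note that \(R^2\) is unitless, while RMSE is in the units of the response, which matters for how each is interpreted.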

Exercise 10

a. Interpret the slope for the model fit in Exercise 8 in the context of the data.

b. A lemur named Strawberry (dlc_id = 6582) weighed 475.0 grams when she was 3.19 months old. What is the predicted weight for this lemur based on the model? What is the residual?
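The general pattern for a prediction and its residual is sketched below on simulated toy data. The data frame `toy`, the new predictor value, and the observed response are all hypothetical values chosen for illustration.

```r
# Simulated toy data (hypothetical; not the lemurs data)
set.seed(221)
toy <- data.frame(x = runif(40, 0, 24))
toy$y <- 300 + 90 * toy$x + rnorm(40, sd = 150)

fit <- lm(y ~ x, data = toy)

# Hypothetical new observation: predictor value 5, observed response 700
new_obs <- data.frame(x = 5)
observed <- 700

# Predicted response from the fitted model
predicted <- unname(predict(fit, newdata = new_obs))

# Residual = observed - predicted
residual <- observed - predicted

c(predicted = predicted, residual = residual)
```

A positive residual means the model under-predicted the observed value, and a negative residual means it over-predicted.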

AI Disclosure

Did you use an LLM / Generative AI tool to complete this assignment? If not, copy and paste the first option in your Quarto document. Otherwise, copy and paste all statements that describe how you used it. The purpose of the disclosure is for you to reflect on how you’re using AI in this course. It also helps me learn how students are most effectively using AI.

I didn’t use an LLM / Generative AI tool.
I asked it to clarify one or more exercises.
I asked it clarifying questions to better understand a concept.
I asked it to help write code to complete an exercise.
I gave it my code and asked it to help me fix it.
I asked it about an error or why code would do something I didn’t want.
I pasted the exercise prompt into an AI tool and asked for help, but I wrote my answer myself.
I pasted the exercise prompt into an AI tool and copied and pasted at least some of the answer into my Quarto document.
Other:______

Submission

Important

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember – you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

If you write your responses to conceptual exercises by hand, you will need to combine your written work with the completed PDF for the applied exercises before submitting on Gradescope.

Instructions to combine PDFs:

Preview (Mac): support.apple.com/guide/preview/combine-pdfs-prvw43696/mac
Adobe (Mac or PC): helpx.adobe.com/acrobat/using/merging-files-single-pdf.html
- Get free access to Adobe Acrobat as a Duke student: oit.duke.edu/help/articles/kb0030141/

To submit your assignment:

Access Gradescope through the menu on the STA 221 Canvas site.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your assignment should be associated with at least one question (i.e., should be “checked”).
Select the first page of your .PDF submission to be associated with the “Workflow & formatting” section.

Grading (50 points)

Component	Points
Ex 1	6
Ex 2	4
Ex 3	4
Ex 4	4
Ex 5	4
Ex 6	5
Ex 7	5
Ex 8	6
Ex 9	4
Ex 10	4
Completing AI Disclosure	1
Workflow & formatting	3

The “Workflow & formatting” grade is to assess the reproducible workflow and document format for the applied exercises. This includes having at least 3 informative commit messages, a neatly organized document with readable code, along with your name and the date updated in the YAML.

References

Casella, George, and Roger Berger. 2024. Statistical Inference. Chapman and Hall/CRC.

Montgomery, Douglas C, Elizabeth A Peck, and G Geoffrey Vining. 2021. Introduction to Linear Regression Analysis. John Wiley & Sons.