HW 04: Logistic regression

Due date

This assignment is due on Thursday, April 9 at 11:59pm. To be considered on time, the following must be done by the due date:

Final .qmd and .pdf files pushed to your GitHub repo
Final .pdf file submitted on Gradescope

Introduction

In this assignment, you will use logistic regression to describe the relationship between a binary response variable and multiple predictor variables and to classify observations. You will also derive results for logistic regression and some properties of least-squares estimators for linear regression.

Learning goals

In this assignment, you will…

Derive properties of the least-squares estimator for linear regression
Derive results for logistic regression
Use logistic regression to explore the relationship between a binary response variable and multiple predictor variables
Conduct exploratory data analysis for logistic regression
Interpret the coefficients of a logistic regression model
Use statistics to compare and evaluate logistic regression models

Getting started

Go to the sta221-sp26 organization on GitHub. Click on the repo with the prefix hw-04. It contains the starter documents you need to complete the assignment.
Clone the repo and start a new project in RStudio. See the Lab 00 instructions for details on cloning a repo and starting a new project in R.

Packages

The following packages are used in this assignment:

library(tidyverse)
library(tidymodels)
library(knitr)
library(pROC)

# load other packages as needed

Conceptual exercises

Instructions

The conceptual exercises focus on explaining concepts and deriving results mathematically. Show your work for each question.

You may write the answers and associated work for conceptual exercises by hand or type them in your Quarto document.

Exercise 1

Suppose we have a linear regression model of the form

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \hspace{8mm} \boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2_{\epsilon}\mathbf{I}) \]

In class, we showed that the least-squares estimator \(\hat{\boldsymbol{\beta}}\) is a consistent estimator by showing \(\lim_{n \rightarrow \infty} Var(\hat{\boldsymbol{\beta}}) = 0\) and \(\lim_{n \rightarrow \infty}Bias(\hat{\boldsymbol{\beta}}) = 0\). Here, we will show that \(\hat{\boldsymbol{\beta}}\) is a consistent estimator using the formal definition of consistency.

To do so, show \(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} \rightarrow \mathbf{0}\) as \(n \rightarrow \infty\).

Tip

Hint: Recall from HW 02 that \(\hat{\boldsymbol{\beta}}\) can be expressed in terms of \(\boldsymbol{\beta}\).
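
A simulation does not replace the proof, but it can help build intuition for what the result claims. The following is a minimal sketch in base R, assuming arbitrary true coefficients, sample sizes, and a seed chosen only for illustration; the helper `estimation_error()` is hypothetical and not part of the starter code.

```r
# Simulated illustration only: the true coefficients, sample sizes, and seed
# are arbitrary assumptions, not values from the assignment.
set.seed(221)
beta_true <- c(1, -2)                       # assumed true (intercept, slope)

estimation_error <- function(n) {
  X <- cbind(1, rnorm(n))                   # design matrix with an intercept column
  y <- drop(X %*% beta_true) + rnorm(n)     # errors drawn from N(0, 1)
  beta_hat <- solve(crossprod(X), crossprod(X, y))  # least-squares estimate
  max(abs(beta_hat - beta_true))            # largest absolute coordinate error
}

sample_sizes <- c(50, 500, 5000, 50000, 500000)
errors <- sapply(sample_sizes, estimation_error)
data.frame(n = sample_sizes, max_abs_error = errors)
```

In a single run the error need not decrease at every step, but it should be very small for the largest sample sizes.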

Exercise 2

Show that the log-likelihood for the logistic regression model can be written as

\[\log L (\boldsymbol{\beta}|\mathbf{X}, \mathbf{y}) = \sum_{i=1}^n \mathbf{y}_i \mathbf{x}_i^\mathsf{T} \boldsymbol\beta - \sum_{i=1}^n \log(1+ \exp\{\mathbf{x}_i^\mathsf{T} \boldsymbol{\beta}\})\]

Provide a brief explanation for each step of your derivation.
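
Before or after deriving the result, you may find it useful to check the target expression numerically. The sketch below, which assumes simulated data and an arbitrary coefficient vector, compares the expression above with the Bernoulli log-likelihood computed directly by `dbinom()`. It is a sanity check, not a derivation.

```r
# Simulated data; the design matrix and coefficients are arbitrary assumptions.
set.seed(221)
n <- 200
X <- cbind(1, rnorm(n), rbinom(n, 1, 0.5))   # hypothetical design matrix
beta <- c(-0.5, 1, 0.8)                      # arbitrary coefficient vector
eta <- drop(X %*% beta)                      # linear predictor x_i' beta
y <- rbinom(n, 1, plogis(eta))               # simulated binary response

# Form stated in the exercise
loglik_formula <- sum(y * eta) - sum(log(1 + exp(eta)))

# Bernoulli log-likelihood computed directly from the probabilities
loglik_bernoulli <- sum(dbinom(y, size = 1, prob = plogis(eta), log = TRUE))

all.equal(loglik_formula, loglik_bernoulli)
```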

Exercise 3

Consider \(\log L\), the log-likelihood for logistic regression.

(a) Show that \(\frac{\partial}{\partial \boldsymbol{\beta}} \log L\) can be written in the following way:

\[ \mathbf{X}^\mathsf{T}\mathbf{y} - \mathbf{X}^\mathsf{T}\boldsymbol{\pi} \]

where \(\boldsymbol{\pi} = [\pi_1, \ldots, \pi_n]^\mathsf{T}\).

Tip

Note: You can start with the derivative of the log-likelihood function in the notes and show / explain why the above representation makes sense.

(b) Show that the Hessian for the log-likelihood of logistic regression, \(\frac{\partial^2}{\partial \boldsymbol{\beta}\partial\boldsymbol{\beta}^\mathsf{T}}\log L\), can be written as

\[ -\mathbf{X}^\mathsf{T}\mathbf{V}\mathbf{X} \]

where \(\mathbf{V}\) is a diagonal matrix, such that \(V_{ii} = \pi_i(1 - \pi_i)\), the estimated variance for the \(i^{th}\) observation.

Tip

Recall the Hessian matrix is the square matrix of second partial derivatives.

(c) The Hessian in part (b) describes the curvature of the log-likelihood function (i.e., a larger magnitude of the Hessian corresponds to a steeper peak). Describe how the curvature of the log-likelihood is related to \(Var(\hat{\boldsymbol{\beta}})\), the variance of the estimated coefficients.
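
The two matrix expressions in this exercise can be checked numerically against finite-difference derivatives of the log-likelihood. The following is a sketch under stated assumptions: the data are simulated, the evaluation point `beta0` is arbitrary, and `log_lik()` is a hypothetical helper written for this check. `optimHess()` is the base R function for a numerically differentiated Hessian.

```r
# Simulated data; nothing here comes from the assignment's data set.
set.seed(221)
n <- 300
X <- cbind(1, rnorm(n), rnorm(n))             # hypothetical design matrix
y <- rbinom(n, 1, plogis(drop(X %*% c(0.3, -1, 0.5))))

log_lik <- function(beta) {
  eta <- drop(X %*% beta)
  sum(y * eta) - sum(log(1 + exp(eta)))
}

beta0 <- c(0.1, -0.4, 0.2)                    # arbitrary evaluation point
pi_vec <- plogis(drop(X %*% beta0))           # pi_i evaluated at beta0

# Closed forms stated in the exercise
gradient_closed <- drop(t(X) %*% y - t(X) %*% pi_vec)
hessian_closed <- -t(X) %*% diag(pi_vec * (1 - pi_vec)) %*% X

# Central finite-difference gradient, one coordinate at a time
h <- 1e-6
gradient_numeric <- sapply(seq_along(beta0), function(j) {
  step <- replace(numeric(length(beta0)), j, h)
  (log_lik(beta0 + step) - log_lik(beta0 - step)) / (2 * h)
})

# Numerically differentiated Hessian of the same function
hessian_numeric <- optimHess(beta0, log_lik)

all.equal(gradient_closed, gradient_numeric, tolerance = 1e-5)
all.equal(unname(hessian_closed), unname(hessian_numeric), tolerance = 1e-4)
```

Agreement at one point does not prove the identities, but a mismatch would signal an algebra or coding error.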

Exercise 4

Adapted from an exercise in Agresti (2013).

Berry (2001) examined the effect of a player’s draft position among the pool of potential players in a given year on the probability of eventually being named an all star.

Let \(d\) be the draft position \((d = 1, 2, 3, \ldots)\) and \(\pi\) be the probability of eventually being named an all star. The researcher modeled the relationship between \(d\) and \(\pi\) using the following model:

\[ \log\Big(\frac{\pi_i}{1-\pi_i}\Big) = \beta_0 + \beta_1 \log d_i \]

Using this model, show that the odds of being named an all star are \(e^{\beta_0}d^{\beta_1}\). Then, show how to calculate \(\pi_i\) based on this model.
Show that the odds of being named an all star for a first draft pick are \(e^{\beta_0}\).
In the study, Berry reported that for professional basketball \(\hat{\beta}_0 = 2.3\) and \(\hat{\beta}_1 = -1.1\), and for professional baseball \(\hat{\beta}_0 = 0.7\) and \(\hat{\beta}_1 = -0.6\). Explain why this suggests that (1) being a first draft pick is more crucial for being an all star in basketball than in baseball and (2) players picked in high draft positions are relatively less likely to be all stars.
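
To support your explanation, it may help to tabulate what the fitted models imply at a few draft positions. The sketch below uses the estimates reported above; the draft positions and the object names (`coefs`, `predictions`) are arbitrary choices for illustration.

```r
# Estimates as reported in the exercise text
coefs <- data.frame(
  sport = c("basketball", "baseball"),
  b0 = c(2.3, 0.7),
  b1 = c(-1.1, -0.6)
)

draft_positions <- c(1, 2, 5, 10, 30)         # arbitrary positions for illustration

predictions <- do.call(rbind, lapply(seq_len(nrow(coefs)), function(i) {
  log_odds <- coefs$b0[i] + coefs$b1[i] * log(draft_positions)
  data.frame(
    sport = coefs$sport[i],
    d = draft_positions,
    odds = exp(log_odds),                     # odds of being named an all star
    prob = plogis(log_odds)                   # plogis() is R's inverse-logit function
  )
}))
predictions
```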

Applied exercises

Instructions

The applied exercises focus on using the methods from the course to analyze data and answer questions about real-world phenomena.

All work for the applied exercises must be typed in your Quarto document. Remember to render, commit, and push your work to GitHub regularly.

Write all narrative responses using complete sentences. Include informative axis labels and titles on all visualizations.

Data: Understanding pro-environmental behavior

Ibanez and Roussel (2022) conducted an experiment to understand the impact of watching a nature documentary on pro-environmental behavior. The researchers randomly assigned the 113 participants to watch a video about architecture in NYC (control) or a video about Yellowstone National Park (treatment). As part of the experiment, participants played a game in which they had an opportunity to donate to an environmental organization.

The data set is available in nature-experiment.csv in the data folder. We will use the following variables:

donation_binary:
- 1 - participant donated to environmental organization
- 0 - participant did not donate
age: Age in years
gender: Participant’s reported gender
- 1 - male
- 0 - non-male
treatment:
- “URBAN (T1)” - the control group
- “NATURE (T2)” - the treatment group
nep_high:
- 1 - score of 4 or higher on the New Ecological Paradigm (NEP)
- 0 - score less than 4

Tip

See the Introduction and Methods sections of Ibanez and Roussel (2022) for more detail about the variables.

Exercise 5

Visualize the relationship between donating and treatment. Use the visualization to describe the relationship between the two variables.
Visualize the relationship between donating and age. Use the visualization to describe the relationship between the two variables.
We would like to use the mean-centered value of age in the model. Create a new variable age_cent that contains the mean-centered ages.
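
Mean-centering subtracts the sample mean from every value, so a centered value of 0 corresponds to the average age. A minimal sketch on made-up data follows; the `toy` data frame is hypothetical and stands in for the data you read from nature-experiment.csv.

```r
# Toy data standing in for nature-experiment.csv; the ages are made up.
toy <- data.frame(age = c(19, 22, 25, 31, 43))

# Subtract the sample mean so that age_cent = 0 corresponds to the average age
toy$age_cent <- toy$age - mean(toy$age)
# tidyverse equivalent: toy |> mutate(age_cent = age - mean(age))

toy
mean(toy$age_cent)   # zero, up to floating-point rounding
```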

Exercise 6

Fit a logistic regression model using age_cent, gender, treatment, and nep_high to predict the odds of donating. Neatly display the model using 3 decimal places.
The researchers are most interested in the effect of watching the nature documentary. Describe the effect of treatment in terms of the odds of donating.
What group of participants is described by the intercept? What is the predicted probability a randomly selected individual in this group donates?
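
The general workflow for these parts can be sketched on simulated data. Everything below is an assumption made for illustration: the data are simulated, only two predictors are used, and the coefficients used to generate the response are arbitrary. The sketch sets the control group as the baseline level explicitly, because R's default alphabetical ordering of a character variable would otherwise treat "NATURE (T2)" as the reference.

```r
# Simulated stand-in data: variable names mirror the assignment, values do not.
set.seed(221)
n <- 400
sim <- data.frame(
  age_cent = rnorm(n, sd = 5),
  treatment = factor(sample(c("URBAN (T1)", "NATURE (T2)"), n, replace = TRUE),
                     levels = c("URBAN (T1)", "NATURE (T2)"))  # control as baseline
)
sim$donation_binary <- rbinom(
  n, 1,
  plogis(-0.2 + 0.03 * sim$age_cent + 0.6 * (sim$treatment == "NATURE (T2)"))
)

fit <- glm(donation_binary ~ age_cent + treatment, data = sim, family = binomial)

round(coef(summary(fit)), 3)           # estimates on the log-odds scale
# tidymodels-style display: tidy(fit) |> kable(digits = 3)

exp(coef(fit))                         # multiplicative change in the odds
plogis(coef(fit)[["(Intercept)"]])     # probability when age_cent = 0 and treatment is baseline
```

Exponentiating a coefficient moves it from the log-odds scale to the odds scale, and `plogis()` converts a log-odds value to a probability.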

Exercise 7

The authors include an interaction effect between nep_high and treatment in one of their models.

Explain what an interaction between nep_high and treatment means in the context of the data.
Create a visualization to explore the potential interaction between these two variables. Based on the visualization, does there appear to be an interaction effect? Briefly explain.

Exercise 8

Conduct a drop-in-deviance test to determine if the interaction between nep_high and treatment should be added to the model fit in Exercise 6. Include the hypotheses in mathematical notation, the output from the test, and the conclusion in the context of the data.
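
A drop-in-deviance test compares the residual deviance of a reduced model with that of a larger nested model. The sketch below shows the mechanics on simulated data, assuming a simplified pair of models with only `nep_high` and `treatment`; the data, the generating coefficients, and the object names are illustrative, and your models should include the predictors from Exercise 6.

```r
# Simulated stand-in data: variable names mirror the assignment, values do not.
set.seed(221)
n <- 400
sim <- data.frame(
  nep_high = rbinom(n, 1, 0.5),
  treatment = factor(sample(c("URBAN (T1)", "NATURE (T2)"), n, replace = TRUE),
                     levels = c("URBAN (T1)", "NATURE (T2)"))
)
sim$donation_binary <- rbinom(n, 1, plogis(-0.3 + 0.5 * sim$nep_high))

reduced <- glm(donation_binary ~ nep_high + treatment, data = sim, family = binomial)
full <- glm(donation_binary ~ nep_high * treatment, data = sim, family = binomial)

# Drop-in-deviance (likelihood ratio) test for the nested models
test <- anova(reduced, full, test = "Chisq")
test

# The same statistic and p-value computed by hand
G <- deviance(reduced) - deviance(full)
df_diff <- df.residual(reduced) - df.residual(full)
p_value <- pchisq(G, df = df_diff, lower.tail = FALSE)
c(statistic = G, df = df_diff, p_value = p_value)
```

The degrees of freedom equal the number of additional parameters in the larger model, which is 1 here because the interaction adds a single coefficient.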

AI disclosure

Did you use an LLM / Generative AI tool to complete this assignment? If not, copy and paste the first option in your Quarto document. Otherwise, copy and paste all statements that describe how you used it. The purpose of the disclosure is for you to reflect on how you’re using AI in this course. It also helps me learn how students are most effectively using AI.

I didn’t use an LLM / Generative AI tool.
I asked it to clarify one or more exercises.
I asked it clarifying questions to better understand a concept.
I asked it to help write code to complete an exercise.
I gave it my code and asked it to help me fix it.
I asked it about an error or why code would do something I didn’t want.
I pasted the exercise prompt in AI and asked for help, but I wrote my answer myself.
I pasted the exercise prompt in AI and copied and pasted at least some of the answer into my Quarto document.
Other:______

Submission

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

Remember: you must turn in a PDF file to the Gradescope page before the submission deadline for full credit.

If you write your responses to conceptual exercises by hand, you will need to combine your written work with the completed PDF for the applied exercises before submitting on Gradescope.

Instructions to combine PDFs:

Preview (Mac): support.apple.com/guide/preview/combine-pdfs-prvw43696/mac
Adobe (Mac or PC): helpx.adobe.com/acrobat/using/merging-files-single-pdf.html
- Get free access to Adobe Acrobat as a Duke student: oit.duke.edu/help/articles/kb0030141/

To submit your assignment:

Access Gradescope through the menu on the STA 221 Canvas site.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. All of the pages of your assignment should be associated with at least one question (i.e., should be “checked”).
Select the first page of your PDF submission to be associated with the “Workflow & formatting” section.

Grading

Component	Points
Ex 1	5
Ex 2	5
Ex 3	8
Ex 4	7
Ex 5	7
Ex 6	5
Ex 7	5
Ex 8	4
AI Disclosure	1
Workflow & formatting	3

The “Workflow & formatting” grade assesses the reproducible workflow and document format for the applied exercises. This includes having at least 3 informative commit messages, a neatly organized document with readable code, and your name and the date updated in the YAML.

References

Agresti, Alan. 2013. Categorical Data Analysis. John Wiley & Sons.

Berry, Scott M. 2001. “A Statistician Reads the Sports Pages: Luck in Sports.” Chance 14 (1): 52–57.

Ibanez, Lisette, and Sébastien Roussel. 2022. “The Impact of Nature Video Exposure on Pro-Environmental Behavior: An Experimental Investigation.” PLOS ONE 17 (11): e0275806.