In this chapter, we’ll be working with some real data from a study looking at the effects of sleep deprivation on psychomotor performance (Belenky et al., 2003)[^1]. Data from this study is included as the built-in dataset sleepstudy in the lme4 package for R.
Understanding your data
Question 1
First we need to load in the data.
```{r}set.seed(72024)library(tidyverse)```
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
```{r}library(lme4)```
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
```{r}# Use data() to load a data set from a package into your environmentdata("sleepstudy")```
Examine your data. Use ?sleepstudy to look at the documentation for this dataset.
Question 2
Notice the documentation says “days 0-1 were adaptation and training (T1/T2), day 2 was baseline (B); sleep deprivation started after day 2”. For our purposes we don’t care about the days they were training participants, we want to look at the days they were sleep deprived.
Run the code below to make a new variable days_since_baseline and remove observations from days 0 and 1. Notice we are making a new data frame instead of overwriting the one from the package.
What general patter(s) do you see in the data? Would it be reasonable to assume independence between data points?
Question 2 Solution
There appears to be a positive linear association between reaction time and days since baseline, for each person. No, it would not be reasonable to assume independence because there are repeat measurements on each person, and in the plots we can see correlation within individuals.
Modeling
For our data exploration purposes let’s consider three different models:
linear model (we know this is incorrect because it breaks the independence assumption, but let’s see what happens…)
linear mixed effects models with subject-specific intercepts
linear mixed effects models with subject-specific intercepts and slopes
Question 3
Write out the theoretical regression equation for each proposed model.
Question 3 Solution
\[Reaction_i | Days_i = \beta_0 + \beta_1 Days_i + \epsilon_i,\] for day \(i\).
\[Reaction_{ij} | Days_{ij} = \beta_0 + b_j +\beta_1 Days_i + \epsilon_{ij},\] for day \(i\) subject \(j\).
\[Reaction_{ij} | Days_{ij} = \beta_0 + b_{0j} + \beta_1 Days_i + b_{1j} Days_i + \epsilon_{ij},\] for day \(i\) subject \(j\).
Question 4
Fit model 1 and save it into an object called model1.
After the model is fit use the following code to examine how well our linear regression equation represents the data. Does it do a good job?
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ days_since_baseline + (1 | Subject)
Data: sleep_study_data
REML criterion at convergence: 1430
Scaled residuals:
Min 1Q Median 3Q Max
-3.6261 -0.4450 0.0474 0.5199 4.1378
Random effects:
Groups Name Variance Std.Dev.
Subject (Intercept) 1746.9 41.80
Residual 913.1 30.22
Number of obs: 144, groups: Subject, 18
Fixed effects:
Estimate Std. Error t value
(Intercept) 267.967 10.871 24.65
days_since_baseline 11.435 1.099 10.40
Correlation of Fixed Effects:
(Intr)
dys_snc_bsl -0.354
```{r}# Make new data frame with predictorsmodel2_prediction_data <- crossing( Subject = sleep_study_data %>% pull(Subject) %>% levels() %>% factor(), days_since_baseline = 0:7) # Add column of predicted valuesmodel2_prediction_data <- mutate( model2_prediction_data, Reaction = predict(model2, model2_prediction_data))# Plot our regression lines for all participantsggplot(sleep_study_data, aes(x = days_since_baseline, y = Reaction)) + geom_line( data = model2_prediction_data, color = 'blue' ) + geom_point() + scale_x_continuous(breaks = 0:7) + facet_wrap(~ Subject) + labs(y = "Reaction Time", x = "Days deprived of sleep (0 = baseline)")```
This model does fit better generally, but still doesn’t represent the some subjects data well.
Question 6
Fit proposed model 2 and save it into an object called model2.
After the model is fit use the following code to examine how well our linear regression equation represents the data. Does it do a good job?
```{r}#| eval: false# Make new data frame with predictorsmodel3_prediction_data <- crossing( Subject = sleep_study_data %>% pull(Subject) %>% levels() %>% factor(), days_since_baseline = 0:7) # Add column of predicted valuesmodel3_prediction_data <- mutate( model3_prediction_data, Reaction = predict(model3, model3_prediction_data))# Plot our regression lines for all participantsggplot(sleep_study_data, aes(x = days_since_baseline, y = Reaction)) + geom_line( data = model3_prediction_data, color = 'blue' ) + geom_point() + scale_x_continuous(breaks = 0:7) + facet_wrap(~ Subject) + labs(y = "Reaction Time", x = "Days deprived of sleep (0 = baseline)")```
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ days_since_baseline + (1 + days_since_baseline | Subject)
Data: sleep_study_data
REML criterion at convergence: 1404.1
Scaled residuals:
Min 1Q Median 3Q Max
-4.0157 -0.3541 0.0069 0.4681 5.0732
Random effects:
Groups Name Variance Std.Dev. Corr
Subject (Intercept) 958.35 30.957
days_since_baseline 45.78 6.766 0.18
Residual 651.60 25.526
Number of obs: 144, groups: Subject, 18
Fixed effects:
Estimate Std. Error t value
(Intercept) 267.967 8.266 32.418
days_since_baseline 11.435 1.845 6.197
Correlation of Fixed Effects:
(Intr)
dys_snc_bsl -0.062
```{r}# Make new data frame with predictorsmodel3_prediction_data <- crossing( Subject = sleep_study_data %>% pull(Subject) %>% levels() %>% factor(), days_since_baseline = 0:7) # Add column of predicted valuesmodel3_prediction_data <- mutate( model3_prediction_data, Reaction = predict(model3, model3_prediction_data))# Plot our regression lines for all participantsggplot(sleep_study_data, aes(x = days_since_baseline, y = Reaction)) + geom_line( data = model3_prediction_data, color = 'blue' ) + geom_point() + scale_x_continuous(breaks = 0:7) + facet_wrap(~ Subject) + labs(y = "Reaction Time", x = "Days deprived of sleep (0 = baseline)")```
This model best represents each individual’s data.
Interpretation
Ultimately, we want to investigate the linear relationship between reaction time and sleep deprivation. Hopefully at this point you have found the model with random intercepts and slopes to best represent the data.
Let’s look at the fixed effect for sleep deprivation days (days_since_baseline).
For all the three models, what is the estimate of the association between sleep deprivation and reaction time?
For the average individual, we estimate an increase in average reaction time of 11.4354287ms to be associated with a 1 day increase in sleep deprivation.
Working with the lme4 objects… to get random effects and fixed effects use: