Day 10 Lab: Linear Regression Solutions

Simple linear regression

Statins are a class of drugs widely used to lower cholesterol. There are two main types of cholesterol: low density lipoprotein (LDL) and high density lipoprotein (HDL). Research suggests that adults with elevated LDL may be at risk for adverse cardiovascular events such as a heart attack or stroke. In 2013, a panel of experts commissioned by the American College of Cardiology and the American Heart Association recommended that statin therapy be considered in individuals who either have any form of atherosclerotic cardiovascular disease or have LDL cholesterol levels ≥ 190 mg/dL, individuals with Type II diabetes ages 40 to 75 with LDL between 70 to 189 mg/dL, and non-diabetic individuals ages of 40 to 75 with a predicted probability of future clogged arteries of at least 0.075.

Health policy analysts have estimated that if the new guidelines were to be followed, almost half of Americans ages 40 to 75 and nearly all men over 60 would be prescribed a statin. However, some physicians have raised the question of whether treatment with a statin might be associated with an increased risk of cognitive decline. Older adults are at increased risk for cardiovascular disease, but also for cognitive decline. A study by Joosten, et al. examined the association of statin use and other variables with cognitive ability in an observational cohort of 4,095 participants from the Netherlands who were part of the larger PREVEND study. We will focus on a random sample of 500 participants from the cohort.

Question 1

Read in an examine the data (e.g. use glimpse(), head(), summary(), etc.).

Cognitive function is measured via the Ruff Figural Fluency Test (RFFT). The RFFT is one measure of cognitive function that provides information about cognitive abilities such as planning and the ability to switch between different tasks. Scores range from 0 (worst) to 175 (best)

Question 1 Solution

```{r}
library(tidyverse)
```
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
```{r}
prevend_data <- read_csv("data/prevend.samp.csv")
```
Rows: 500 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): Gender
dbl (30): Casenr, Age, Ethnicity, Education, RFFT, VAT, CVD, DM, Smoking, Hy...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```{r}
glimpse(prevend_data)
```
Rows: 500
Columns: 31
$ Casenr        <dbl> 2266, 3235, 1068, 3422, 3570, 1932, 3134, 3573, 1103, 86…
$ Age           <dbl> 55, 65, 46, 68, 70, 53, 64, 70, 46, 44, 46, 75, 41, 41, …
$ Gender        <chr> "Female", "Female", "Male", "Female", "Male", "Male", "M…
$ Ethnicity     <dbl> 3, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Education     <dbl> 2, 1, 3, 2, 2, 0, 1, 3, 2, 3, 3, 1, 2, 1, 0, 3, 1, 3, 3,…
$ RFFT          <dbl> 62, 79, 89, 70, 35, 14, 31, 47, 88, 91, 77, 36, 73, 117,…
$ VAT           <dbl> -1, 11, 6, 5, 10, 7, 8, 5, 11, 11, 10, 7, 9, 10, 10, 12,…
$ CVD           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ DM            <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Smoking       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0,…
$ Hypertension  <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,…
$ BMI           <dbl> 39.66942, 28.98114, 23.23346, 22.34352, 32.41004, 29.158…
$ SBP           <dbl> 122.0, 107.5, 120.5, 114.5, 114.0, 126.5, 119.0, 141.0, …
$ DBP           <dbl> 63.5, 66.5, 75.0, 67.0, 72.5, 75.0, 77.0, 76.5, 99.0, 70…
$ MAP           <dbl> 86.0, 82.5, 92.5, 85.0, 89.5, 95.0, 92.5, 100.0, 119.5, …
$ eGFR          <dbl> 83.29573, 76.49061, 76.44909, 61.23697, 88.14615, 91.905…
$ Albuminuria.1 <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ Albuminuria.2 <dbl> 0, 0, 2, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0,…
$ Chol          <dbl> 3.86, 5.64, 6.83, 7.11, 5.04, 3.05, 4.90, 5.50, 3.92, 5.…
$ HDL           <dbl> 1.54, 1.53, 1.04, 1.85, 1.40, 0.79, 1.23, 1.57, 1.39, 1.…
$ Statin        <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,…
$ Solubility    <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -1, 2, 2, 1, 2…
$ Days          <dbl> -1, 1672, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1…
$ Years         <dbl> -1.0000000, 4.5808219, -1.0000000, -1.0000000, -1.000000…
$ DDD           <dbl> 0.000, 1373.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.00…
$ FRS           <dbl> 8, 11, -1, 9, 12, 8, 12, 15, 11, 8, 11, 21, 7, 4, 23, 1,…
$ PS            <dbl> 0.37427518, 0.25594606, 0.12850400, 0.09417574, 0.193432…
$ PSquint       <dbl> 5, 4, 3, 2, 4, 3, 3, 4, 3, 2, 2, 5, 2, 1, 5, 1, 3, 4, 1,…
$ GRS           <dbl> 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0,…
$ Match_1       <dbl> 816, 727, -1, 838, -1, 15, 200, -1, -1, -1, -1, 272, -1,…
$ Match_2       <dbl> 113, 242, -1, -1, 276, 121, -1, -1, -1, -1, -1, 814, -1,…

Question 2

Plot cognitive function by age. What relationship do you observe, if any?

Question 2 Solution

```{r}
ggplot(prevend_data, aes(x = Age, y = RFFT)) +
  geom_point() +
  theme_bw()
```

There appears to be a negative linear relationship between cognitive function and age.

Question 3

Let’s begin by fitting a simple linear model of RFFT regressed on age. Fit the model and write out the estimated linear model.

Question 3 Solution

```{r}
model1 <- lm(RFFT ~ Age, data = prevend_data)

model1_intercept <- summary(model1)$coefficients["(Intercept)", ]
model1_slope <- summary(model1)$coefficients["Age", ]
```

The estimated model is \[\hat{RFFT} = 137.55 - 1.26 Age\].

Question 4

Interpret \(\hat{\beta_0}\) and \(\hat{\beta_1}\) in context from this model. Do they make sense?

Question 4 Solution

We estimate an average RFFT score of 137.5497165 for someone of Age 0. The intercept does not make sense in this context.

(Note the inline R code used to answer this question.)

The average mean difference in RFFT for two populations that differ by one year is of NA. This seems reasonable judging from our plot.

Question 5

Reproduce the plot from question 2 with the regression line added to the plot.

Question 5 Solution

There are different ways to do this. I chose to manually specify the coefficients we got from question 3.

```{r}
ggplot(prevend_data, aes(x = Age, y = RFFT)) +
  geom_point() +
  theme_bw() +
  geom_abline(
    intercept = as.numeric(model1$coefficients["(Intercept)"]),
    slope  = as.numeric(model1$coefficients["Age"]), 
    color = "blue"
  )
```

Alternatively, using geom_smooth:

```{r}
ggplot(prevend_data, aes(x = Age, y = RFFT)) +
  geom_point() +
  theme_bw() +
  geom_smooth(
    method="lm",
    formula = "y~x",
    color = "blue"
  )
```

Linear regression

Question 6

In this investigation we want to understand the relationship between statin and cognitive function, specifically RFFT. Let’s now fit the model of RFFT regressed on statin.

Fit the model and write out the estimate regression equation.

Interpret each coefficient in context.

Let’s start by refactoring the variable Statin in our data frame

```{r}
prevend_data <- prevend_data |> mutate(Statin = as.factor(Statin))
```

Question 6 Solution

```{r}
model2 <- lm(RFFT ~ Statin, data = prevend_data)

summary(model2)
```

Call:
lm(formula = RFFT ~ Statin, data = prevend_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-56.714 -22.714   0.286  18.299  73.339 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   70.714      1.381  51.212  < 2e-16 ***
Statin1      -10.053      2.879  -3.492 0.000523 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 27.09 on 498 degrees of freedom
Multiple R-squared:  0.0239,    Adjusted R-squared:  0.02194 
F-statistic: 12.19 on 1 and 498 DF,  p-value: 0.0005226

Our estimated model is \[\hat{RFFT} = 70.71 - 10.05 \times Statin.\]

We estimate an RFFT score of 70.7142857 for someone not receiving statin therapy. We estimate statin users to have an RFFT score 10.0534161 lower than that of a non statin user, on average.

Question 7

The investigators behind the Joosten study anticipated an issue in the analysis—statins are used more often in older adults than younger adults, and older adults suffer a natural cognitive decline.

This means that Age is a potential confounder in this setting.

If age is not accounted for in the analysis, it may seem that cognitive decline is more common among individuals prescribed statins, simply because those prescribed statins are simply older and more likely to have reduced cognitive ability than those not prescribed statins.

We want to examine the relationship between statin use and cognitive function, controlling for age. Fit the appropriate regression and interpret the coefficient of statin in context.

Question 7 Solution

```{r}
model3 <- lm(RFFT ~ Statin + Age, data = prevend_data)

summary(model3)
```

Call:
lm(formula = RFFT ~ Statin + Age, data = prevend_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-63.855 -16.860  -1.178  15.730  58.751 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 137.8822     5.1221  26.919   <2e-16 ***
Statin1       0.8509     2.5957   0.328    0.743    
Age          -1.2710     0.0943 -13.478   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23.21 on 497 degrees of freedom
Multiple R-squared:  0.2852,    Adjusted R-squared:  0.2823 
F-statistic: 99.13 on 2 and 497 DF,  p-value: < 2.2e-16

We estimate statin users to have an RFFT score NA higher than that of a non statin user, on average for people of the same age.

Question 8

We are learning to be responsible statisticians so we need to provide some measure of uncertainty for our estimates. In a real setting 95% confidence intervals would be reported, often in parentheses, with our reported estimate.

  • Compute 95% confidence intervals for the coefficient of statin for the models with and without age.
  • If they differ, is one of the confidence intervals wrong?

Question 8 Solution

```{r}
model_without_age <- summary(model2)$coefficients["Statin1", ]

(model2_95_ci <- model_without_age["Estimate"] + c(-1, 1) * qnorm(1 - 0.05 / 2) * model_without_age["Std. Error"])

model_with_age <- summary(model3)$coefficients["Statin1", ]

(model3_95_ci <- model_with_age["Estimate"] + c(-1, 1) * qnorm(1 - 0.05 / 2) * model_with_age["Std. Error"])
```
[1] -15.696528  -4.410304
[1] -4.236622  5.938361

The 95% confidence interval for difference in RFFT for statin users compared to non statin users is (-15.7, -4.41), and controlling for age it’s (-4.24, 5.94).

Alternatively, we can use the confint function in R.

```{r}
print("For the model without Age:")
confint(model2)["Statin1",]
print("For the model with Age:")
confint(model3)["Statin1",]
```
[1] "For the model without Age:"
     2.5 %     97.5 % 
-15.710276  -4.396556 
[1] "For the model with Age:"
    2.5 %    97.5 % 
-4.249041  5.950781 

Remember you can always use ?confint for details.

Neither interval is “wrong”, they just correspond to different values, so it just depends on what we care about. We wanted to investigate the true relationship between cognition and statin use, but age confounds this relationship so we do want to control for it.

Question 9

What if we were now interested in the relationship between cognitive function and age, and we suspect that this relationship would differ for those using statin and those not.

  • Fit the appropriate model for this (i.e. we want to allow for different slopes from age for non statin and statin users).
  • Report the coefficient of the association between Age and RFFT for the group that takes statin and the one that does not take statin.
  • Did you find evidence of a different slope for age based on statin usage?
  • Plot the resulting regression lines for non statin and statin users, with 95% credible intervals, over the plot from question 2.

Question 9 Solutions

```{r}
model4 <- lm(RFFT ~ Age + Statin + Age * Statin, data = prevend_data)
summary(model4)
int_sum <- summary(model4)$coefficient["Age:Statin1", ]
model_4_96_ci <- int_sum["Estimate"] + c(-1, 1) * qnorm(1 - 0.05 / 2) * int_sum["Std. Error"]
confint(model4)
```

Call:
lm(formula = RFFT ~ Age + Statin + Age * Statin, data = prevend_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-64.551 -16.963  -1.179  15.764  58.802 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 140.2031     5.6209  24.943   <2e-16 ***
Age          -1.3149     0.1040 -12.646   <2e-16 ***
Statin1     -13.9720    15.0113  -0.931    0.352    
Age:Statin1   0.2474     0.2468   1.003    0.317    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23.21 on 496 degrees of freedom
Multiple R-squared:  0.2866,    Adjusted R-squared:  0.2823 
F-statistic: 66.42 on 3 and 496 DF,  p-value: < 2.2e-16

                  2.5 %     97.5 %
(Intercept) 129.1593754 151.246847
Age          -1.5192091  -1.110615
Statin1     -43.4656373  15.521594
Age:Statin1  -0.2374898   0.732383

We estimate the interaction coefficient to be between -0.24 and 0.73, with 95% confidence, therefore we did not find evidence of a different slope for age based on statin usage.

This is also visually evident in the following plot.

```{r}
ggplot(
  data = prevend_data, 
  aes(x = Age, y = RFFT, color = Statin)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) + theme_bw()
```
`geom_smooth()` using formula = 'y ~ x'