Day 11 Lab: Poisson Regression Solutions

The Florida Museum of Natural History maintains the International Shark Attack File, a record of unprovoked shark attacks (and resultant fatalities) worldwide. Patterns in Florida are particularly interesting since the peninsular nature of the state means that almost all Florida residents and visitors are within 90 minutes of the Atlantic Ocean or Gulf of Mexico. Further, since population figures for Florida are available, it is possible to account for changes in coastal population by examining patterns in attack rates, rather than simply attack counts.

The shark data can be found in Simonoff’s book Analyzing Categorical Data. For our purposes we will assume that the observations are independent.

```{r}
library(tidyverse)
```
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Previously we’ve been working with data stored in csv format. There are different ways to store data, and there are corresponding functions to read in the data.

```{r}
shark_data <- readxl::read_xls("data/floridashark.xls")
```

Poisson regression for counts

Question 1

Examine the data and plot the number of attacks by year.

Question 1 Solution

```{r}
glimpse(shark_data)

ggplot(data = shark_data, aes(x = Year, y = Attacks)) +
  geom_point()
```
Rows: 54
Columns: 4
$ Year       <dbl> 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955,…
$ Population <dbl> 2473000, 2539000, 2578000, 2668000, 2771305, 2980000, 31570…
$ Attacks    <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 0, 1, 5, 2, 4, 2, 4, 2, 3, 7, 6,…
$ Fatalities <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,…

Question 2

Shark attacks are a rare event and Poisson regression seems like it may be appropriate to model these counts. If the following model were fit \[\log(E[Attacks_i]) = \beta_0 + \beta_1 Year_i,\] the intercept would not have a meaningful interpretation. Instead let’s make it have a meaningful interpretation by adjusting for the baseline year \[\log(E[Attacks_i]) = \beta_0 + \beta_1 I(Year_i - 1946).\]

Fit this model and interpret \(\beta_0\) and \(\beta_1\) in context.

Question 2 Solution

```{r}
count_model <- glm(
  Attacks ~ I(Year - 1946), 
  data = shark_data, 
  family = poisson(link = "log")
) 

count_model_sum <- summary(count_model)

count_model_95cis_beta0 <- count_model_sum$coefficients["(Intercept)", "Estimate"] + c(-1, 1) * qnorm(1 - 0.05 / 2) * count_model_sum$coefficients["(Intercept)", "Std. Error"]

count_model_95cis_beta1 <- count_model_sum$coefficients["I(Year - 1946)", "Estimate"] + c(-1, 1) * qnorm(1 - 0.05 / 2) * count_model_sum$coefficients["I(Year - 1946)", "Std. Error"]
```

We estimate the log average number of attacks in year 1964 to be -0.1174036 (95% C.I. (-0.4566943, 0.2218871)).

We estimate an increase in log average number of attacks of 0.0614907 (95% C.I. (0.0532424, 0.069739)), to be associated with a one year increase in time.

Question 3

The direct interpretations from question 3 are less understandable since they are on the log scale. Luckily, for Poisson regression we can, and should in most cases, exponentiate our coefficients to talk about a change in factor of counts. Compute \(exp(\beta_0)\) and \(exp(\beta_1)\), their corresponding 95% confidence intervals, and their interpretations in context. If you get the difference from 1, i.e. \(exp(\beta_1) -1\), and multiply by 100% you can talk about a percent change, which is often more understandable than just \(exp(\beta_1)\).

Question 3 Solution

```{r}
model_count_ebeta0 <- exp(count_model_sum$coefficients["(Intercept)", "Estimate"])

count_model_95cis_ebeta0 <- exp(count_model_95cis_beta0)

model_count_ebeta1 <- exp(count_model_sum$coefficients["I(Year - 1946)", "Estimate"])

count_model_95cis_ebeta1 <- exp(count_model_95cis_beta1)
```

We estimate the average number of attacks in year 1964 to be 0.8892262 (95% C.I. (0.6333739, 1.2484304)).

We estimate an 6.3420581% (95% C.I. (5.4685249%, 7.2228262)) increase in average number of attacks to be associated with a one year increase in time.

Increase every 10 years

```{r}
## Take estimates for original model (in log scale!!)
## and multiply by 10

## 95% CI for count model for ratio of increase every 10 years.
exp(10*count_model_95cis_beta1)

## alternatevely...
count_model_10years <- glm(
 Attacks ~  I((1/10)*(I(Year - 1946))), 
 data = shark_data, 
 family = poisson(link = "log")
) 

count_model_10years_sum <- summary(count_model_10years)

count_model_10years_95cis_beta1 <- count_model_10years_sum$coefficients[2, "Estimate"] + c(-1, 1) * qnorm(1 - 0.05 / 2) * count_model_10years_sum$coefficients[2, "Std. Error"]
exp(count_model_10years_95cis_beta1)
```
[1] 1.703055 2.008503
[1] 1.703055 2.008503

Poisson Regression for rates (Exposure)

Notice the population is steadily increasing each year, so comparing counts across years does appear to be the best idea.

```{r}
shark_data$pop_per_100k <- shark_data$Population / 100000

ggplot(data = shark_data, aes(x = Year, y = pop_per_100k)) +
  geom_point() +
  labs(
    y = "Attack rate per 100,000 in population"
  )
```

Question 4

We want to instead compare shark attack rate, since that is comparable across different years with different populations.

\[\log(Attacks_i) = \log(Population per 100,000 people) + \beta_0 + \beta_1 * I(Year_i - 1964)\]

Fit the model, and provide an estimate and 95% confidence interval for the % change in attack rate associated with a one year difference. You can add the offset by adding offset(log(pop_per_100k)) to the regression equation on the right side of the ~.

Question 4 Solution

```{r}
rate_model <- glm(
  Attacks ~ offset(log(pop_per_100k)) + I(Year - 1946), 
  data = shark_data, 
  family = poisson(link = "log")
) 

rate_model_sum <- summary(rate_model)

# beta1 estimate and 95% ci
rate_model_beta1 <- rate_model_sum$coefficients["I(Year - 1946)", "Estimate"]

rate_model_beta1_95_ci <- rate_model_beta1 + c(-1, 1) * qnorm(1 - 0.05 / 2) * rate_model_sum$coefficients["I(Year - 1946)", "Std. Error"]

# (exp(beta1) - 1) * 100% estimate and 95% CI
rate_model_per_change_ebeta1 <- (exp(rate_model_beta1) - 1) * 100
rate_model_per_change_ebeta1_95_ci <- (exp(rate_model_beta1_95_ci) - 1) * 100
```

We estimate the percent change in attack rate associated with a one year difference to be 3.17% (95% CI (2.29%, 4.05%)).

Question 5

Provide an estimate and 95% confidence interval for the % change in attack rate associated with a ten year difference.

Question 5 Solution

```{r}
# (exp(beta1 * 10) - 1) * 100% estimate and 95% CI
rate_model_per_change_e10beta1 <- (exp(rate_model_beta1 * 10) - 1) * 100
rate_model_per_change_e10beta1_95_ci <- (exp(rate_model_beta1_95_ci * 10) - 1) * 100
```

We estimate the percent change in attack rate associated with a one year difference to be 36.58% (95% CI (25.39%, 48.77%)).

\(log(E[Y])\) versus \(E[log(Y)]\)

Question 6

Instead of using the Poisson regression model, what if we had just created a new column in our data set \(log(attack)\) and fit a linear regression?

```{r}
shark_data$log_attacks <- log(shark_data$Attacks + 0.0000001)
```

Why did I add a small number to Attacks inside of the log?

Write out the linear model for \(log(Attack_i)\) as the response and write out the Poisson model for \(Attack_i\) as the response, with the log link.

Question 6 Solution

Linear model \[E[log(Attack_i)] = \beta_0 + \beta_1 I(Year - 1964)\]

Poisson model \[log(E[Attacks_i) = \beta_0 + \beta_1 I(Year - 1964)\]

Question 7

Fit the linear model and compare to your Poisson model from question 2. Are you estimating the same thing?

Question 7 Solution

```{r}
linear_model <- lm(log_attacks ~ I(Year - 1946), data = shark_data)

linear_model_sum <- summary(linear_model)

lm_beta1 <- linear_model_sum$coefficients["I(Year - 1946)", "Estimate"]
glm_beta1 <- count_model_sum$coefficients["I(Year - 1946)", "Estimate"]

summary(linear_model)
summary(count_model)
```

Call:
lm(formula = log_attacks ~ I(Year - 1946), data = shark_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5607  -0.9584   0.8195   3.7792   6.2136 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -6.41709    1.36661  -4.696 1.98e-05 ***
I(Year - 1946)  0.21703    0.04445   4.882 1.04e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.091 on 52 degrees of freedom
Multiple R-squared:  0.3143,    Adjusted R-squared:  0.3011 
F-statistic: 23.84 on 1 and 52 DF,  p-value: 1.038e-05


Call:
glm(formula = Attacks ~ I(Year - 1946), family = poisson(link = "log"), 
    data = shark_data)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -0.117404   0.173111  -0.678    0.498    
I(Year - 1946)  0.061491   0.004208  14.611   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 387.38  on 53  degrees of freedom
Residual deviance: 117.17  on 52  degrees of freedom
AIC: 286.82

Number of Fisher Scoring iterations: 5

No we are not estimating the same thing.

  • By using the linear model we are treating log(Attacks) as a continuous variable and estimating change in expected log(Attacks) associated with a one year difference.
  • With the generalized linear model (specifically the Poisson model) are treating Attacks as discrete and we are estimating change in log of expected Attacks associated with a one year difference.