Day 9 Lab Data Wrangling Solutions

Today we are going to work with COVID hospital data available at https://data.ca.gov/dataset/covid-19-hospital-data1.

Make sure to practice using functions taught in lecture from packages such as tidyverse instead of base R functions.

Data wrangling

Begin by reading in the data

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
hosp_data <- read_csv("covid19hospitalbycounty.csv")
Rows: 67000 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): county
dbl  (7): hospitalized_covid_confirmed_patients, hospitalized_suspected_covi...
date (1): todays_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Use glimpse() to get a preview of the data and view the columns.

glimpse(hosp_data)
Rows: 67,000
Columns: 9
$ county                                <chr> "Kern", "Kern", "Shasta", "El Do…
$ todays_date                           <date> 2020-03-27, 2020-03-29, 2020-03…
$ hospitalized_covid_confirmed_patients <dbl> 0, 16, 0, 0, 74, 24, 1, 0, 1, 20…
$ hospitalized_suspected_covid_patients <dbl> 0, 57, 0, 23, 167, 85, 10, 9, 5,…
$ hospitalized_covid_patients           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ all_hospital_beds                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ icu_covid_confirmed_patients          <dbl> 0, 8, 0, 0, 31, 5, 0, 0, 0, 9, 1…
$ icu_suspected_covid_patients          <dbl> 0, 8, 0, 12, 40, 19, 0, 0, 0, 8,…
$ icu_available_beds                    <dbl> NA, 39, NA, 11, 131, 14, 18, NA,…

How many observations and variables are in this data set? Which variables are categorical and which are numeric?

Let’s learn some information about this data.

What is the date range for this data?

range(hosp_data$todays_date)
[1] "2020-03-27" "2023-07-08"

This data ranges from March 27, 2020 to July 8, 2023.

How many counties are in this data and which counties are there? (It’s fine to just print output here)

unique(hosp_data$county)
 [1] "Kern"            "Shasta"          "El Dorado"       "Orange"         
 [5] "Ventura"         "Humboldt"        "Lassen"          "Tuolumne"       
 [9] "Stanislaus"      "San Bernardino"  "Tulare"          "Yuba"           
[13] "Contra Costa"    "Butte"           "Inyo"            "San Luis Obispo"
[17] "Los Angeles"     "Santa Barbara"   "Merced"          "Calaveras"      
[21] "Plumas"          "San Benito"      "Sonoma"          "Yolo"           
[25] "Glenn"           "Nevada"          "Siskiyou"        "Trinity"        
[29] "Riverside"       "Napa"            "Santa Clara"     "Placer"         
[33] "Imperial"        "Marin"           "Monterey"        "Mendocino"      
[37] "Alameda"         "San Francisco"   "Madera"          "Kings"          
[41] "San Diego"       "Del Norte"       "Santa Cruz"      "San Joaquin"    
[45] "Lake"            "San Mateo"       "Sacramento"      "Amador"         
[49] "Colusa"          "Modoc"           "Mono"            "Tehama"         
[53] "Sutter"          "Fresno"          "Solano"          "Mariposa"       

There are 56 counties in this data.

Say we want to compare average and median number of COVID confirmed hospitalizations for the counties of Los Angeles, Orange, San Francisco, Sonoma, and San Diego. (You may find it helpful to use the function %in%)

hosp_data %>% 
  filter(county %in% c("Los Angeles", "Orange", "San Francisco", "Sonoma", "San Diego")) %>% 
  group_by(county) %>% 
  summarize(
    ave_covid_hosp = mean(hospitalized_covid_confirmed_patients),
    median_covid_hosp = median(hospitalized_covid_confirmed_patients)
  )
# A tibble: 5 × 3
  county        ave_covid_hosp median_covid_hosp
  <chr>                  <dbl>             <dbl>
1 Los Angeles           1257.                763
2 Orange                 332.                197
3 San Diego              350.                258
4 San Francisco           75.0                63
5 Sonoma                  32.3                28
Trouble with filter

There are some other packages with functions named filter() or select() and if those packages are loaded most recently then sometimes you can get problems. In the future when you are doing data cleaning if filter() or select() are not working but you are confident you have called them corretly, check to make sure tidyverse is your most recently loaded package.

It does not necessarily make sense to compare raw counts because these counties do not have similar populations. Load in the county population data and join it with the hospital data. (I recommend you do not save over you data, instead make a new data frame for the combined data)

county_pop <- read_csv("county-pop.csv") %>% 
  rename(county = County, population = Population)
Rows: 58 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): County
dbl (1): Population

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
comb_hosp_pop <- full_join(x = hosp_data, y = county_pop)
Joining with `by = join_by(county)`

When you join data you always want to check that your joined data set has the expected number of rows and columns, if not, you may have used the wrong join function or your data may be missing values or have extra values.

dim(hosp_data)
[1] 67000     9
dim(county_pop)
[1] 58  2
glimpse(comb_hosp_pop)
Rows: 67,002
Columns: 10
$ county                                <chr> "Kern", "Kern", "Shasta", "El Do…
$ todays_date                           <date> 2020-03-27, 2020-03-29, 2020-03…
$ hospitalized_covid_confirmed_patients <dbl> 0, 16, 0, 0, 74, 24, 1, 0, 1, 20…
$ hospitalized_suspected_covid_patients <dbl> 0, 57, 0, 23, 167, 85, 10, 9, 5,…
$ hospitalized_covid_patients           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ all_hospital_beds                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ icu_covid_confirmed_patients          <dbl> 0, 8, 0, 0, 31, 5, 0, 0, 0, 9, 1…
$ icu_suspected_covid_patients          <dbl> 0, 8, 0, 12, 40, 19, 0, 0, 0, 8,…
$ icu_available_beds                    <dbl> NA, 39, NA, 11, 131, 14, 18, NA,…
$ population                            <dbl> 900202, 900202, 180080, 192843, …

Here we care about the COVID hospital data, and are using the county population data to add info. There are more counties present in the county population data than the COVID hospital data. We only want to keep the info for the counties present in the COVID hosptial data. Consider which join is most appropriate for this, and change your join function accordingly.

comb_hosp_pop <- left_join(x = hosp_data, y = county_pop)
Joining with `by = join_by(county)`

Now that our data is joined, make a new variable which records daily percent of the population that is covid confirmed in the hospital for each county.

comb_hosp_pop <- comb_hosp_pop %>% 
  mutate(per_hosp_covid_cofirmed = 100 * hospitalized_covid_confirmed_patients / population)

Compute average daily percent of the county with COVID confirmed hospitalizations for the counties of Los Angeles, Orange, San Francisco, Sonoma, and San Diego.

comb_hosp_pop %>% 
  filter(county %in% c("Los Angeles", "Orange", "San Francisco", "Sonoma", "San Diego")) %>% 
  group_by(county) %>% 
  summarize(
    ave_per_covid_hosp = mean(per_hosp_covid_cofirmed)
  )
# A tibble: 5 × 2
  county        ave_per_covid_hosp
  <chr>                      <dbl>
1 Los Angeles              0.0125 
2 Orange                   0.0105 
3 San Diego                0.0105 
4 San Francisco            0.00851
5 Sonoma                   0.00654

This data has a wide date range, let’s narrow it down to look at the previously computed averages specifically for records between December 2020 and February 2021.

comb_hosp_pop %>% 
  filter(county %in% c("Los Angeles", "Orange", "San Francisco", "Sonoma", "San Diego")) %>% 
  filter(
    todays_date >= as.Date("2020-12-01") & todays_date <= as.Date("2021-02-01")
  ) %>% 
  group_by(county) %>% 
  summarize(
    ave_per_covid_hosp = mean(per_hosp_covid_cofirmed)
  )
# A tibble: 5 × 2
  county        ave_per_covid_hosp
  <chr>                      <dbl>
1 Los Angeles               0.0607
2 Orange                    0.0536
3 San Diego                 0.0401
4 San Francisco             0.0207
5 Sonoma                    0.0163

What do you notice when you compare the average daily percent hospitalized with confirmed covid for the entire time range with that for the selected few months?

Visualizing our data

ca_covid_hosp_data <- read_csv("covid19hospitalbycounty.csv")
Rows: 67000 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): county
dbl  (7): hospitalized_covid_confirmed_patients, hospitalized_suspected_covi...
date (1): todays_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ca_covid_hosp_data %>% 
  ggplot(aes(fill = county, x = todays_date, y = hospitalized_covid_confirmed_patients)) +
  geom_bar(position="stack", stat="identity")
Warning: Removed 8 rows containing missing values (`position_stack()`).

The above plot tells us something about CA trends, but prevents us from comparing trends among counties, in addition to being absolutely hideous. Let’s focus on just five counties.

ca_five_county_covid_hosp_data <- ca_covid_hosp_data %>%
  filter(county %in% c("Los Angeles", "Orange", "Sacramento", "Santa Clara", "San Francisco"))

ca_five_county_covid_hosp_data  %>% 
  ggplot(aes(fill = county, x = todays_date, y = hospitalized_covid_confirmed_patients)) +
  geom_bar(position = "stack", stat = "identity")

Your task

The 5 county graph is more readable than the first graph, but still has tons of problems if one really wants to compare COVID-19 hospitalization trends across CA counties. Create your own visualization of the 5 county data, remembering best practices that we talked about in the lecture. There are more than one way of doing this, so don’t be inhibited by trying to think of “the right” solution. Also, depending on what information you want to convey with your plot, you may consider making and plotting a new variable scaled by population of the county.