Day 04: Lab

In this lab we will revisit the hospital COVID data again.

library(tidyverse)
library(janitor)
library(here)

hospital_data_og <- read_csv(
  here::here("week1-material/day04-wrangle2-dates/data/covid19hospitalbycounty.csv"),
  show_col_types = FALSE
)

population_data_og <- read_csv(
  here::here("week1-material/day04-wrangle2-dates/data/county-population-census2020.csv"),
  show_col_types = FALSE
)

# This code is from lab 3
population_data_1row <- population_data_og |> 
  clean_names() |> 
  filter(label_grouping == "Total:") |> 
  select(-label_grouping)

population_data_uncleaned <- population_data_1row |> 
  pivot_longer(cols = colnames(population_data_1row)) |> 
  rename(county = name, population = value)

Note the use of here() to read in data that is stored in a different folder than this code file. Check what happens if you try to just run hospital_data_og <- read_csv("data/covid19hospitalbycounty.csv").

Manipulating county variable

In lab 3 we provided the code to do string cleaning of the county column in the population data. Recall the population data from the US Census had the ugly naming convention, and we changed the hospital county names to match that convention.

Question 1

We are now going to take a step by step look at that cleaning. Make sure you understand what we did in the code chunk below. Refer to your lecture notes or look at help documentation for functions you are unsure about.

hospital_data <- hospital_data_og |> 
  mutate(county = str_to_lower(county)) |>
  mutate(county = str_replace_all(county, pattern = " ", replacement = "_")) |>
  mutate(county = str_c(county,"_county_california"))

Question 2

The naming convention used by the census made sense when it contained data for all counties in all states, but here we are only considering California counties so the naming is redundant. Let’s clean our county column in the population data to match the naming convention for the county column in the original hospital data.

Remove “_county_california” from each name in the county column of population_data_uncleaned and save the new data frame into an object named population_data1.

population_data1 <- population_data_uncleaned |>
  mutate(county = str_remove(county,"_county_california"))

Question 3

Replace all “_” with ” ” in each name in the county column of population_data1 and save the new data frame into an object named population_data2.

population_data2 <- population_data1 |>
  mutate(county = str_replace_all(county,"_"," "))

Question 4

Capitalize the first letter of each word in each name in the county column of population_data2 and save the new data frame into an object named population_data3.

population_data3 <- population_data2 |>
  mutate(county = str_to_title(county))

Question 5

Use anti_join() to check that you did the string manipulations correctly. If the county names match, this should return a data frame with two rows for those counties present in the population data and not present in the hospital data.

anti_join(population_data3,hospital_data_og)
Joining with `by = join_by(county)`
# A tibble: 2 × 2
  county population
  <chr>       <dbl>
1 Alpine       1204
2 Sierra       3236

Alpine and Sierra counties did not find a match.

Question 6

Notice in questions 5-7 I had you make a new data frame each time you made a change to the county data. I find this helpful when I am mutating existing columns in my data and want to check that I am doing it correctly, but this can cause confusion later on, especially for long code scripts/files. One could easily forget the difference between population_data_uncleaned, population_data1, population_data2, and population_data3, and accidentally use the wrong one.

Once you are confident in your code it is better practice to do data cleaning using pipes which save into a single data frame with clear naming, instead individual data frames for each mutate() call. Combine your code from questions 5-7 into a single pipe (like in question 4) that begins with population_data_uncleaned and saves into an object named population_data_cleaned.

population_data_cleaned <- population_data_uncleaned |>
    mutate(county = str_remove(county,"_county_california")) |>
    mutate(county = str_replace_all(county,"_"," ")) |>
    mutate(county = str_to_title(county))  

Question 7

Now join the hospital and population data into a single data frame (with an understandable name!).

hospital_pop_data <- left_join(x=hospital_data_og,population_data_cleaned)
Joining with `by = join_by(county)`

New numeric variable

Question 8

Like in lab 11, make a new variable which records daily percent of the population that is COVID confirmed in the hospital for each county.

hospital_pop_data <- hospital_pop_data |> mutate(pct_hospital = hospitalized_covid_confirmed_patients / population *100)

Saving your cleaned data

Question 9

Arrange your data by date and then county using arrange().

hospital_pop_data <- hospital_pop_data |> arrange(county)

Question 10

Save your cleaned combined data frame into the data folder using write_csv(). Make sure to follow the naming convention discussed in today’s lecture.

Extra challenge: New factor variable (optional)

I recommend tackling this section once you finish labs part 1 and part 2, if you have time.

Question 11

Make a new variable called covid_burden with three categories which correspond to:

  • “low” when the fraction of hospitalized COVID confirmed patients out of all hospital beds is less than 10%
  • “medium” when the fraction of hospitalized COVID confirmed patients out of all hospital beds is greater than or equal to 10 % and less than 25%
  • “high” when the fraction of hospitalized COVID confirmed patients out of all hospital beds is greater than or equal to 25%

Make all remaining rows have value NA for this variable (these would be any row with an NA for either hospitalized_covid_confirmed_patients or all_hospital_beds).

To check you did this right you should have 2,969 high, 9,986 medium, 66,392 low, and 9,242 NA. You can use table(combined_covid_hosp_pop_data$covid_burden, useNA = "ifany") to check.

hospital_pop_data <- hospital_pop_data |> 
  mutate(
    covid_burden = cut(
      (hospitalized_covid_confirmed_patients / all_hospital_beds )*100,
      breaks=c(0,10,25,100),right=FALSE)
  )
table(hospital_pop_data$covid_burden, useNA="ifany")

  [0,10)  [10,25) [25,100)     <NA> 
   66392     9986     2969     9242 

Question 12

Make covid_burden into a factor variable with levels in the order of “Low”, “Medium”, “High”. Notice if you don’t specify the levels then it will sort them alphabetically. Compare your counts of each level to check that you specified them correctly.

hospital_pop_data <- hospital_pop_data |> 
  mutate(
    covid_burden = cut(
      (hospitalized_covid_confirmed_patients / all_hospital_beds )*100,
      breaks=c(0,10,25,100),right=FALSE,labels=c("low","medium","high"))
  )
table(hospital_pop_data$covid_burden, useNA="ifany")

   low medium   high   <NA> 
 66392   9986   2969   9242