library(tidyverse)
library(janitor)
library(here)
<- read_csv(
hospital_data_og ::here("week1-material/day04-wrangle2-dates/data/covid19hospitalbycounty.csv"),
hereshow_col_types = FALSE
)
<- read_csv(
population_data_og ::here("week1-material/day04-wrangle2-dates/data/county-population-census2020.csv"),
hereshow_col_types = FALSE
)
# This code is from lab 3
<- population_data_og |>
population_data_1row clean_names() |>
filter(label_grouping == "Total:") |>
select(-label_grouping)
<- population_data_1row |>
population_data_uncleaned pivot_longer(cols = colnames(population_data_1row)) |>
rename(county = name, population = value)
Day 04: Lab
In this lab we will revisit the hospital COVID data again.
Note the use of here()
to read in data that is stored in a different folder than this code file. Check what happens if you try to just run hospital_data_og <- read_csv("data/covid19hospitalbycounty.csv")
.
Manipulating county
variable
In lab 3 we provided the code to do string cleaning of the county column in the population data. Recall the population data from the US Census had the ugly naming convention, and we changed the hospital county names to match that convention.
Question 1
We are now going to take a step by step look at that cleaning. Make sure you understand what we did in the code chunk below. Refer to your lecture notes or look at help documentation for functions you are unsure about.
<- hospital_data_og |>
hospital_data mutate(county = str_to_lower(county)) |>
mutate(county = str_replace_all(county, pattern = " ", replacement = "_")) |>
mutate(county = str_c(county,"_county_california"))
Question 2
The naming convention used by the census made sense when it contained data for all counties in all states, but here we are only considering California counties so the naming is redundant. Let’s clean our county
column in the population data to match the naming convention for the county
column in the original hospital data.
Remove “_county_california” from each name in the county
column of population_data_uncleaned
and save the new data frame into an object named population_data1
.
<- population_data_uncleaned |>
population_data1 mutate(county = str_remove(county,"_county_california"))
Question 3
Replace all “_” with ” ” in each name in the county
column of population_data1
and save the new data frame into an object named population_data2
.
<- population_data1 |>
population_data2 mutate(county = str_replace_all(county,"_"," "))
Question 4
Capitalize the first letter of each word in each name in the county
column of population_data2
and save the new data frame into an object named population_data3
.
<- population_data2 |>
population_data3 mutate(county = str_to_title(county))
Question 5
Use anti_join()
to check that you did the string manipulations correctly. If the county names match, this should return a data frame with two rows for those counties present in the population data and not present in the hospital data.
anti_join(population_data3,hospital_data_og)
Joining with `by = join_by(county)`
# A tibble: 2 × 2
county population
<chr> <dbl>
1 Alpine 1204
2 Sierra 3236
Alpine and Sierra counties did not find a match.
Question 6
Notice in questions 5-7 I had you make a new data frame each time you made a change to the county data. I find this helpful when I am mutating existing columns in my data and want to check that I am doing it correctly, but this can cause confusion later on, especially for long code scripts/files. One could easily forget the difference between population_data_uncleaned
, population_data1
, population_data2
, and population_data3
, and accidentally use the wrong one.
Once you are confident in your code it is better practice to do data cleaning using pipes which save into a single data frame with clear naming, instead individual data frames for each mutate() call. Combine your code from questions 5-7 into a single pipe (like in question 4) that begins with population_data_uncleaned
and saves into an object named population_data_cleaned
.
<- population_data_uncleaned |>
population_data_cleaned mutate(county = str_remove(county,"_county_california")) |>
mutate(county = str_replace_all(county,"_"," ")) |>
mutate(county = str_to_title(county))
Question 7
Now join the hospital and population data into a single data frame (with an understandable name!).
<- left_join(x=hospital_data_og,population_data_cleaned) hospital_pop_data
Joining with `by = join_by(county)`
New numeric variable
Question 8
Like in lab 11, make a new variable which records daily percent of the population that is COVID confirmed in the hospital for each county.
<- hospital_pop_data |> mutate(pct_hospital = hospitalized_covid_confirmed_patients / population *100) hospital_pop_data
Saving your cleaned data
Question 9
Arrange your data by date and then county using arrange()
.
<- hospital_pop_data |> arrange(county) hospital_pop_data
Question 10
Save your cleaned combined data frame into the data folder using write_csv()
. Make sure to follow the naming convention discussed in today’s lecture.
Extra challenge: New factor variable (optional)
I recommend tackling this section once you finish labs part 1 and part 2, if you have time.
Question 11
Make a new variable called covid_burden
with three categories which correspond to:
- “low” when the fraction of hospitalized COVID confirmed patients out of all hospital beds is less than 10%
- “medium” when the fraction of hospitalized COVID confirmed patients out of all hospital beds is greater than or equal to 10 % and less than 25%
- “high” when the fraction of hospitalized COVID confirmed patients out of all hospital beds is greater than or equal to 25%
Make all remaining rows have value NA
for this variable (these would be any row with an NA
for either hospitalized_covid_confirmed_patients or all_hospital_beds).
To check you did this right you should have 2,969 high, 9,986 medium, 66,392 low, and 9,242 NA
. You can use table(combined_covid_hosp_pop_data$covid_burden, useNA = "ifany")
to check.
<- hospital_pop_data |>
hospital_pop_data mutate(
covid_burden = cut(
/ all_hospital_beds )*100,
(hospitalized_covid_confirmed_patients breaks=c(0,10,25,100),right=FALSE)
)table(hospital_pop_data$covid_burden, useNA="ifany")
[0,10) [10,25) [25,100) <NA>
66392 9986 2969 9242
Question 12
Make covid_burden
into a factor variable with levels in the order of “Low”, “Medium”, “High”. Notice if you don’t specify the levels then it will sort them alphabetically. Compare your counts of each level to check that you specified them correctly.
<- hospital_pop_data |>
hospital_pop_data mutate(
covid_burden = cut(
/ all_hospital_beds )*100,
(hospitalized_covid_confirmed_patients breaks=c(0,10,25,100),right=FALSE,labels=c("low","medium","high"))
)table(hospital_pop_data$covid_burden, useNA="ifany")
low medium high <NA>
66392 9986 2969 9242