Rows: 67000 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): county
dbl (7): hospitalized_covid_confirmed_patients, hospitalized_suspected_covi...
date (1): todays_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Use glimpse() to get a preview of the data and view the columns.
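As a minimal sketch of what glimpse() shows, the example below builds a tiny stand-in data frame. The name toy_hosp and all of its values are invented for illustration; only the column names are taken from the hospital data.

```r
library(dplyr)

# Toy stand-in for the hospital data; the values are made up.
toy_hosp <- tibble(
  county = c("Los Angeles", "Orange", "Sonoma"),
  todays_date = as.Date(c("2020-12-01", "2020-12-01", "2020-12-01")),
  hospitalized_covid_confirmed_patients = c(10, 4, 1)
)

# Prints one line per column: its name, its type, and the first few values.
glimpse(toy_hosp)
```

Because glimpse() lists columns down the page, it is usually easier to read than print() when a data frame has many columns with long names.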
Say we want to compare the average and median number of COVID confirmed hospitalizations for the counties of Los Angeles, Orange, San Francisco, Sonoma, and San Diego. (You may find the %in% operator helpful.)
# A tibble: 5 × 3
  county        ave_covid_hosp median_covid_hosp
  <chr>                  <dbl>             <dbl>
1 Los Angeles           1257.                763
2 Orange                 332.                197
3 San Diego              350.                258
4 San Francisco           75.0                63
5 Sonoma                  32.3                28
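One possible shape for a summary like the one above, sketched on invented data: filter to the counties of interest with %in%, group by county, then summarize. The objects toy_hosp, counties_of_interest, and toy_summary are hypothetical names, and the counts are made up, so the numbers will not match the table.

```r
library(dplyr)

# Toy data with invented counts; Alameda is included so the filter has something to drop.
toy_hosp <- tibble(
  county = c("Los Angeles", "Los Angeles", "Los Angeles", "Orange", "Orange", "Alameda"),
  hospitalized_covid_confirmed_patients = c(10, 20, 60, 4, 6, 7)
)

counties_of_interest <- c("Los Angeles", "Orange", "San Francisco", "Sonoma", "San Diego")

toy_summary <- toy_hosp %>%
  # %in% is TRUE for rows whose county appears anywhere in the vector
  filter(county %in% counties_of_interest) %>%
  group_by(county) %>%
  summarize(
    ave_covid_hosp = mean(hospitalized_covid_confirmed_patients, na.rm = TRUE),
    median_covid_hosp = median(hospitalized_covid_confirmed_patients, na.rm = TRUE)
  )

toy_summary
```

Using %in% avoids chaining several county == "..." conditions together with |, which gets hard to read past two or three counties.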
Trouble with filter
Some other packages also have functions named filter() or select(), and if one of those packages was loaded most recently, its version can take the place of the tidyverse one and cause problems. In the future, when you are doing data cleaning, if filter() or select() is not working but you are confident you have called it correctly, check that tidyverse is your most recently loaded package.
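Besides checking the load order, another option is to name the package explicitly with the :: operator, so it no longer matters which package was loaded last. The sketch below uses a small invented data frame (toy and sonoma_only are hypothetical names).

```r
library(dplyr)

# Invented data for illustration.
toy <- tibble(
  county = c("Sonoma", "Orange"),
  hospitalized_covid_confirmed_patients = c(1, 4)
)

# dplyr::filter() and dplyr::select() always refer to the dplyr versions,
# even if another loaded package defines functions with the same names.
sonoma_only <- toy %>%
  dplyr::filter(county == "Sonoma") %>%
  dplyr::select(county, hospitalized_covid_confirmed_patients)

sonoma_only
```

The prefix is more typing, so it is mostly worth using on the specific calls that misbehave.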
It does not necessarily make sense to compare raw counts because these counties do not have similar populations. Load in the county population data and join it with the hospital data. (I recommend you do not save over your data; instead, make a new data frame for the combined data.)
county_pop <- read_csv("county-pop.csv") %>%
  rename(county = County, population = Population)
Rows: 58 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): County
dbl (1): Population
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
comb_hosp_pop <- full_join(x = hosp_data, y = county_pop)
Joining with `by = join_by(county)`
When you join data, always check that the joined data set has the expected number of rows and columns. If it does not, you may have used the wrong join function, or one of your data sets may be missing values or have extra values.
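A minimal sketch of such a check, on invented data: compare the dimensions of the joined data frame to the inputs, then use anti_join() to list the keys that have no match on the other side. The names toy_hosp, toy_pop, toy_full, and extra_counties are hypothetical, and the populations are made up.

```r
library(dplyr)

toy_hosp <- tibble(
  county = c("Orange", "Orange", "Sonoma"),
  hospitalized_covid_confirmed_patients = c(4, 6, 1)
)
toy_pop <- tibble(
  county = c("Orange", "Sonoma", "Alpine"),  # Alpine has no hospital rows
  population = c(1000, 500, 10)
)

toy_full <- full_join(x = toy_hosp, y = toy_pop, by = "county")

# Row check: the joined data has more rows than the hospital data,
# a sign that the join added something we may not have expected.
nrow(toy_hosp)
nrow(toy_full)

# Column check: both inputs' columns, with the shared key counted once.
ncol(toy_full) == ncol(toy_hosp) + ncol(toy_pop) - 1

# Which population rows have no matching hospital rows?
extra_counties <- anti_join(toy_pop, toy_hosp, by = "county")
extra_counties
```

Swapping the two arguments of anti_join() answers the opposite question: which hospital rows have no population record.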
Here we care about the COVID hospital data, and are using the county population data to add info. There are more counties present in the county population data than in the COVID hospital data. We only want to keep the info for the counties present in the COVID hospital data. Consider which join is most appropriate for this, and change your join function accordingly.
comb_hosp_pop <- left_join(x = hosp_data, y = county_pop)
Joining with `by = join_by(county)`
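To see the difference between the two joins side by side, here is a small sketch on invented data (toy_hosp, toy_pop, toy_left, and toy_full are hypothetical names, and the populations are made up). The population table contains one county that never appears in the hospital table.

```r
library(dplyr)

toy_hosp <- tibble(
  county = c("Orange", "Sonoma"),
  hospitalized_covid_confirmed_patients = c(4, 1)
)
toy_pop <- tibble(
  county = c("Orange", "Sonoma", "Alpine"),
  population = c(1000, 500, 10)
)

# left_join() keeps every row of x and only the matching rows of y.
toy_left <- left_join(x = toy_hosp, y = toy_pop, by = "county")

# full_join() keeps every row of both, filling the gaps with NA.
toy_full <- full_join(x = toy_hosp, y = toy_pop, by = "county")

nrow(toy_left)  # same number of rows as toy_hosp
nrow(toy_full)  # one extra row for Alpine, with NA hospitalizations
```

Passing by = "county" explicitly also silences the "Joining with" message and documents which column the join uses.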
Now that our data is joined, make a new variable that records, for each county, the daily percent of the population that is in the hospital with confirmed COVID.
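A sketch of one way to build such a variable with mutate(), on invented data. The column name per_covid_hosp is a hypothetical choice, and multiplying by 100 assumes "percent" means the share of the population times 100.

```r
library(dplyr)

# Invented joined data: one row per county per day.
toy_comb <- tibble(
  county = c("Orange", "Sonoma"),
  hospitalized_covid_confirmed_patients = c(5, 10),
  population = c(1000, 500)
)

toy_comb <- toy_comb %>%
  # percent = 100 * (people hospitalized) / (people in the county)
  mutate(per_covid_hosp = 100 * hospitalized_covid_confirmed_patients / population)

toy_comb
```

In this toy example, Sonoma has the higher percent (2 versus 0.5) only because of the made-up numbers; the point is that the ranking by percent can differ from the ranking by raw count.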
Compute the average daily percent of the population hospitalized with confirmed COVID for the counties of Los Angeles, Orange, San Francisco, Sonoma, and San Diego.
# A tibble: 5 × 2
  county        ave_per_covid_hosp
  <chr>                      <dbl>
1 Los Angeles              0.0125
2 Orange                   0.0105
3 San Diego                0.0105
4 San Francisco            0.00851
5 Sonoma                   0.00654
This data covers a wide date range. Let’s narrow it down and recompute the averages above using only records between December 2020 and February 2021.
# A tibble: 5 × 2
  county        ave_per_covid_hosp
  <chr>                      <dbl>
1 Los Angeles               0.0607
2 Orange                    0.0536
3 San Diego                 0.0401
4 San Francisco             0.0207
5 Sonoma                    0.0163
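Restricting to a date window can be sketched as a filter() on the date column. The example below uses invented rows and assumes the window runs from December 1, 2020 through February 28, 2021, inclusive; toy_dates, toy_winter, and per_covid_hosp are hypothetical names.

```r
library(dplyr)

# Invented data: two dates inside the window and two just outside it.
toy_dates <- tibble(
  todays_date = as.Date(c("2020-11-30", "2020-12-01", "2021-02-28", "2021-03-01")),
  per_covid_hosp = c(0.01, 0.05, 0.04, 0.02)
)

toy_winter <- toy_dates %>%
  # Conditions separated by commas are combined with AND.
  filter(
    todays_date >= as.Date("2020-12-01"),
    todays_date <= as.Date("2021-02-28")
  )

toy_winter
```

Comparing against as.Date() values rather than plain strings keeps the comparison in date terms, which matters if the column was read in as a date.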
What do you notice when you compare the average daily percent hospitalized with confirmed covid for the entire time range with that for the selected few months?
Rows: 67000 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): county
dbl (7): hospitalized_covid_confirmed_patients, hospitalized_suspected_covi...
date (1): todays_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ca_covid_hosp_data %>%
  ggplot(aes(fill = county, x = todays_date, y = hospitalized_covid_confirmed_patients)) +
  geom_bar(position = "stack", stat = "identity")
The above plot tells us something about CA trends, but prevents us from comparing trends among counties, in addition to being absolutely hideous. Let’s focus on just five counties.
ca_five_county_covid_hosp_data <- ca_covid_hosp_data %>%
  filter(county %in% c("Los Angeles", "Orange", "Sacramento", "Santa Clara", "San Francisco"))

ca_five_county_covid_hosp_data %>%
  ggplot(aes(fill = county, x = todays_date, y = hospitalized_covid_confirmed_patients)) +
  geom_bar(position = "stack", stat = "identity")
Your task
The 5 county graph is more readable than the first graph, but still has tons of problems if one really wants to compare COVID-19 hospitalization trends across CA counties. Create your own visualization of the 5 county data, remembering the best practices that we talked about in the lecture. There is more than one way of doing this, so don’t be inhibited by trying to think of “the right” solution. Also, depending on what information you want to convey with your plot, you may consider making and plotting a new variable scaled by the population of the county.
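As one starting point among many, the sketch below draws one line per county and scales the counts by population. It runs on invented data: toy_five, hosp_per_100k, the counts, and the populations are all hypothetical, and the per-100,000 scaling is just one possible choice.

```r
library(dplyr)
library(ggplot2)

# Invented data: two counties, three days each.
toy_five <- tibble(
  county = rep(c("Los Angeles", "Orange"), each = 3),
  todays_date = rep(as.Date(c("2020-12-01", "2020-12-02", "2020-12-03")), times = 2),
  hospitalized_covid_confirmed_patients = c(100, 120, 150, 30, 35, 45),
  population = rep(c(10000, 3000), each = 3)
)

p <- toy_five %>%
  # Scale by population so large and small counties share one axis.
  mutate(hosp_per_100k = 1e5 * hospitalized_covid_confirmed_patients / population) %>%
  ggplot(aes(x = todays_date, y = hosp_per_100k, color = county)) +
  geom_line() +
  labs(
    x = "Date",
    y = "Confirmed COVID hospitalizations per 100,000 residents",
    color = "County"
  )

p
```

Lines that start from a common baseline make each county's trend readable on its own, which stacked bars do not; adding facet_wrap(~ county) is another option if the lines overlap too much.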