Day 2 Lab Part 2

Visualizing California COVID-19 hospital data

Today we are also going to work with California COVID hospital data. This is available in the data folder of the repository you cloned. Try reading in the data using an appropriate path!

Question 1

1a. Begin by reading in the data by running the chunk below.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ca_covid_hosp_data <- read_csv("data/covid19hospitalbycounty.csv")
Rows: 67000 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): county
dbl  (7): hospitalized_covid_confirmed_patients, hospitalized_suspected_covi...
date (1): todays_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1b. Use glimpse() to get a preview of the data and view the columns.

glimpse(ca_covid_hosp_data)
Rows: 67,000
Columns: 9
$ county                                <chr> "Kern", "Kern", "Shasta", "El Do…
$ todays_date                           <date> 2020-03-27, 2020-03-29, 2020-03…
$ hospitalized_covid_confirmed_patients <dbl> 0, 16, 0, 0, 74, 24, 1, 0, 1, 20…
$ hospitalized_suspected_covid_patients <dbl> 0, 57, 0, 23, 167, 85, 10, 9, 5,…
$ hospitalized_covid_patients           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ all_hospital_beds                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ icu_covid_confirmed_patients          <dbl> 0, 8, 0, 0, 31, 5, 0, 0, 0, 9, 1…
$ icu_suspected_covid_patients          <dbl> 0, 8, 0, 12, 40, 19, 0, 0, 0, 8,…
$ icu_available_beds                    <dbl> NA, 39, NA, 11, 131, 14, 18, NA,…

1c. How many observations and variables are in this data set? Which variables are categorical and which are numeric?

ca_covid_hosp_data %>% nrow()
[1] 67000
ca_covid_hosp_data %>% ncol()
[1] 9
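
For the second half of 1c, one way to see which variables are categorical and which are numeric is to look at each column's class. The sketch below runs on a tiny stand-in tibble (toy_hosp is a hypothetical name, and its values are invented; only the column names come from the real data) so that it works on its own. In the lab you would pass ca_covid_hosp_data instead.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for ca_covid_hosp_data: real column names, invented values.
toy_hosp <- tibble(
  county = c("Kern", "Shasta", "Kern"),
  todays_date = as.Date(c("2020-03-27", "2020-03-27", "2020-03-28")),
  hospitalized_covid_confirmed_patients = c(0, 1, 16)
)

# One class label per column: character columns are categorical,
# numeric columns are numeric, and Date columns are dates.
col_classes <- sapply(toy_hosp, function(col) class(col)[1])
print(col_classes)

# Pull out the names of the numeric and the character columns.
numeric_cols <- toy_hosp %>% select(where(is.numeric)) %>% names()
categorical_cols <- toy_hosp %>% select(where(is.character)) %>% names()
```

Note that a date column is neither character nor numeric here, so it is worth deciding separately how you want to describe todays_date.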

Question 2

2a. Run the following code chunk to make a bar plot of the number of hospitalized COVID confirmed patients by date and colored by county.

ggplot(
  data = ca_covid_hosp_data,
  aes(
    x = todays_date,
    y = hospitalized_covid_confirmed_patients,
    fill = county
  )
) +   
  geom_bar(position = "stack", stat = "identity")
Warning: Removed 8 rows containing missing values or values outside the scale range
(`geom_bar()`).
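
The warning above says that some rows were dropped before plotting. One plausible cause (an assumption, not something the output confirms) is missing values in the plotted column. The sketch below counts NA values per column on a small invented tibble (toy_hosp is a hypothetical stand-in for ca_covid_hosp_data), which is one way to check.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for ca_covid_hosp_data: real column names, invented values.
toy_hosp <- tibble(
  county = c("Kern", "Kern", "Shasta", "Shasta"),
  hospitalized_covid_confirmed_patients = c(0, NA, 3, NA),
  icu_available_beds = c(NA, 39, 11, 14)
)

# Count the missing values in every column.
na_counts <- toy_hosp %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
print(na_counts)

# Look at the rows that would be dropped when plotting this y variable.
rows_missing_y <- toy_hosp %>%
  filter(is.na(hospitalized_covid_confirmed_patients))
print(rows_missing_y)
```

If the count for the plotted column in the real data matches the number in the warning, missing values are the likely explanation.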

2b. The above plot tells us something about CA trends, but prevents us from comparing trends among counties, in addition to being absolutely hideous. Let’s focus on just five counties.

ca_five_county_covid_hosp_data <- ca_covid_hosp_data %>%   
  filter(county %in% c("Los Angeles", "Orange", "Sacramento", "Santa Clara", "San Francisco"))  

Redo the plot from question 2a with this subsetted data.

ggplot(
  data = ca_five_county_covid_hosp_data,
  aes(
    x = todays_date,
    y = hospitalized_covid_confirmed_patients,
    fill = county
  )
) +   
  geom_bar(position = "stack", stat = "identity")

ggplot(
  data = ca_five_county_covid_hosp_data,
  aes(
    x = todays_date,
    y = icu_covid_confirmed_patients,
    fill = county
  )
) +   
  geom_bar(position = "stack", stat = "identity")

Question 3

The 5 county graph from question 2b is more readable than the graph in 2a, but still has tons of problems if one really wants to compare COVID-19 hospitalization trends across CA counties. Create your own visualization of the 5 county data, remembering best practices that we talked about in the lecture. There is more than one way of doing this, so don’t be inhibited by trying to think of “the right” solution.

ggplot(
  data = ca_five_county_covid_hosp_data,
  aes(
    x = todays_date,
    y = hospitalized_covid_confirmed_patients
  )
) +
  geom_line() +
  facet_wrap(~county, ncol = 1) +
  ggtitle("Hospitalized Covid Patients (Confirmed)") +
  ylab("Number of hospitalized COVID patients") +
  xlab("Date")

ggplot(
  data = ca_five_county_covid_hosp_data,
  aes(
    x = todays_date,
    y = hospitalized_covid_confirmed_patients
  )
) +
  geom_line() +
  facet_wrap(~county) +
  ggtitle("Hospitalized Covid Patients (Confirmed)") +
  ylab("Number of hospitalized COVID patients") +
  xlab("Date")
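
Another option, sketched here as one possibility rather than the intended answer, is to draw all counties in a single panel and map county to line color so the trends share one set of axes. The data frame toy_five is a hypothetical stand-in with invented values; in the lab you would use ca_five_county_covid_hosp_data.

```r
library(ggplot2)

# Hypothetical stand-in for ca_five_county_covid_hosp_data: invented values.
toy_five <- data.frame(
  todays_date = rep(as.Date("2020-04-01") + 0:2, times = 2),
  county = rep(c("Orange", "Sacramento"), each = 3),
  hospitalized_covid_confirmed_patients = c(100, 120, 130, 20, 25, 22)
)

# One line per county; color sits inside aes() because it is mapped to a variable.
p <- ggplot(
  data = toy_five,
  aes(
    x = todays_date,
    y = hospitalized_covid_confirmed_patients,
    color = county
  )
) +
  geom_line() +
  labs(
    title = "Hospitalized COVID patients (confirmed)",
    x = "Date",
    y = "Number of hospitalized COVID patients",
    color = "County"
  )

print(p)
```

A single panel makes direct comparison easy, but a county with much larger counts can flatten the others. Faceting with facet_wrap(~county, scales = "free_y") trades comparability of levels for visibility of each county's own trend.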

Question 4: Pizza dataset

Let’s return to the pizza dataset we used in the previous lab. We want to create a graph that lets us visualize, and quickly see, whether there are differences in price by pizza category. Before even starting to code, what are some types of graphs that you think can show this?

pizza <- read_tsv("data/pizzadata.tsv")
Rows: 48620 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (5): pizza_name_id, order_date, pizza_size, pizza_category, pizza_name
dbl  (5): pizza_id, order_id, quantity, unit_price, total_price
time (1): order_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Follow these steps:

  1. What are the different categories of pizza? (See the variable pizza_category.)
unique(pizza$pizza_category)
[1] "Classic" "Veggie"  "Supreme" "Chicken"
  2. What is an appropriate plot to see the distribution of total_price across all pizzas?
ggplot(data=pizza) + geom_histogram(aes(x=total_price),fill="white", color="black")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

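A common ggplot mistake is to put constant fill and color values inside aes(). In the code below, ggplot treats "white" and "black" as data values to map to a scale rather than as literal colors, so the bars do not come out white with a black outline and an unwanted legend appears.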

ggplot(data=pizza) + geom_histogram(aes(x=total_price,fill="white", color="black"))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  3. How would you plot total_price by pizza_category to compare prices across categories?
ggplot(data=pizza) + geom_histogram(aes(x=total_price,fill=pizza_category), color="black")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
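
A stacked histogram can make it hard to compare categories, because each category's bars sit on top of the others. One alternative, shown here as a sketch rather than the expected answer, is a boxplot with one box per category. The data frame toy_pizza is a hypothetical stand-in with invented prices; in the lab you would use pizza.

```r
library(ggplot2)

# Hypothetical stand-in for the pizza data: real column names, invented prices.
toy_pizza <- data.frame(
  pizza_category = rep(c("Classic", "Veggie", "Supreme", "Chicken"), each = 3),
  total_price = c(12, 16, 20.5, 12, 16.75, 20.25, 12.5, 16.5, 20.75, 12.75, 16.75, 20.75)
)

# One box per category: the median, quartiles, and outliers are directly comparable.
p_box <- ggplot(
  data = toy_pizza,
  aes(x = pizza_category, y = total_price)
) +
  geom_boxplot() +
  labs(x = "Pizza category", y = "Total price")

print(p_box)
```

Swapping geom_boxplot() for geom_violin() shows the shape of each distribution as well, at the cost of hiding the quartile markers.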

Now, let’s first ensure pizza_size is a factor.

pizza$pizza_size <- as.factor(pizza$pizza_size)

If you make the same plot as above, but look at how total_price varies by pizza_size, what do you see?

ggplot(data=pizza) + geom_histogram(aes(x=total_price,fill=pizza_size),color="black")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data=pizza) + geom_histogram(aes(x=total_price,fill=pizza_size), bins= 60,color="black")

ggplot(data=pizza) + geom_density(aes(x=total_price,fill=pizza_size), color="black") + xlim(c(5,60))
Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_density()`).

Check with your classmates if you have made the same plot or different plots to answer this question!