Day 2 Lab Part 1

Univariate plotting

library(tidyverse) # The tidyverse package is a massive package that includes packages such as ggplot

library(janitor) # The janitor package has functions to help clean data, such as clean_names()

Question 1

For this lab we will be exploring the Arthritis Treatment data. Visit the website, read the abstract, and browse the data dictionary to learn more about this data.

1a. Read and clean the data by running the two following code chunks.

arthritis_og <- read_csv("https://raw.githubusercontent.com/cosmos-uci-dshs/data/main/RheumArth_Tx_AgeComparisons.csv")

Rows: 530 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (14): ID, Age, AgeGp, Sex, Yrs_From_Dx, CDAI, CDAI_YN, DAS_28, DAS28_YN,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

At this point we do not expect you to understand the code in the following chunk, we will learn data wrangling on a later date.

# clean_names() changes the variable names to fit tidyverse style guidelines
# factor() is re-coding the digits to meaningful labels

arthritis <- clean_names(arthritis_og) %>%
  mutate(
    sex = factor(
      sex,
      levels = c(0,1),
      labels = c("Female", "Male")
    ),
    cdai_yn = factor(
      cdai_yn,
      levels = c(1,2),
      labels = c("No", "Yes")
    ),
    das28_yn = factor(
      das28_yn,
      levels = c(1,2),
      labels = c("No", "Yes")
    ),
    steroids_gt_5 = factor(
      steroids_gt_5,
      levels = c(0,1),
      labels = c("No", "Yes")
    ),
    dmar_ds = factor(
      dmar_ds,
      levels = c(0,1),
      labels = c("No", "Yes")
    ),
    biologics = factor(
      biologics,
      levels = c(0,1),
      labels = c("No", "Yes")
    ),
    s_dmards = factor(
      s_dmards,
      levels = c(0,1),
      labels = c("No", "Yes")
    ),
    osteop_screen = factor(
      osteop_screen,
      levels = c(0,1),
      labels = c("No", "Yes")
    ),
    age_gp = factor(
      age_gp,
      levels = c(1,2),
      labels = c("control", "elderly")
    )
  )

1b. Use head(arthritis, n = 10) to view the first 10 rows of the data frame.

head(arthritis,n=10)

# A tibble: 10 × 14
      id   age age_gp  sex    yrs_from_dx  cdai cdai_yn das_28 das28_yn
   <dbl> <dbl> <fct>   <fct>        <dbl> <dbl> <fct>    <dbl> <fct>   
 1     1    85 elderly Female          27  NA   No          NA No      
 2     2    86 elderly Female          27  23   Yes         NA No      
 3     3    83 elderly Female          10  14.5 Yes         NA No      
 4     4    83 elderly Female           9  NA   No          NA No      
 5     5    85 elderly Female          NA  NA   No          NA No      
 6     6    79 elderly Male            NA  NA   No          NA No      
 7     7    90 elderly Female          51  NA   No          NA No      
 8     8    90 elderly Female          11  40   Yes         NA No      
 9     9    87 elderly Female          36   6   Yes         NA No      
10    10    82 elderly Female           4  NA   No          NA No      
# ℹ 5 more variables: steroids_gt_5 <fct>, dmar_ds <fct>, biologics <fct>,
#   s_dmards <fct>, osteop_screen <fct>

Note:

NA means missing data.
The assumed variable type is under the variable name. A factor is a categorical variable with the grouping specified (e.g. cdai_yn is either yes or no) while a double is a numerical variable (e.g. age).

1c. Use glimspe(arthritis) to see an overview of the data. How many observations and variables does this data have?

glimpse(arthritis)

Rows: 530
Columns: 14
$ id            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ age           <dbl> 85, 86, 83, 83, 85, 79, 90, 90, 87, 82, 77, 86, 84, 76, …
$ age_gp        <fct> elderly, elderly, elderly, elderly, elderly, elderly, el…
$ sex           <fct> Female, Female, Female, Female, Female, Male, Female, Fe…
$ yrs_from_dx   <dbl> 27, 27, 10, 9, NA, NA, 51, 11, 36, 4, 31, NA, 9, 10, 3, …
$ cdai          <dbl> NA, 23.0, 14.5, NA, NA, NA, NA, 40.0, 6.0, NA, 0.0, NA, …
$ cdai_yn       <fct> No, Yes, Yes, No, No, No, No, Yes, Yes, No, Yes, No, No,…
$ das_28        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2.44…
$ das28_yn      <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, Yes,…
$ steroids_gt_5 <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, No, Yes, No, No,…
$ dmar_ds       <fct> Yes, Yes, Yes, Yes, No, No, Yes, No, No, Yes, Yes, No, Y…
$ biologics     <fct> No, No, Yes, No, No, No, Yes, No, Yes, No, No, No, No, Y…
$ s_dmards      <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, …
$ osteop_screen <fct> No, Yes, Yes, Yes, No, No, No, Yes, Yes, Yes, Yes, Yes, …

1d. What are the unique values that the variables age_gp, and osteop_screencan take? Hint: you can use the function unique() to figure this out.

unique(arthritis$age_gp)

[1] elderly control
Levels: control elderly

unique(arthritis$osteop_screen)

[1] No   Yes  <NA>
Levels: No Yes

Question 2

Now we are going to investigate using some numerical and graphical summaries of our data.

2a. Use summary(arthitis) to view quick summaries of each of the variables. What are some observations about the distribution of the data or missing data?

summary(arthritis)

       id             age            age_gp        sex       yrs_from_dx  
 Min.   :  1.0   Min.   :42.00   control:459   Female:428   Min.   : 1.0  
 1st Qu.:162.2   1st Qu.:54.00   elderly: 71   Male  :102   1st Qu.: 3.0  
 Median :294.5   Median :59.00                              Median : 7.0  
 Mean   :290.6   Mean   :60.69                              Mean   : 9.4  
 3rd Qu.:426.8   3rd Qu.:66.00                              3rd Qu.:11.0  
 Max.   :559.0   Max.   :90.00                              Max.   :70.0  
                                                            NA's   :15    
      cdai      cdai_yn       das_28       das28_yn  steroids_gt_5 dmar_ds   
 Min.   : 0.0   No :324   Min.   : 0.000   No :464   No  :405      No  :149  
 1st Qu.: 6.0   Yes:206   1st Qu.: 1.825   Yes: 66   Yes :124      Yes :380  
 Median :10.0             Median : 2.500             NA's:  1      NA's:  1  
 Mean   :13.1             Mean   : 2.923                                     
 3rd Qu.:17.0             3rd Qu.: 3.310                                     
 Max.   :71.0             Max.   :23.000                                     
 NA's   :324              NA's   :464                                        
 biologics  s_dmards   osteop_screen
 No  :332   No  :502   No  :216     
 Yes :197   Yes : 27   Yes :306     
 NA's:  1   NA's:  1   NA's:  8

cdai has a lot of missing variables!

2b. To better understand the ages distribution of the patients, it would be helpful to have a visual. Make a one variable plot that helps visualize the distribution of age. What do you observe from the plot?

ggplot(arthritis) + geom_histogram(aes(age),bins=10,col="black",fill="gray")

Less older people!

2c. Create a plot to visualize the counts of females and males in this study. Use add the layer labs(x = "Sex", y = "Count", title = "Study participants by sex") to specify informative titles. For all plots in this document make sure to manually specify the axis labels instead of using the defaults.

2d. Why would a histogram not be appropriate to visualize the counts of females and males in this study? What is a variable in this data that would be appropriate to visualize with a histogram?

2e. Choose a plot to summarize the distribution of the clinical disease activity indicator. Note the median cdai as well as extreme outliers.

ggplot(arthritis) + geom_histogram(aes(x=cdai),bins=20)

Warning: Removed 324 rows containing non-finite outside the scale range
(`stat_bin()`).

Multivariate plotting

Question 3

Of interest was whether elderly patients were less likely to have disease activity measured and less likely to received aggressive treatment. This means we want to look at one variable grouped by another variable.

3a. Below we use group_by() and summarize() to obtain the counts of each age group as well as the counts and percents of people within each group that did not have clinical disease activity measured.

summary_arthritis <- arthritis %>% 
  group_by(age_gp) %>% 
  summarize(
    total_count = n(),
    cdai_NA_count = sum(is.na(cdai)),
    cdai_NA_per = 100 * cdai_NA_count / total_count
  )
summary_arthritis

# A tibble: 2 × 4
  age_gp  total_count cdai_NA_count cdai_NA_per
  <fct>         <int>         <int>       <dbl>
1 control         459           266        58.0
2 elderly          71            58        81.7

Does there appear to be an equal amount of cdai missingness between the two groups? Did you use the counts or percentages to conclude this and why? Do you think this could lead to problems in the analysis?

Data appears to be not missing at random!

3b. Let’s plot this missingness. We want to plot relative cdai_yn for each age_gp. Chose an appropriate way to visualize this.

ggplot(summary_arthritis) + geom_col(aes(y=cdai_NA_per,x= age_gp)) + theme_bw()

Question 4

Now we want to investigate the relationship between years since diagnosis and clinical disease activity indicator.

4a. Create a plot to visualize the relationship between these two variables. What information does your plot provide you with?

ggplot(arthritis) + geom_point(aes(x=yrs_from_dx, y = cdai))

Warning: Removed 325 rows containing missing values or values outside the scale range
(`geom_point()`).

4b. Now we want to learn if the relationship differs for patients that were and were not on steroids at more than 5 mg daily. Create a plot to aid in this investigation and state your findings.

# two possible solutions

# facet wrap (maybe harder to see?)
ggplot(arthritis) + geom_point(aes(x=yrs_from_dx, y = cdai)) + facet_wrap(~steroids_gt_5)

Warning: Removed 325 rows containing missing values or values outside the scale range
(`geom_point()`).

# different colors
ggplot(arthritis) + geom_point(aes(x=yrs_from_dx, y = cdai,color = steroids_gt_5))

Warning: Removed 325 rows containing missing values or values outside the scale range
(`geom_point()`).

Question 5

Visualizing data is an art and there is not necessarily one perfect way. Compare your answers with you neighbors and debate your plotting choices.

Extra challenge: Spend some time looking up plotting options and making your plots more visually appealing (e.g. changing the theme, colors, font size, legend position, etc). Look up how to change your legend title and apply it to any of your plots with legends.

Question 6: Practicing file reading

We have seen how to read csv files and Excel files in lecture. But there is unfortunately many many types of files! Some examples are tab-separated files (.tsv or .txt files), STATA files, SAS files…

In the data folder there is two files to allow you to practice reading data in multiple formats.

pizzadata.tsv (Tab Separated File)
RheumArth_Tx_AgeComparisons.dta (STATA file)

pizza <- read_tsv("data/pizzadata.tsv")

Rows: 48620 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (5): pizza_name_id, order_date, pizza_size, pizza_category, pizza_name
dbl  (5): pizza_id, order_id, quantity, unit_price, total_price
time (1): order_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

library(haven)
arth_stat <- read_dta("data/RheumArth_Tx_AgeComparisons.dta")

Since many different files can be available and it is not possible to memorize the functions for each of them, try to use Google or tidyverse documentation to find the best function to read these files into R.

Once you are done, can you answer this question?

Q1: How many unique pizza_name values are there? Of the pizzas sold, which one is your favorite?

unique(pizza$pizza_name)

 [1] "The Hawaiian Pizza"                        
 [2] "The Classic Deluxe Pizza"                  
 [3] "The Five Cheese Pizza"                     
 [4] "The Italian Supreme Pizza"                 
 [5] "The Mexicana Pizza"                        
 [6] "The Thai Chicken Pizza"                    
 [7] "The Prosciutto and Arugula Pizza"          
 [8] "The Barbecue Chicken Pizza"                
 [9] "The Greek Pizza"                           
[10] "The Spinach Supreme Pizza"                 
[11] "The Green Garden Pizza"                    
[12] "The Italian Capocollo Pizza"               
[13] "The Spicy Italian Pizza"                   
[14] "The Spinach Pesto Pizza"                   
[15] "The Vegetables + Vegetables Pizza"         
[16] "The Southwest Chicken Pizza"               
[17] "The California Chicken Pizza"              
[18] "The Pepperoni Pizza"                       
[19] "The Chicken Pesto Pizza"                   
[20] "The Big Meat Pizza"                        
[21] "The Soppressata Pizza"                     
[22] "The Four Cheese Pizza"                     
[23] "The Napolitana Pizza"                      
[24] "The Calabrese Pizza"                       
[25] "The Italian Vegetables Pizza"              
[26] "The Mediterranean Pizza"                   
[27] "The Pepper Salami Pizza"                   
[28] "The Spinach and Feta Pizza"                
[29] "The Sicilian Pizza"                        
[30] "The Chicken Alfredo Pizza"                 
[31] "The Pepperoni, Mushroom, and Peppers Pizza"
[32] "The Brie Carre Pizza"

Q2: Extra challenge Can you figure out how what is the name of the pizza with the most sales (More on this on day 3)?

pizza %>% group_by(pizza_name) %>% summarize(sold = n()) %>% arrange(desc(sold)) %>% head(5)

# A tibble: 5 × 2
  pizza_name                  sold
  <chr>                      <int>
1 The Classic Deluxe Pizza    2416
2 The Barbecue Chicken Pizza  2372
3 The Hawaiian Pizza          2370
4 The Pepperoni Pizza         2369
5 The Thai Chicken Pizza      2315

Classic Deluxe Pizza!

Once you have finished this lab you should save and commit your changes to Git, then pull and push to GitHub.