library(tidyverse) # The tidyverse package is a massive package that includes packages such as ggplot
library(janitor) # The janitor package has functions to help clean data, such as clean_names()
Day 2 Lab Part 1
Univariate plotting
Question 1
For this lab we will be exploring the Arthritis Treatment data. Visit the website, read the abstract, and browse the data dictionary to learn more about this data.
1a. Read and clean the data by running the two following code chunks.
<- read_csv("https://raw.githubusercontent.com/cosmos-uci-dshs/data/main/RheumArth_Tx_AgeComparisons.csv") arthritis_og
Rows: 530 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (14): ID, Age, AgeGp, Sex, Yrs_From_Dx, CDAI, CDAI_YN, DAS_28, DAS28_YN,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
At this point we do not expect you to understand the code in the following chunk, we will learn data wrangling on a later date.
# clean_names() changes the variable names to fit tidyverse style guidelines
# factor() is re-coding the digits to meaningful labels
<- clean_names(arthritis_og) %>%
arthritis mutate(
sex = factor(
sex,levels = c(0,1),
labels = c("Female", "Male")
),cdai_yn = factor(
cdai_yn,levels = c(1,2),
labels = c("No", "Yes")
),das28_yn = factor(
das28_yn,levels = c(1,2),
labels = c("No", "Yes")
),steroids_gt_5 = factor(
steroids_gt_5,levels = c(0,1),
labels = c("No", "Yes")
),dmar_ds = factor(
dmar_ds,levels = c(0,1),
labels = c("No", "Yes")
),biologics = factor(
biologics,levels = c(0,1),
labels = c("No", "Yes")
),s_dmards = factor(
s_dmards,levels = c(0,1),
labels = c("No", "Yes")
),osteop_screen = factor(
osteop_screen,levels = c(0,1),
labels = c("No", "Yes")
),age_gp = factor(
age_gp,levels = c(1,2),
labels = c("control", "elderly")
) )
1b. Use head(arthritis, n = 10)
to view the first 10 rows of the data frame.
head(arthritis,n=10)
# A tibble: 10 × 14
id age age_gp sex yrs_from_dx cdai cdai_yn das_28 das28_yn
<dbl> <dbl> <fct> <fct> <dbl> <dbl> <fct> <dbl> <fct>
1 1 85 elderly Female 27 NA No NA No
2 2 86 elderly Female 27 23 Yes NA No
3 3 83 elderly Female 10 14.5 Yes NA No
4 4 83 elderly Female 9 NA No NA No
5 5 85 elderly Female NA NA No NA No
6 6 79 elderly Male NA NA No NA No
7 7 90 elderly Female 51 NA No NA No
8 8 90 elderly Female 11 40 Yes NA No
9 9 87 elderly Female 36 6 Yes NA No
10 10 82 elderly Female 4 NA No NA No
# ℹ 5 more variables: steroids_gt_5 <fct>, dmar_ds <fct>, biologics <fct>,
# s_dmards <fct>, osteop_screen <fct>
Note:
NA
means missing data.- The assumed variable type is under the variable name. A factor is a categorical variable with the grouping specified (e.g. cdai_yn is either yes or no) while a double is a numerical variable (e.g. age).
1c. Use glimspe(arthritis)
to see an overview of the data. How many observations and variables does this data have?
glimpse(arthritis)
Rows: 530
Columns: 14
$ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ age <dbl> 85, 86, 83, 83, 85, 79, 90, 90, 87, 82, 77, 86, 84, 76, …
$ age_gp <fct> elderly, elderly, elderly, elderly, elderly, elderly, el…
$ sex <fct> Female, Female, Female, Female, Female, Male, Female, Fe…
$ yrs_from_dx <dbl> 27, 27, 10, 9, NA, NA, 51, 11, 36, 4, 31, NA, 9, 10, 3, …
$ cdai <dbl> NA, 23.0, 14.5, NA, NA, NA, NA, 40.0, 6.0, NA, 0.0, NA, …
$ cdai_yn <fct> No, Yes, Yes, No, No, No, No, Yes, Yes, No, Yes, No, No,…
$ das_28 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2.44…
$ das28_yn <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, Yes,…
$ steroids_gt_5 <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, No, Yes, No, No,…
$ dmar_ds <fct> Yes, Yes, Yes, Yes, No, No, Yes, No, No, Yes, Yes, No, Y…
$ biologics <fct> No, No, Yes, No, No, No, Yes, No, Yes, No, No, No, No, Y…
$ s_dmards <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, …
$ osteop_screen <fct> No, Yes, Yes, Yes, No, No, No, Yes, Yes, Yes, Yes, Yes, …
1d. What are the unique values that the variables age_gp
, and osteop_screen
can take? Hint: you can use the function unique()
to figure this out.
unique(arthritis$age_gp)
[1] elderly control
Levels: control elderly
unique(arthritis$osteop_screen)
[1] No Yes <NA>
Levels: No Yes
Question 2
Now we are going to investigate using some numerical and graphical summaries of our data.
2a. Use summary(arthitis)
to view quick summaries of each of the variables. What are some observations about the distribution of the data or missing data?
summary(arthritis)
id age age_gp sex yrs_from_dx
Min. : 1.0 Min. :42.00 control:459 Female:428 Min. : 1.0
1st Qu.:162.2 1st Qu.:54.00 elderly: 71 Male :102 1st Qu.: 3.0
Median :294.5 Median :59.00 Median : 7.0
Mean :290.6 Mean :60.69 Mean : 9.4
3rd Qu.:426.8 3rd Qu.:66.00 3rd Qu.:11.0
Max. :559.0 Max. :90.00 Max. :70.0
NA's :15
cdai cdai_yn das_28 das28_yn steroids_gt_5 dmar_ds
Min. : 0.0 No :324 Min. : 0.000 No :464 No :405 No :149
1st Qu.: 6.0 Yes:206 1st Qu.: 1.825 Yes: 66 Yes :124 Yes :380
Median :10.0 Median : 2.500 NA's: 1 NA's: 1
Mean :13.1 Mean : 2.923
3rd Qu.:17.0 3rd Qu.: 3.310
Max. :71.0 Max. :23.000
NA's :324 NA's :464
biologics s_dmards osteop_screen
No :332 No :502 No :216
Yes :197 Yes : 27 Yes :306
NA's: 1 NA's: 1 NA's: 8
cdai
has a lot of missing variables!
2b. To better understand the ages distribution of the patients, it would be helpful to have a visual. Make a one variable plot that helps visualize the distribution of age. What do you observe from the plot?
ggplot(arthritis) + geom_histogram(aes(age),bins=10,col="black",fill="gray")
Less older people!
2c. Create a plot to visualize the counts of females and males in this study. Use add the layer labs(x = "Sex", y = "Count", title = "Study participants by sex")
to specify informative titles. For all plots in this document make sure to manually specify the axis labels instead of using the defaults.
2d. Why would a histogram not be appropriate to visualize the counts of females and males in this study? What is a variable in this data that would be appropriate to visualize with a histogram?
2e. Choose a plot to summarize the distribution of the clinical disease activity indicator. Note the median cdai as well as extreme outliers.
ggplot(arthritis) + geom_histogram(aes(x=cdai),bins=20)
Warning: Removed 324 rows containing non-finite outside the scale range
(`stat_bin()`).
Multivariate plotting
Question 3
Of interest was whether elderly patients were less likely to have disease activity measured and less likely to received aggressive treatment. This means we want to look at one variable grouped by another variable.
3a. Below we use group_by()
and summarize()
to obtain the counts of each age group as well as the counts and percents of people within each group that did not have clinical disease activity measured.
<- arthritis %>%
summary_arthritis group_by(age_gp) %>%
summarize(
total_count = n(),
cdai_NA_count = sum(is.na(cdai)),
cdai_NA_per = 100 * cdai_NA_count / total_count
) summary_arthritis
# A tibble: 2 × 4
age_gp total_count cdai_NA_count cdai_NA_per
<fct> <int> <int> <dbl>
1 control 459 266 58.0
2 elderly 71 58 81.7
Does there appear to be an equal amount of cdai
missingness between the two groups? Did you use the counts or percentages to conclude this and why? Do you think this could lead to problems in the analysis?
Data appears to be not missing at random!
3b. Let’s plot this missingness. We want to plot relative cdai_yn
for each age_gp
. Chose an appropriate way to visualize this.
ggplot(summary_arthritis) + geom_col(aes(y=cdai_NA_per,x= age_gp)) + theme_bw()
Question 4
Now we want to investigate the relationship between years since diagnosis and clinical disease activity indicator.
4a. Create a plot to visualize the relationship between these two variables. What information does your plot provide you with?
ggplot(arthritis) + geom_point(aes(x=yrs_from_dx, y = cdai))
Warning: Removed 325 rows containing missing values or values outside the scale range
(`geom_point()`).
4b. Now we want to learn if the relationship differs for patients that were and were not on steroids at more than 5 mg daily. Create a plot to aid in this investigation and state your findings.
# two possible solutions
# facet wrap (maybe harder to see?)
ggplot(arthritis) + geom_point(aes(x=yrs_from_dx, y = cdai)) + facet_wrap(~steroids_gt_5)
Warning: Removed 325 rows containing missing values or values outside the scale range
(`geom_point()`).
# different colors
ggplot(arthritis) + geom_point(aes(x=yrs_from_dx, y = cdai,color = steroids_gt_5))
Warning: Removed 325 rows containing missing values or values outside the scale range
(`geom_point()`).
Question 5
Visualizing data is an art and there is not necessarily one perfect way. Compare your answers with you neighbors and debate your plotting choices.
Extra challenge: Spend some time looking up plotting options and making your plots more visually appealing (e.g. changing the theme, colors, font size, legend position, etc). Look up how to change your legend title and apply it to any of your plots with legends.
Question 6: Practicing file reading
We have seen how to read csv files and Excel files in lecture. But there is unfortunately many many types of files! Some examples are tab-separated files (.tsv
or .txt
files), STATA files, SAS files…
In the data
folder there is two files to allow you to practice reading data in multiple formats.
pizzadata.tsv
(Tab Separated File)RheumArth_Tx_AgeComparisons.dta
(STATA file)
<- read_tsv("data/pizzadata.tsv") pizza
Rows: 48620 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (5): pizza_name_id, order_date, pizza_size, pizza_category, pizza_name
dbl (5): pizza_id, order_id, quantity, unit_price, total_price
time (1): order_time
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(haven)
<- read_dta("data/RheumArth_Tx_AgeComparisons.dta") arth_stat
Since many different files can be available and it is not possible to memorize the functions for each of them, try to use Google or tidyverse documentation to find the best function to read these files into R.
Once you are done, can you answer this question?
Q1: How many unique pizza_name
values are there? Of the pizzas sold, which one is your favorite?
unique(pizza$pizza_name)
[1] "The Hawaiian Pizza"
[2] "The Classic Deluxe Pizza"
[3] "The Five Cheese Pizza"
[4] "The Italian Supreme Pizza"
[5] "The Mexicana Pizza"
[6] "The Thai Chicken Pizza"
[7] "The Prosciutto and Arugula Pizza"
[8] "The Barbecue Chicken Pizza"
[9] "The Greek Pizza"
[10] "The Spinach Supreme Pizza"
[11] "The Green Garden Pizza"
[12] "The Italian Capocollo Pizza"
[13] "The Spicy Italian Pizza"
[14] "The Spinach Pesto Pizza"
[15] "The Vegetables + Vegetables Pizza"
[16] "The Southwest Chicken Pizza"
[17] "The California Chicken Pizza"
[18] "The Pepperoni Pizza"
[19] "The Chicken Pesto Pizza"
[20] "The Big Meat Pizza"
[21] "The Soppressata Pizza"
[22] "The Four Cheese Pizza"
[23] "The Napolitana Pizza"
[24] "The Calabrese Pizza"
[25] "The Italian Vegetables Pizza"
[26] "The Mediterranean Pizza"
[27] "The Pepper Salami Pizza"
[28] "The Spinach and Feta Pizza"
[29] "The Sicilian Pizza"
[30] "The Chicken Alfredo Pizza"
[31] "The Pepperoni, Mushroom, and Peppers Pizza"
[32] "The Brie Carre Pizza"
Q2: Extra challenge Can you figure out how what is the name of the pizza with the most sales (More on this on day 3)?
%>% group_by(pizza_name) %>% summarize(sold = n()) %>% arrange(desc(sold)) %>% head(5) pizza
# A tibble: 5 × 2
pizza_name sold
<chr> <int>
1 The Classic Deluxe Pizza 2416
2 The Barbecue Chicken Pizza 2372
3 The Hawaiian Pizza 2370
4 The Pepperoni Pizza 2369
5 The Thai Chicken Pizza 2315
Classic Deluxe Pizza!
Once you have finished this lab you should save and commit your changes to Git, then pull and push to GitHub.