Changing Variables

ISI-BUDS 2023

Review

How many panes have you seen in RStudio and what is the purpose of each pane?

Which of the following files is a markdown file?

  1. example.R
  2. example.md
  3. example.Rmd

Which of the following is a valid order of actions when starting a project using git and GitHub?

  1. clone, commit, push

  2. push, commit, clone

  3. commit, clone, push

  4. clone, push, commit

Which R functions have we learned together?

What is the formula for variance?

You are given a data frame called registrar. You are interested in two variables: class_year, which records whether a student is a first year, sophomore, junior, or senior, and gpa, which records GPA.

How would you find the average GPA for each class year?

Changing Variables

Download and read in the Arthritis data in Excel format.

Edward Gracely, “Arthritis Treatment Dataset”, TSHS Resources Portal (2020)
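A minimal sketch of the read-in step, assuming the readxl package is installed and the file was saved as data/arthritis.xlsx. That path is hypothetical, so adjust it to wherever your download landed.

```r
library(readxl)
library(dplyr)

# Hypothetical location of the downloaded Excel file (an assumption).
arthritis_path <- "data/arthritis.xlsx"

# Only read the file if it is actually there, so the sketch runs anywhere.
if (file.exists(arthritis_path)) {
  arthritis <- read_excel(arthritis_path)
  glimpse(arthritis)
}
```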

glimpse(arthritis)
Rows: 530
Columns: 14
$ ID            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ Age           <dbl> 85, 86, 83, 83, 85, 79, 90, 90, 87, 82, 77, 86, 84, 76, …
$ AgeGp         <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ Sex           <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Yrs_From_Dx   <dbl> 27, 27, 10, 9, NA, NA, 51, 11, 36, 4, 31, NA, 9, 10, 3, …
$ CDAI          <dbl> NA, 23.0, 14.5, NA, NA, NA, NA, 40.0, 6.0, NA, 0.0, NA, …
$ CDAI_YN       <dbl> 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,…
$ DAS_28        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2.44…
$ DAS28_YN      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,…
$ Steroids_GT_5 <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ DMARDs        <dbl> 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1,…
$ Biologics     <dbl> 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ sDMARDS       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ OsteopScreen  <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

clean_names() from the janitor package converts variable names to a tidy, snake_case style.

arthritis <- clean_names(arthritis)
glimpse(arthritis)
Rows: 530
Columns: 14
$ id            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ age           <dbl> 85, 86, 83, 83, 85, 79, 90, 90, 87, 82, 77, 86, 84, 76, …
$ age_gp        <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ sex           <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ yrs_from_dx   <dbl> 27, 27, 10, 9, NA, NA, 51, 11, 36, 4, 31, NA, 9, 10, 3, …
$ cdai          <dbl> NA, 23.0, 14.5, NA, NA, NA, NA, 40.0, 6.0, NA, 0.0, NA, …
$ cdai_yn       <dbl> 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,…
$ das_28        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2.44…
$ das28_yn      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,…
$ steroids_gt_5 <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ dmar_ds       <dbl> 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1,…
$ biologics     <dbl> 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ s_dmards      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ osteop_screen <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
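To see the renaming in isolation, here is a small sketch on a made-up data frame. The name toy and its values are illustrative only; the column names are copied from three of the original arthritis columns.

```r
library(janitor)

# Made-up data frame; only the column names come from the arthritis data.
toy <- data.frame(ID = 1:2, AgeGp = c(2, 2), Yrs_From_Dx = c(27, 10))

# Names before cleaning
names(toy)

# clean_names() returns a copy with renamed columns; the values are untouched.
toy_clean <- clean_names(toy)
names(toy_clean)
```

The cleaned names should match the corresponding lines of the glimpse() output above: id, age_gp, and yrs_from_dx.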

Goal:

Create a new variable called age_months that represents age in months.

arthritis %>% 
  mutate(age_months = age*12)
# A tibble: 530 × 15
      id   age age_gp   sex yrs_from_dx  cdai cdai_yn das_28 das28_yn
   <dbl> <dbl>  <dbl> <dbl>       <dbl> <dbl>   <dbl>  <dbl>    <dbl>
 1     1    85      2     0          27  NA         1     NA        1
 2     2    86      2     0          27  23         2     NA        1
 3     3    83      2     0          10  14.5       2     NA        1
 4     4    83      2     0           9  NA         1     NA        1
 5     5    85      2     0          NA  NA         1     NA        1
 6     6    79      2     1          NA  NA         1     NA        1
 7     7    90      2     0          51  NA         1     NA        1
 8     8    90      2     0          11  40         2     NA        1
 9     9    87      2     0          36   6         2     NA        1
10    10    82      2     0           4  NA         1     NA        1
# ℹ 520 more rows
# ℹ 6 more variables: steroids_gt_5 <dbl>, dmar_ds <dbl>, biologics <dbl>,
#   s_dmards <dbl>, osteop_screen <dbl>, age_months <dbl>

glimpse(arthritis)
Rows: 530
Columns: 14
$ id            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ age           <dbl> 85, 86, 83, 83, 85, 79, 90, 90, 87, 82, 77, 86, 84, 76, …
$ age_gp        <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ sex           <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ yrs_from_dx   <dbl> 27, 27, 10, 9, NA, NA, 51, 11, 36, 4, 31, NA, 9, 10, 3, …
$ cdai          <dbl> NA, 23.0, 14.5, NA, NA, NA, NA, 40.0, 6.0, NA, 0.0, NA, …
$ cdai_yn       <dbl> 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,…
$ das_28        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2.44…
$ das28_yn      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,…
$ steroids_gt_5 <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ dmar_ds       <dbl> 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1,…
$ biologics     <dbl> 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ s_dmards      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ osteop_screen <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
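The glimpse() output above still reports 14 columns because mutate() returns a new data frame rather than modifying arthritis in place. A small sketch with made-up data (toy and toy_months are illustrative names) shows the difference between printing and saving:

```r
library(dplyr)

# Made-up stand-in for the arthritis data.
toy <- tibble(id = 1:3, age = c(85, 86, 83))

# Without assignment the result is printed, but toy itself is unchanged.
toy %>%
  mutate(age_months = age * 12)
ncol(toy)

# Assigning the result keeps the new column.
toy_months <- toy %>%
  mutate(age_months = age * 12)
ncol(toy_months)
```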

Goal:

Create a new variable called das_level which has the following categories based on das_28 scores.

remission < 2.6
2.6 \(\leq\) low disease activity \(\leq\) 3.2
3.2 < moderate disease activity \(\leq\) 5.1
high disease activity > 5.1

arthritis %>% 
  mutate(das_level = case_when(
    das_28 < 2.6 ~ "remission", 
    das_28 >= 2.6 & das_28 <= 3.2 ~ "low disease activity",
    das_28 > 3.2 & das_28 <= 5.1 ~ "moderate disease activity",
    das_28 > 5.1 ~ "high disease activity")) 
# A tibble: 530 × 15
      id   age age_gp   sex yrs_from_dx  cdai cdai_yn das_28 das28_yn
   <dbl> <dbl>  <dbl> <dbl>       <dbl> <dbl>   <dbl>  <dbl>    <dbl>
 1     1    85      2     0          27  NA         1     NA        1
 2     2    86      2     0          27  23         2     NA        1
 3     3    83      2     0          10  14.5       2     NA        1
 4     4    83      2     0           9  NA         1     NA        1
 5     5    85      2     0          NA  NA         1     NA        1
 6     6    79      2     1          NA  NA         1     NA        1
 7     7    90      2     0          51  NA         1     NA        1
 8     8    90      2     0          11  40         2     NA        1
 9     9    87      2     0          36   6         2     NA        1
10    10    82      2     0           4  NA         1     NA        1
# ℹ 520 more rows
# ℹ 6 more variables: steroids_gt_5 <dbl>, dmar_ds <dbl>, biologics <dbl>,
#   s_dmards <dbl>, osteop_screen <dbl>, das_level <chr>
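Since das_level is hidden among the extra columns in the printout above, a sketch on made-up scores can make the cut-offs easier to check. The values below are invented to sit on and around each boundary, plus one missing score.

```r
library(dplyr)

# Invented scores: boundary values and one NA.
toy <- tibble(das_28 = c(2.44, 2.6, 3.2, 3.3, 5.1, 5.2, NA))

toy_levels <- toy %>%
  mutate(das_level = case_when(
    das_28 < 2.6 ~ "remission",
    das_28 >= 2.6 & das_28 <= 3.2 ~ "low disease activity",
    das_28 > 3.2 & das_28 <= 5.1 ~ "moderate disease activity",
    das_28 > 5.1 ~ "high disease activity"))

toy_levels
```

A missing das_28 matches none of the conditions, so its das_level is NA as well. That matters here because many das_28 values in the arthritis data are missing.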

(Some) Variable Types in R

character: takes string values (e.g. a person’s name, address)
integer: whole numbers (e.g. 2L)
double: floating-point numbers (double precision)
numeric: integer or double
factor: categorical variables with different levels
logical: TRUE (1), FALSE (0)
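One way to check these types yourself is with typeof(), class(), and is.numeric() from base R. The values below are arbitrary examples.

```r
# Character
type_chr <- typeof("Ada")                    # "character"

# Integer vs. double: the L suffix asks for an integer
type_int <- typeof(2L)                       # "integer"
type_dbl <- typeof(2)                        # "double"

# Both integers and doubles count as numeric
num_int <- is.numeric(2L)                    # TRUE
num_dbl <- is.numeric(2)                     # TRUE

# Factor: a categorical variable with levels
class_fct <- class(factor(c("low", "high"))) # "factor"

# Logical values behave like 1 and 0 in arithmetic
n_true <- sum(c(TRUE, FALSE, TRUE))          # 2
```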

glimpse(arthritis)
Rows: 530
Columns: 14
$ id            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ age           <dbl> 85, 86, 83, 83, 85, 79, 90, 90, 87, 82, 77, 86, 84, 76, …
$ age_gp        <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ sex           <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ yrs_from_dx   <dbl> 27, 27, 10, 9, NA, NA, 51, 11, 36, 4, 31, NA, 9, 10, 3, …
$ cdai          <dbl> NA, 23.0, 14.5, NA, NA, NA, NA, 40.0, 6.0, NA, 0.0, NA, …
$ cdai_yn       <dbl> 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,…
$ das_28        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2.44…
$ das28_yn      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,…
$ steroids_gt_5 <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ dmar_ds       <dbl> 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1,…
$ biologics     <dbl> 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ s_dmards      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ osteop_screen <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

Goal:

Change das28_yn and age_gp to appropriate variable types.

arthritis %>% 
  mutate(das28_yn = as.factor(das28_yn),
         age_gp = as.factor(age_gp)) 
# A tibble: 530 × 14
      id   age age_gp   sex yrs_from_dx  cdai cdai_yn das_28 das28_yn
   <dbl> <dbl> <fct>  <dbl>       <dbl> <dbl>   <dbl>  <dbl> <fct>   
 1     1    85 2          0          27  NA         1     NA 1       
 2     2    86 2          0          27  23         2     NA 1       
 3     3    83 2          0          10  14.5       2     NA 1       
 4     4    83 2          0           9  NA         1     NA 1       
 5     5    85 2          0          NA  NA         1     NA 1       
 6     6    79 2          1          NA  NA         1     NA 1       
 7     7    90 2          0          51  NA         1     NA 1       
 8     8    90 2          0          11  40         2     NA 1       
 9     9    87 2          0          36   6         2     NA 1       
10    10    82 2          0           4  NA         1     NA 1       
# ℹ 520 more rows
# ℹ 5 more variables: steroids_gt_5 <dbl>, dmar_ds <dbl>, biologics <dbl>,
#   s_dmards <dbl>, osteop_screen <dbl>

as.factor() - converts a vector to a factor
as.numeric() - converts a vector to numeric
as.integer() - converts a vector to integer
as.double() - converts a vector to double
as.character() - converts a vector to character
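One thing worth knowing when chaining these conversions: calling as.numeric() directly on a factor returns the internal level codes, not the original labels. A small sketch with made-up values (codes is an illustrative name):

```r
# Made-up labels that happen to look like numbers.
codes <- as.factor(c("10", "20", "10"))

# Directly converting a factor gives the level codes.
level_codes <- as.numeric(codes)                  # 1 2 1

# Going through character first recovers the labels as numbers.
label_values <- as.numeric(as.character(codes))   # 10 20 10
```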

In your lecture notes, you can combine the changes from this lecture into one long chain of piped code. That’s the beauty of piping!

arthritis <- 
  arthritis %>% 
  clean_names() %>% 
  mutate(das_level = case_when(
    das_28 < 2.6 ~ "remission", 
    das_28 >= 2.6 & das_28 <= 3.2 ~ "low disease activity",
    das_28 > 3.2 & das_28 <= 5.1 ~ "moderate disease activity",
    das_28 > 5.1 ~ "high disease activity")) %>% 
  mutate(das28_yn = as.factor(das28_yn),
         age_gp = as.factor(age_gp)) 

Word of caution

The functions clean_names() and mutate() both take a data frame as their first argument. Even though we do not see it, the pipe passes the data frame from the previous step into each function. When we use these functions without %>%, we have to supply the data frame explicitly.

Data frame is used as the first argument

clean_names(arthritis)
# A tibble: 530 × 14
      id   age age_gp   sex yrs_from_dx  cdai cdai_yn das_28 das28_yn
   <dbl> <dbl>  <dbl> <dbl>       <dbl> <dbl>   <dbl>  <dbl>    <dbl>
 1     1    85      2     0          27  NA         1     NA        1
 2     2    86      2     0          27  23         2     NA        1
 3     3    83      2     0          10  14.5       2     NA        1
 4     4    83      2     0           9  NA         1     NA        1
 5     5    85      2     0          NA  NA         1     NA        1
 6     6    79      2     1          NA  NA         1     NA        1
 7     7    90      2     0          51  NA         1     NA        1
 8     8    90      2     0          11  40         2     NA        1
 9     9    87      2     0          36   6         2     NA        1
10    10    82      2     0           4  NA         1     NA        1
# ℹ 520 more rows
# ℹ 5 more variables: steroids_gt_5 <dbl>, dmar_ds <dbl>, biologics <dbl>,
#   s_dmards <dbl>, osteop_screen <dbl>

Data frame is piped

arthritis %>% 
  clean_names()
# A tibble: 530 × 14
      id   age age_gp   sex yrs_from_dx  cdai cdai_yn das_28 das28_yn
   <dbl> <dbl>  <dbl> <dbl>       <dbl> <dbl>   <dbl>  <dbl>    <dbl>
 1     1    85      2     0          27  NA         1     NA        1
 2     2    86      2     0          27  23         2     NA        1
 3     3    83      2     0          10  14.5       2     NA        1
 4     4    83      2     0           9  NA         1     NA        1
 5     5    85      2     0          NA  NA         1     NA        1
 6     6    79      2     1          NA  NA         1     NA        1
 7     7    90      2     0          51  NA         1     NA        1
 8     8    90      2     0          11  40         2     NA        1
 9     9    87      2     0          36   6         2     NA        1
10    10    82      2     0           4  NA         1     NA        1
# ℹ 520 more rows
# ℹ 5 more variables: steroids_gt_5 <dbl>, dmar_ds <dbl>, biologics <dbl>,
#   s_dmards <dbl>, osteop_screen <dbl>
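The same holds for mutate(). A brief sketch on made-up data (toy, explicit, and piped are illustrative names) compares the two ways of supplying the data frame:

```r
library(dplyr)

# Made-up stand-in for the arthritis data.
toy <- tibble(id = 1:3, age = c(85, 86, 83))

# Data frame given explicitly as the first argument
explicit <- mutate(toy, age_months = age * 12)

# Data frame piped in from the previous step
piped <- toy %>%
  mutate(age_months = age * 12)

# Both calls produce the same result.
all.equal(explicit, piped)
```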