Visualizing Data with ggplot

ISI-BUDS 2023

Data Visualizations

  • are graphical representations of data

  • use different colors, shapes, and the coordinate system to summarize data

  • tell a story

  • are useful for exploring data

Data

head(titanic)
  survived pclass    sex age    fare    embarked
1    FALSE  Third   male  22  7.2500 Southampton
2     TRUE  First female  38 71.2833   Cherbourg
3     TRUE  Third female  26  7.9250 Southampton
4     TRUE  First female  35 53.1000 Southampton
5    FALSE  Third   male  35  8.0500 Southampton
6    FALSE  Third   male  NA  8.4583  Queenstown

Data

tail(titanic)
    survived pclass    sex age   fare    embarked
886    FALSE  Third female  39 29.125  Queenstown
887    FALSE Second   male  27 13.000 Southampton
888     TRUE  First female  19 30.000 Southampton
889    FALSE  Third female  NA 23.450 Southampton
890     TRUE  First   male  26 30.000   Cherbourg
891    FALSE  Third   male  32  7.750  Queenstown

Data

glimpse(titanic)
Rows: 891
Columns: 6
$ survived <lgl> FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TR…
$ pclass   <chr> "Third", "First", "Third", "First", "Third", "Third", "First"…
$ sex      <fct> male, female, female, female, male, male, male, male, female,…
$ age      <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55,…
$ fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
$ embarked <fct> Southampton, Cherbourg, Southampton, Southampton, Southampton…

Visuals with a Single Categorical Variable

If you could speak to R in English, how would you tell R to make this plot for you?

OR

If you had the data and had to draw this bar plot by hand, what would you do?

3 Steps of Making a Basic ggplot

1. Pick data

2. Map data onto aesthetics

3. Add the geometric layer

Step 1 - Pick Data

ggplot(data = titanic)

Step 2 - Map Data to Aesthetics

ggplot(data = titanic,
       aes(x = pclass)) 

Step 3 - Add the Geometric Layer

ggplot(data = titanic,
       aes(x = pclass)) +
  geom_bar() 

Create a ggplot using the titanic data frame.

Map the pclass to the x-axis.

Add a layer of a bar plot.

ggplot(data = titanic,
       aes(x = pclass)) +
  geom_bar()
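
The three steps can also be built up one object at a time, which makes it easier to see what each step contributes. This is only a sketch: `titanic_small` is a stand-in that holds just the 12 `pclass` values printed by `head(titanic)` and `tail(titanic)` above, and the names `step1`, `step2`, and `step3` are made up for illustration.

```r
library(ggplot2)

# Stand-in for titanic: only the 12 rows shown by head() and tail() above,
# so the bar heights will not match the full data.
titanic_small <- data.frame(
  pclass = c("Third", "First", "Third", "First", "Third", "Third",
             "Third", "Second", "First", "Third", "First", "Third")
)

# Step 1: pick data (an empty canvas, nothing is mapped yet)
step1 <- ggplot(data = titanic_small)

# Step 2: map pclass to the x-axis (the axis appears, still no bars)
step2 <- step1 + aes(x = pclass)

# Step 3: add the geometric layer (now the bars are drawn)
step3 <- step2 + geom_bar()
```

Typing `step1`, `step2`, or `step3` at the console draws the plot as it stands after that step.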

Visualizing a Single Numeric Variable

Create a ggplot using the titanic data frame.

Map the fare to the x-axis.

Add a layer of a histogram.

ggplot(data = titanic,
       aes(x = fare)) +
  geom_histogram() 

Create a ggplot using the titanic data frame.

Map the fare to the x-axis.

Add a layer of a density.

ggplot(data = titanic,
       aes(x = fare)) +
  geom_density() 

Etymology

histo comes from the Greek word histos that literally means “anything set up right”.

gram comes from the Greek word gramma, which means “that which is drawn”.

[Online Etymology Dictionary]

Histogram Bins

ggplot(data = titanic,
       aes(x = fare)) +
  geom_histogram(binwidth = 15) 

There is no “best” number of bins
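
One way to see this is to draw the same variable with different bin widths and compare. The sketch below assumes a stand-in data frame, `titanic_small`, holding only the 12 fares printed by `head(titanic)` and `tail(titanic)` above; the object names are made up for illustration. `layer_data()` returns the bins that ggplot2 computed, one row per bin.

```r
library(ggplot2)

# Stand-in for titanic: only the 12 fares shown by head() and tail() above.
titanic_small <- data.frame(
  fare = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583,
           29.125, 13, 30, 23.45, 30, 7.75)
)

# Narrow bins: more detail, but a bumpier shape
narrow <- ggplot(data = titanic_small,
                 aes(x = fare)) +
  geom_histogram(binwidth = 5)

# Wide bins: smoother shape, but detail is lost
wide <- ggplot(data = titanic_small,
               aes(x = fare)) +
  geom_histogram(binwidth = 30)

# Alternatively, ask for a number of bins instead of a width
by_count <- ggplot(data = titanic_small,
                   aes(x = fare)) +
  geom_histogram(bins = 10)

# Compare how many bins each choice produced
nrow(layer_data(narrow))
nrow(layer_data(wide))
```

Neither choice is wrong. Try a few values and keep the one that shows the shape of the data without hiding or inventing features.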

🌈

Pick your favorite color(s) from the list at:

bit.ly/colors-r

or you can use hex codes.

ggplot(data = titanic,
       aes(x = fare)) +
  geom_histogram(binwidth = 15,
                 color = "white")

ggplot(data = titanic,
       aes(x = fare)) +
  geom_histogram(binwidth = 15,
                 fill = "darkred")

ggplot(data = titanic,
       aes(x = fare)) +
  geom_histogram(binwidth = 15,
                 color = "white",
                 fill = "darkred")
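
Hex codes go in the same place as color names. As a sketch, the plot below repeats the last histogram with hex codes: `#FFFFFF` is white and `#8B0000` is the hex code for R's `"darkred"`. The data frame `titanic_small` is a stand-in holding only the 12 fares printed by `head(titanic)` and `tail(titanic)` above, and `hex_hist` is a made-up name.

```r
library(ggplot2)

# Stand-in for titanic: only the 12 fares shown by head() and tail() above.
titanic_small <- data.frame(
  fare = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583,
           29.125, 13, 30, 23.45, 30, 7.75)
)

hex_hist <- ggplot(data = titanic_small,
                   aes(x = fare)) +
  geom_histogram(binwidth = 15,
                 color = "#FFFFFF",  # outline of the bars: white
                 fill = "#8B0000")   # inside of the bars: darkred
```

To look up the hex code of a named color, `col2rgb("darkred")` gives its red, green, and blue values.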

Visualizing Two Categorical Variables

Stacked Bar Plot

ggplot(data = titanic,
       aes(x = pclass,
           fill = survived)) +
  geom_bar()

Standardized Bar Plot

ggplot(data = titanic,
       aes(x = pclass, 
           fill = survived)) + 
  geom_bar(position = "fill") 

Note that the y-axis now shows proportions rather than counts, although it is still labeled count. We will learn how to change the label later.
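
To see what `position = "fill"` is doing, the proportions can be computed by hand: within each class, the counts are divided by the class total so that every bar has height 1. The sketch below assumes a stand-in data frame, `titanic_small`, holding only the 12 rows printed by `head(titanic)` and `tail(titanic)` above, so the proportions are not those of the full data. Relabeling the axis with `labs()` is one possible approach, not necessarily the one used later in the course.

```r
library(ggplot2)

# Stand-in for titanic: only the 12 rows shown by head() and tail() above.
titanic_small <- data.frame(
  pclass = c("Third", "First", "Third", "First", "Third", "Third",
             "Third", "Second", "First", "Third", "First", "Third"),
  survived = c(FALSE, TRUE, TRUE, TRUE, FALSE, FALSE,
               FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

# Counts of survived within each class (what the stacked bar plot shows)
counts <- table(titanic_small$pclass, titanic_small$survived)

# Proportions within each class (what position = "fill" shows);
# margin = 1 divides each row by its row total
props <- prop.table(counts, margin = 1)
props

# Same standardized bar plot, with a y-axis label that matches the values
standardized <- ggplot(data = titanic_small,
                       aes(x = pclass,
                           fill = survived)) +
  geom_bar(position = "fill") +
  labs(y = "proportion")
```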

New Data

Artwork by @allison_horst

New Data

glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…


Visualizing a Single Numerical and a Single Categorical Variable

ggplot(penguins,
       aes(x = species,
           y = bill_length_mm)) +
  geom_violin()

Visualizing a Single Numerical and a Single Categorical Variable

ggplot(penguins,
       aes(x = species,
           y = bill_length_mm)) +
  geom_boxplot()

Visualizing Two Numerical Variables

ggplot(penguins,
       aes(x = bill_depth_mm,
           y = bill_length_mm)) +
  geom_point()

Considering More Than Two Variables

ggplot(penguins,
       aes(x = bill_depth_mm,
           y = bill_length_mm,
           color = species)) +
  geom_point()

ggplot(penguins,
       aes(x = bill_depth_mm,
           y = bill_length_mm,
           shape = species)) +
  geom_point()

ggplot(penguins,
       aes(x = bill_depth_mm,
           y = bill_length_mm,
           shape = species,
           color = species)) +
  geom_point()

ggplot(penguins,
       aes(x = bill_depth_mm,
           y = bill_length_mm,
           shape = species,
           color = species,
           size = body_mass_g)) +
  geom_point()
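
The plots above put `color = species` inside `aes()`, while the earlier histograms put `color = "white"` outside it. A sketch of the difference follows. It assumes a tiny data frame, `toy_penguins`, whose values are invented for illustration and are not real penguin measurements; the names `mapped` and `fixed` are also made up.

```r
library(ggplot2)

# Invented stand-in values for illustration only; NOT real measurements.
toy_penguins <- data.frame(
  species = c("Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"),
  bill_depth_mm = c(18, 19, 18, 19, 15, 16),
  bill_length_mm = c(38, 40, 48, 50, 46, 48)
)

# Mapping: the color depends on a variable, so it goes inside aes()
# and ggplot2 adds a legend
mapped <- ggplot(toy_penguins,
                 aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     color = species)) +
  geom_point()

# Setting: one fixed color for every point, so it goes outside aes()
# and there is no legend
fixed <- ggplot(toy_penguins,
                aes(x = bill_depth_mm,
                    y = bill_length_mm)) +
  geom_point(color = "darkred")
```

If the color should tell you something about the data, map it. If it is only decoration, set it.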

tidyverse style guide

+ should always have a space before it and be followed by a new line.
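
As a sketch of the rule, the two plots below are identical, but the first follows the style guide and the second does not. The data frame `titanic_small` is a stand-in holding only the 12 fares printed by `head(titanic)` and `tail(titanic)` earlier, and the names `good` and `cramped` are made up for illustration.

```r
library(ggplot2)

# Stand-in for titanic: only the 12 fares shown by head() and tail() earlier.
titanic_small <- data.frame(
  fare = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583,
           29.125, 13, 30, 23.45, 30, 7.75)
)

# Follows the style guide: a space before +, then a new line,
# with the next layer indented
good <- ggplot(data = titanic_small,
               aes(x = fare)) +
  geom_histogram(binwidth = 15)

# Runs, but is harder to read: no space before +, all on one line
cramped <- ggplot(data = titanic_small, aes(x = fare))+geom_histogram(binwidth = 15)
```

The `+` has to end a line, never start one. If a line ends without `+`, R treats the plot as finished and the next layer is not added.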

Further Resources

Practice

Using either the babies, titanic, or penguins data frame, ask a question that you are interested in answering. Visualize the data to get a visual answer to the question. What is the visual telling you? Note all of this down in your lecture notes.