Basics of Data Visualization

ISI-BUDS 2025

Different Representations of Data

We can represent data using formats such as the following:

  • visual
  • text
  • sound
  • tactile

Today we will cover data represented visually, but throughout the week we will cover other data representations.

Accessibility

Data visualization is perhaps the most commonly used format for representing data.

Data visualization can convey a lot about data; however, visualizations are not accessible to everyone. For instance, they are not accessible to people who are blind or visually impaired.

Different modes (e.g., sound) of representation are especially important for making the data representation accessible to all.

Data Visualization

Examples

How Common Is Your Birthday?

One Dataset Visualized 25 Ways

Mandatory Paid Vacation

Why are K-pop groups so big?

We will only scratch the surface of data visualization in this class. It is a rich field, and some of you may consider a career in data visualization.

Data Visualizations

  • are graphical representations of data

  • use different colors, shapes, and the coordinate system to summarize data

  • can tell a story or can be useful for exploring data

Data

library(openintro)  # provides the babies data frame
library(tidyverse)  # provides glimpse() and ggplot()
glimpse(babies)
Rows: 1,236
Columns: 8
$ case      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ bwt       <int> 120, 113, 128, 123, 108, 136, 138, 132, 120, 143, 140, 144, …
$ gestation <int> 284, 282, 279, NA, 282, 286, 244, 245, 289, 299, 351, 282, 2…
$ parity    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ age       <int> 27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, …
$ height    <int> 62, 64, 64, 69, 67, 62, 62, 65, 62, 66, 68, 64, 63, 61, 63, …
$ weight    <int> 100, 135, 115, 190, 125, 93, 178, 140, 125, 136, 120, 124, 1…
$ smoke     <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,…

?babies

case id number

bwt birthweight, in ounces

gestation length of gestation, in days

parity binary indicator for a first pregnancy (0 = first pregnancy)

age mother’s age in years

height mother’s height in inches

weight mother’s weight in pounds

smoke binary indicator for whether the mother smokes

Bar plot

  • When can we use a bar plot?
  • What does this bar plot convey?

Bar plot

ggplot(babies)

Bar plot

ggplot(babies, aes(x = smoke)) 

Bar plot

ggplot(babies, aes(x = smoke)) +
  geom_bar()

Histogram

  • When can we use a histogram?
  • What does this histogram convey?

Histogram

ggplot(babies)

Histogram

ggplot(babies, aes(x = bwt))

Histogram

ggplot(babies, aes(x = bwt)) +
  geom_histogram()

Binwidth

ggplot(babies, aes(x = bwt)) +
  geom_histogram(binwidth = 3)

Histogram

Consider the height distribution in our class.

  • How would the distribution change if Michael Jordan (198.1 cm, 6’ 6’’) were to join our class?

  • How would the distribution change if Tyrion Lannister (Peter Dinklage) (135 cm, 4’ 5’’) were to join our class?

Think 💭 - Pair 👫🏽 - Share 💬

  • In right-skewed distributions mean > median, true or false?

  • In left-skewed distributions mean > median, true or false?

When data display a skewed distribution, we rely on the median rather than the mean to describe the center of the distribution.
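
To see why, here is a minimal sketch using a made-up set of class heights. The values in heights_cm are hypothetical, not real class data. Adding one very tall or very short person pulls the mean further than it pulls the median.

```r
# Hypothetical class heights in cm (illustrative values only)
heights_cm <- c(160, 165, 168, 170, 172, 175, 178)

# Add one unusually tall person (198.1 cm) or one unusually short person (135 cm)
with_tall  <- c(heights_cm, 198.1)
with_short <- c(heights_cm, 135)

# Compare the mean and median of each version of the class
shift_summary <- data.frame(
  class  = c("original", "with_tall", "with_short"),
  mean   = c(mean(heights_cm), mean(with_tall), mean(with_short)),
  median = c(median(heights_cm), median(with_tall), median(with_short))
)
shift_summary
```

In this toy example the tall newcomer leaves the mean above the median (right skew), and the short newcomer leaves the mean below the median (left skew).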

More on Histograms

There is no “best” number of bins
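
One way to get a feel for this is to draw the same histogram with different numbers of bins. The sketch below uses the bins argument of geom_histogram(); the values 10 and 60 are arbitrary choices for illustration, not recommendations.

```r
library(ggplot2)
library(openintro)  # provides the babies data frame

# Few, wide bins: the overall shape is clear but detail is smoothed over
p_few <- ggplot(babies, aes(x = bwt)) +
  geom_histogram(bins = 10)

# Many, narrow bins: more detail, but also more noise
p_many <- ggplot(babies, aes(x = bwt)) +
  geom_histogram(bins = 60)

p_few
p_many
```

Neither plot is wrong. Trying a few settings and comparing them is usually more informative than searching for a single correct value.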

Exploring Histograms Visually

Take a look at these for fun.

Looking at Relationships

So far we have seen bar plots and histograms, which are useful for visualizing categorical and numerical variables, respectively.

We are often interested in looking at relationships between two variables. We have statistical tests to examine such relationships. However, visualizations can often help us explore if such relationships are worth looking into.

Standardized Bar Plots

ggplot(data = babies,
       aes(x = smoke, 
           fill = parity)) + 
  geom_bar(position = "fill")

Note that the y-axis is still labeled "count", even though the bars now show proportions. We will learn how to change the axis labels in the next lecture.
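
The proportions in a standardized bar plot can also be computed directly. The following is a rough sketch using base R's table() and prop.table(). It is not how ggplot2 computes the bars internally, but it should give the same within-group proportions for the rows without missing values.

```r
library(openintro)  # provides the babies data frame

# Cross-tabulate smoking status (rows) by parity (columns);
# table() drops missing values by default
counts <- table(smoke = babies$smoke, parity = babies$parity)

# margin = 1 divides each row by its row total,
# so the proportions within each smoke group sum to 1
props <- prop.table(counts, margin = 1)
props
```

Each row of props corresponds to one bar in the plot, and the entries in that row correspond to the colored segments of that bar.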

Dodged Bar Plot

ggplot(data = babies,
       aes(x = smoke, 
           fill = parity)) + 
  geom_bar(position = "dodge")

Side-by-Side Boxplots

ggplot(babies,
       aes(x = smoke,
           y = bwt))  +
  geom_boxplot() 

Scatter plots

ggplot(babies,
       aes(x = gestation,
           y = bwt))  +
  geom_point()

Length of gestation can possibly eXplain a baby’s birth weight. Gestation is the eXplanatory variable and is shown on the x-axis. Birth weight is the response variable and is shown on the y-axis.

Linear Relationship

Later on we will start statistical modeling, during which we will define the relationship between gestation and birth weight numerically. For now, we can say that this relationship looks positive and moderate.
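
As a small preview, one common way to put a number on the direction and strength of a linear relationship is the correlation coefficient. This is only a sketch and not necessarily the approach the modeling lectures will take. It uses base R's cor() and drops rows where either variable is missing.

```r
library(openintro)  # provides the babies data frame

# use = "complete.obs" keeps only rows where both variables are observed
r <- cor(babies$gestation, babies$bwt, use = "complete.obs")
r
```

A value above 0 indicates a positive relationship, and values closer to 1 indicate a stronger linear relationship.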

Meet Palmer Penguins

Data

library(palmerpenguins)
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Visualizing Three Variables

ggplot(penguins, 
       aes(x = body_mass_g, 
           y = bill_length_mm,
           color = species)) +
  geom_point()

Code Style

The tidyverse style guide has the following convention for writing ggplot2 code.

The plus sign (+) for adding layers always has a space before it and is followed by a new line.

The new line is indented by two spaces. RStudio does this automatically for you.
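
As an illustration of the convention, the two plots below are identical and only the formatting of the code differs. The object names p_styled and p_unstyled are made up for this example.

```r
library(ggplot2)
library(palmerpenguins)  # provides the penguins data frame

# Follows the convention: a space before +, then a new line indented by two spaces
p_styled <- ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm)) +
  geom_point()

# Runs the same, but is harder to read: no space before + and everything on one line
p_unstyled <- ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm))+geom_point()
```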

Labeling Axes

ggplot(penguins,
       aes(x = bill_depth_mm,
           y = bill_length_mm,
           color = species)) +
  geom_point() +
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)",
       color = "Species",
       title = "Palmer Penguins") 

We can change the axis labels, the legend title, and the plot title using the labs() function.
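
labs() also accepts other label arguments, such as subtitle and caption. The sketch below adds both; the subtitle and caption wording is made up for illustration.

```r
library(ggplot2)
library(palmerpenguins)  # provides the penguins data frame

p_labeled <- ggplot(penguins,
                    aes(x = bill_depth_mm,
                        y = bill_length_mm,
                        color = species)) +
  geom_point() +
  labs(x = "Bill Depth (mm)",
       y = "Bill Length (mm)",
       color = "Species",
       title = "Palmer Penguins",
       subtitle = "Bill length versus bill depth, by species",  # shown under the title
       caption = "Data: palmerpenguins package")                # shown below the plot

p_labeled
```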

Themes

ggplot(penguins,
       aes(x = bill_depth_mm,
           y = bill_length_mm,
           color = species)) +
  geom_point() +
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)", 
       title = "Palmer Penguins") +
  theme_gray()

theme_gray() is the default theme in ggplot2.

ggplot(penguins,
       aes(x = bill_depth_mm,
           y = bill_length_mm,
           color = species)) +
  geom_point() +
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)", 
       title = "Palmer Penguins") +
  theme_bw()

ggplot(penguins,
       aes(x = bill_depth_mm,
           y = bill_length_mm,
           color = species)) +
  geom_point() +
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)", 
       title = "Palmer Penguins") +
  theme_dark()

ggplot(penguins,
       aes(x = bill_depth_mm,
           y = bill_length_mm,
           color = species)) +
  geom_point() +
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)", 
       title = "Palmer Penguins") +
  theme_classic()

ggplot(penguins,
       aes(x = bill_depth_mm,
           y = bill_length_mm,
           color = species)) +
  geom_point() +
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)", 
       title = "Palmer Penguins") +
  theme_minimal()

Font Size

ggplot(penguins,
       aes(x = bill_depth_mm,
           y = bill_length_mm,
           color = species)) +
  geom_point() +
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)", 
       title = "Palmer Penguins") +
  theme(text = element_text(size = 20))

The theme() function allows many components of a theme to be modified. By typing ?theme in the Console, you can read the documentation of the function to see which components can be modified.
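
For example, instead of changing all text at once, individual components can be targeted. The sketch below changes only the axis titles and moves the legend; the size of 16 and the bottom position are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)  # provides the penguins data frame

p_themed <- ggplot(penguins,
                   aes(x = bill_depth_mm,
                       y = bill_length_mm,
                       color = species)) +
  geom_point() +
  theme(axis.title = element_text(size = 16),  # axis titles only, not tick labels
        legend.position = "bottom")            # place the legend below the plot

p_themed
```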

Font Size

One can also set the default font size of the theme. For instance, if you run the following code in the first chunk of a Quarto document, all plots will use the gray theme with a base font size of 22.

theme_set(theme_gray(base_size = 22))
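
theme_set() returns the previously active theme invisibly, so the earlier default can be saved and restored later. A minimal sketch, where old_theme is just a name chosen for this example:

```r
library(ggplot2)

# Save the previous default while switching to a larger base font size
old_theme <- theme_set(theme_gray(base_size = 22))

# ... plots created here use the gray theme with base font size 22 ...

# Restore the earlier default when done
theme_set(old_theme)
```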

Using Shapes in Addition to Colors

ggplot(penguins,
       aes(x = bill_depth_mm,
           y = bill_length_mm,
           shape = species,
           color = species)) +
  geom_point(size = 4) 

Previously, the species could only be told apart by someone who could distinguish these colors. With shapes added, color-blind viewers can also distinguish the species.

Practice

Using the penguins data frame, ask a question that you are interested in answering. Visualize the data to get a visual answer to the question. What is the visual telling you? Note all of this down in your lecture notes.