Advanced Data Visualization

ISI-BUDS 2025

ggplot resources

We have barely scratched the surface of ggplot2.

Some examples

An illustration on old paper with circles, possibly representing stars, with Chinese characters attached to them.

North Circumpolar Region from the Dunhuang Star Chart circa 649-684 CE.

Recommended reading

Funkhouser, H. G. (1937). Historical Development of the Graphical Representation of Statistical Data. Osiris, 3, 269–404. Chapter 2 is on The Origin of the Graphic Method.

The title of the plot reads "Change in kindergarten measles vaccination rates." On the x axis the values range from 80% to 100%. Each state has two values represented. For instance, for Idaho the pre-pandemic vaccination rate is around 89%, but the vaccination rate in 2023-2024 is about 80%. There is a line labeled Idaho with an arrow showing the direction from 89% to 80%. Other states, some of them showing increases, as well as the US average are visible.

There are spirals at the edge of a circle going inwards. The outside ring is labeled 1875 and $21.86. Going from the outer rings to the inner rings, the labels are as follows: 1880 $488.532, 1885 $738.170, 1890 $1173.684, 1885 $1322.894, 1888 $1434.975. The length of the rings increases as the dollar amounts increase.

Assessed value of household and kitchen furniture owned by Black people in Georgia.

A non-square grid of wooden sticks with some material stuck to intersections.

20th century navigational chart from Kwajalein Atoll, Marshall Islands, Micronesia, on display at the Bowers Museum in Santa Ana. Photo by Mine Dogucu.

nprscience · Supernova Sonification (Two)

Wanda Díaz-Merced is a Puerto Rican astronomer known for using sonification while studying stars. She is the director of the Arecibo Observatory.

The figure shows a layered wedding cake. The caption says "according to data from the city of Buenos Aires, since 2010 the number of LGBTQ+ marriages per year has almost tripled." At the top of the cake there is a gay couple. Going from the top layer to the bottom layer, each layer is labeled with a year and a number: 2010 and 786, 2014 and 870, 2018 and 1038, 2022 and 1720.

Same-sex marriages in Buenos Aires City by Macarena Zappe

This is a table with the title "Excess mortality since region/country's first 50 covid deaths." Columns of the table include region/country, time period, covid-19 deaths (shown as a bar whose length corresponds to the number of deaths), total excess deaths (shown as a differently colored bar whose length corresponds to the number of deaths), and covid-19 as % of total. The bars go from longest to shortest from the top to the bottom of the table.

COVID related deaths table by the Economist

Mapping

Image source

Image source

Some important principles for data visualization

Avoid deception

Truncated Axis

The title of the plot reads "if Bush tax cuts expire." It is a bar plot. The first bar is labeled "now" at 35% and the next bar is labeled "Jan 1, 2013" at 39.6%. The y axis starts at 34%.

Same as the previous graph except that the y axis goes from 0 to 40%, and thus the bars don't seem that different from one another.

“The principle of proportional ink: The sizes of shaded areas in a visualization need to be proportional to the data values they represent.” (Bergstrom and West, 2016)
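The effect of the truncated axis can be sketched in a few lines. The two rates come from the plot described above; the object names, the axis label, and the use of coord_cartesian() to mimic the truncated axis are assumptions made for illustration, not the code behind the original figures.

```r
library(ggplot2)

# Rates taken from the bar plot described above
tax <- data.frame(
  when = factor(c("Now", "Jan 1, 2013"), levels = c("Now", "Jan 1, 2013")),
  rate = c(35, 39.6)
)

base_plot <- ggplot(tax, aes(x = when, y = rate)) +
  geom_col() +
  labs(x = NULL, y = "Tax rate (%)")

# Truncated axis: zooming in so the axis starts at 34% exaggerates the gap
truncated <- base_plot + coord_cartesian(ylim = c(34, 40))

# Zero baseline: the visible bar heights are proportional to the values
zero_based <- base_plot + coord_cartesian(ylim = c(0, 40))

# How many times taller the second bar looks compared to the first
ratio_truncated <- (39.6 - 34) / (35 - 34)  # visible heights above 34%
ratio_zero      <- 39.6 / 35                # visible heights above 0%
```

With the axis starting at 34%, the second bar is drawn several times taller than the first, even though the underlying rates differ by less than five percentage points.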

Aspect ratio

Three plots with the same data. The x-axis always has year and the y-axis always has life expectancy. The plots are labeled aspect ratio 1:2, aspect ratio 1:1, and aspect ratio 2:1. In the first plot the x axis is double the length of the y axis. In the third plot the y axis is double the length of the x axis. Thus in the first plot the trend can be perceived to have a low positive slope, whereas in the third plot the trend seems like a steeper positive change.
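A minimal sketch of how the same trend can be drawn at different aspect ratios in ggplot2. The data values are hypothetical and only stand in for the life expectancy data in the figure; theme(aspect.ratio = ) sets the panel height relative to its width.

```r
library(ggplot2)

# Hypothetical values, for illustration only
life <- data.frame(
  year = seq(1960, 2020, by = 10),
  life_expectancy = c(60, 63, 66, 69, 71, 73, 75)
)

trend <- ggplot(life, aes(x = year, y = life_expectancy)) +
  geom_line() +
  labs(x = "Year", y = "Life expectancy")

# aspect.ratio is the panel height divided by the panel width
wide   <- trend + theme(aspect.ratio = 1 / 2)  # trend looks flatter
square <- trend + theme(aspect.ratio = 1)
tall   <- trend + theme(aspect.ratio = 2)      # trend looks steeper
```

Printing wide, square, and tall side by side shows the same line appearing to rise at different rates.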

Choose colors with a purpose

Color for grouping

The title reads "Estimated share of children with blood levels at or above 5 micrograms per deciliter." Each country is shown as a circle scattered along the y axis, which goes from higher rates of elevated lead levels (around 100%) to lower rates of elevated lead levels, all the way down to zero percent. Each circle has a color that represents the region: Africa, Asia, Europe, Middle East, North America, Oceania, or South America.

Color for representing numeric values

The title of the plot reads "Local news is now an endangered species in much of the United States." The plot shows a county-level US map. Each county is colored according to the legend, which ranges from none (shown in red) to 10+ news outlets (shown in green).

Color for emphasis

The title of the plot states "Warmth in the Gulf of Mexico." On the x axis we see months; on the y axis we see values ranging from 0 to 80 kJ/cm^2. We are also provided the text "This chart shows a measure of ocean heat content expressed by kilojoules per square centimeter." There are many gray curves, each representing an individual year, and a dotted curve showing the 2012-2023 average. These curves seem to peak between August and October. There is one curve that is red and has a specific point labeled Oct 7, 2024. This curve sits above the dotted curve.

Color Theory

The Hue bar (top) shows the full range of color hues mapped to degrees from 0° to 360°, wrapping around the color wheel—starting at red, through yellow, green, blue, magenta, and back to red. The Saturation bar (middle) shows how "intense" or "pure" the color is, going from 0% (completely desaturated, i.e., grayscale) to 100% (fully saturated, pure color). The Lightness/Brightness bar (bottom) shows how light or dark the color is, from 0% (black) to 100% (white), with the pure color appearing in the middle when lightness is 50%.
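The three bars can be explored with base R. This is only a sketch: base R's hsv() works with hue, saturation, and value rather than lightness, so it is a close relative of the scheme described above and not the same model. Hue is given as a fraction of the color wheel instead of degrees.

```r
# Hue: divide degrees by 360 to get the fraction hsv() expects
hues <- hsv(h = c(0, 120, 240) / 360, s = 1, v = 1)
hues  # red, green, blue

# Saturation: lowering s washes the red out until no color is left
washed_out <- hsv(h = 0, s = c(1, 0.5, 0), v = 1)

# Value: lowering v darkens the red until it becomes black
darkened <- hsv(h = 0, s = 1, v = c(1, 0.5, 0))
```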

How to Pick a Color Palette

Okabe-Ito Color Palette

In 2008, Masataka Okabe and Kei Ito proposed a color palette that is accessible to people with various color vision deficiencies. The palette is referred to by their last names.

palette.colors(palette = "Okabe-Ito")
[1] "#000000" "#E69F00" "#56B4E9" "#009E73" "#F0E442" "#0072B2" "#D55E00"
[8] "#CC79A7" "#999999"

Okabe-Ito Color Palette

The codes displayed with a hashtag are called hex color codes. You can use hex codes in R (and in HTML) to specify colors.
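A hex code packs the red, green, and blue components of a color into three pairs of hexadecimal digits. As a small illustration using base R, the orange from the Okabe-Ito palette can be taken apart and put back together:

```r
# Split a hex code into its red, green, and blue components (0 to 255)
orange_rgb <- col2rgb("#E69F00")
orange_rgb

# Build the hex code back from the three components
orange_hex <- rgb(230, 159, 0, maxColorValue = 255)
orange_hex
```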

Color-Blindness Simulation

species_bills <- 
  ggplot(penguins,
         aes(x = bill_depth_mm,
             y = bill_length_mm,
             color = species)) +
  geom_point(size = 4) 

By storing the plot as an object named species_bills, we will be able to use it in other functions.

Color-Blindness Simulation

colorblindr::cvd_grid(species_bills) 

The cvd_grid() function from the colorblindr package creates a grid of different color-deficiency simulations.

  • Deuteranomaly is reduced sensitivity to green light.
  • Protanomaly is reduced sensitivity to red light.
  • Tritanomaly is reduced sensitivity to blue light.
  • Desaturated shows the plot with no color differences.

Color-Blindness Simulation

Okabe-Ito Color Palette

species_bills + 
  scale_color_manual(values = c("Adelie" = "#E69F00", "Chinstrap" = "#56B4E9", "Gentoo" = "#009E73"))

Okabe-Ito Color Palette

species_bills + 
  colorblindr::scale_color_OkabeIto()
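If colorblindr is not installed, the same colors can be pulled from base R instead of typing the hex codes by hand. This is a sketch under the assumption that R 4.0.0 or later is used (palette.colors() was introduced there); species_colors is a name made up for this example.

```r
# All nine Okabe-Ito colors as a plain character vector
okabe_ito <- unname(palette.colors(palette = "Okabe-Ito"))

# Skip black (the first color) and assign the next three to the species
species_colors <- c(
  Adelie    = okabe_ito[2],
  Chinstrap = okabe_ito[3],
  Gentoo    = okabe_ito[4]
)
species_colors
```

Passing this vector as scale_color_manual(values = species_colors) should give the same colors as the manually typed hex codes.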

Fonts matter

Fonts matter for clarity

Food packaging that reads "key lime tarts," but the font makes each letter t in the word "tarts" look like an f.

Fonts matter for the message

Two sticky notes, both of which say "please be mine." The left note is written in a curvy, almost cursive font. The right note is written in a font that looks like dripping blood.



Comparison of four numeric styles: Tabular Lining, Proportional Lining, Tabular Oldstyle, and Proportional Oldstyle. Each style displays the number '1984'. Tabular styles align numbers to equal widths; proportional styles use variable widths. Lining styles have uniform height; oldstyle styles use varying heights with some digits extending above or below the baseline.

Tip

Use lining and tabular fonts for numbers.




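# Download the Cabin font from Google Fonts and register it
sysfonts::font_add_google("Cabin")
# Render text with showtext so that the registered font is actually used
showtext::showtext_auto()

ggplot(data = penguins,
       aes(x = bill_length_mm)) +
  geom_histogram() +
  labs(title = "Distribution of Bill Lengths of Palmer Penguins") +
  theme(text = element_text(family = "Cabin"))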

Write alternate text

Screen reader example

The video shows use of a screen reader briefly.

Alternate Text

  • “Alt text” describes contents of an image.
  • Screen-readers cannot read images but can read alt text.
  • Alt text has to be provided.

Manual Alternate Text

  • Chart type

  • Type of data

  • Reason for including the chart

  • Link to data or source (not in alt text but in main text)

Cesal, 2020

  • Description conveys meaning in the data

  • Variables included on the axes

  • Scale described within the description

  • Type of plot is described

Canelón & Hare, 2021
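The checklists above can be turned into a small habit in code. The sketch below assembles a description from its parts and stores it with the plot through labs(alt = ), which ggplot2 supports from version 3.3.4. The helper write_alt_text() is hypothetical, and the built-in mtcars data is used only as a stand-in so that the example runs without extra packages.

```r
library(ggplot2)

# Hypothetical helper: glue the checklist pieces into one description
write_alt_text <- function(chart_type, data_description, takeaway) {
  paste0(chart_type, " of ", data_description, ". ", takeaway)
}

alt <- write_alt_text(
  chart_type = "Scatterplot",
  data_description = paste(
    "car weight in 1000 lbs on the x-axis against",
    "miles per gallon on the y-axis for 32 cars"
  ),
  takeaway = "Heavier cars tend to have lower miles per gallon."
)

# Store the alt text with the plot object
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(alt = alt)

# Retrieve it later, for example to check what a screen reader would get
get_alt_text(p)
```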

Alt Text in Quarto

```{r}
#| fig-align: center
#| fig-cap: Relationship between bill depth (mm) and length (mm) for different species of penguins
#| fig-alt: The scatterplot shows bill depth in mm on the x-axis and bill length in mm on the y-axis with points differently colored for different species as Adelie, Chinstrap, and Gentoo. The x axis ranges from about 12.5 mm to 22.5 mm. The y-axis ranges from about 30 to 60 mm. For all species the relationship seems moderately positive. When comparing the three species, Adelie penguins seem to have longer bill depth but shorter bill length. Chinstraps have longer bill depth and longer bill length. Gentoo penguins have shorter bill depth and longer bill length.  

ggplot(penguins, aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     color = species)) +
  geom_point(size = 4) 
```

Relationship between bill depth (mm) and length (mm) for different species of penguins

Caption vs. Alt Text

Figure captions (fig-cap) appear on the front-end of a document and are accessible to all readers, whether they are reading it directly or via screen readers.

Figure alternate text (fig-alt) only appears on the back-end of a document and is accessible to screen readers and those who know how to inspect the source code of an HTML document.

Even though we are using captions and alternate text in Quarto, these features are available in many other software tools (e.g., Google Docs, PowerPoint).

Label axes and titles meaningfully

Don’t make the viewer squint

Don’t make the viewer do the math

Image Source

An example

Many design decisions go into making a data visualization. The following example is from one of my favorite data visualization experts, Cara Thompson, and is shared under a CC-BY license.

Data context

Table showing a study on the length of odontoblasts (cells responsible for tooth growth) based on type and dose of supplement. The table compares Ascorbic Acid (Vitamin C) and Orange Juice at three doses: 0.5 mg, 1 mg, and 2 mg. Each cell in the table contains rows of guinea pig face icons representing individual subjects in each condition.

This is a bar plot with the x axis labeled 0.5, 1, and 2, one value for each bar. Within each bar we see two colors, red and blue. In the legend the supp variable is defined with red as OJ and blue as VC. The y-axis shows mean length.

This plot is a dodged bar plot where the OJ and VC supplements are shown next to each other as separate bars.

The bars get a white outline.

The legend text changes to Supplement, Orange Juice, and Vitamin C.

The gray background is replaced with a white one.

The x axis is now labeled categorical_dose, and the gap at 1.5 that initially separated the dose 1 and dose 2 bars is gone.
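The first few steps can be approximated with the ToothGrowth data that ships with R. This is a rough sketch of the kind of changes described, not Cara Thompson's original code; the object names and the choice of theme_minimal() for the white background are assumptions.

```r
library(ggplot2)

# Mean length for each supplement and dose
tooth_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
names(tooth_means)[names(tooth_means) == "len"] <- "mean_length"

# Treat dose as categorical so that no empty slot is left at 1.5
tooth_means$categorical_dose <- factor(tooth_means$dose)

# Software defaults: stacked bars, numeric x axis, gray background
default_plot <- ggplot(tooth_means,
                       aes(x = dose, y = mean_length, fill = supp)) +
  geom_col()

# Dodged bars, white outline, readable legend, white background
improved_plot <- ggplot(tooth_means,
                        aes(x = categorical_dose, y = mean_length, fill = supp)) +
  geom_col(position = "dodge", colour = "white") +
  scale_fill_discrete(labels = c("Orange Juice", "Vitamin C")) +
  labs(fill = "Supplement") +
  theme_minimal()
```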

Orange juice and vitamin C are separated into two facets, with orange juice on top as a separate bar plot.

There is a title that reads "In smaller doses, Orange Juice was associated with greater mean tooth growth, compared to equivalent doses of Vitamin C" and a subtitle that reads "With the highest dose, the mean recorded length was almost identical."

Vitamin C bars are now shown in a reddish orange color and orange juice bars in a yellowish orange color.

Dose is introduced to the legend, with low to high doses ranging from light to dark. This change is reflected in the colors of the bars too.

The legend is removed.

The x and y axes are flipped.

The title is bolded and the fonts have changed.

White space is added between the subtitle and the plots.

The y-axis label is removed.

The line spacing between the two lines of the title is reduced.

The words Orange Juice and Vitamin C in the title match the corresponding colors of the bars.
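Several of the later steps can be sketched in the same way with the built-in ToothGrowth data. Again, this is an approximation under assumptions and not the original code: the custom fonts, the dose-based shading, and the colored words in the title are left out, and the spacing values are arbitrary.

```r
library(ggplot2)

tooth_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
tooth_means$supplement <- factor(tooth_means$supp,
                                 levels = c("OJ", "VC"),
                                 labels = c("Orange Juice", "Vitamin C"))
tooth_means$categorical_dose <- factor(tooth_means$dose)

final_sketch <- ggplot(tooth_means,
                       aes(x = categorical_dose, y = len, fill = supplement)) +
  geom_col(colour = "white", show.legend = FALSE) +  # legend removed
  facet_wrap(~ supplement, ncol = 1) +               # Orange Juice facet on top
  coord_flip() +                                     # flip the x and y axes
  labs(
    title = paste0(
      "In smaller doses, Orange Juice was associated with greater\n",
      "mean tooth growth, compared to equivalent doses of Vitamin C"
    ),
    subtitle = "With the highest dose, the mean recorded length was almost identical.",
    x = NULL,  # drops the label of the dose axis, drawn vertically after the flip
    y = "Mean length"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", lineheight = 0.9),  # bold, tighter lines
    plot.subtitle = element_text(margin = margin(b = 15))        # space below subtitle
  )
```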

The toothgrowth figure with the initial software defaults

The toothgrowth figure after all the desired changes

Tip

Do not rely on software defaults for font size, font type, colors, labels, text alignment, legend, etc. without intention.