R visualization - ggplot2


Adapted from the NCEAS dataviz module and R for Data Science

library(ggplot2)
library(dplyr)

Overview

ggplot2 is a popular package for visualizing data in R.

From the home page:

ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

It’s been around for years and has pretty good documentation and tons of example code around the web (like on StackOverflow). This lesson will introduce you to the basic components of working with ggplot2.

Lesson

ggplot vs base vs lattice vs XYZ…

R provides many ways to get your data into a plot. Three common ones are:

“base graphics” (plot, hist, etc.)
lattice
ggplot2

All of them work! I use base graphics for simple, quick and dirty plots. I use ggplot2 for most everything else.

ggplot2 excels at making complicated plots easy and easy plots simple enough.

Geoms / Aesthetics

First steps

Let’s use our first graph to answer a question: Do cars with big engines use more fuel than cars with small engines? You probably already have an answer, but try to make your answer precise. What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear?

The `mpg` data frame

You can test your answer with the mpg data frame found in ggplot2 (a.k.a. ggplot2::mpg). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). mpg contains observations collected by the US Environmental Protection Agency on 38 car models.

mpg

## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans    drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr>    <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto(l5) f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manual(… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manual(… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto(av) f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto(l5) f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manual(… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto(av) f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manual(… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto(l5) 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manual(… 4        20    28 p     comp…
## # … with 224 more rows
## # ℹ Use `print(n = ...)` to see more rows

Among the variables in mpg are:

displ, a car’s engine size, in liters.
hwy, a car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.

To learn more about mpg, open its help page by running ?mpg.

If you want mpg to show up in the “Environment” pane of RStudio, run data(mpg). At that point, you can click on it, then scroll around to explore more of the table.
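It can also help to check the shape of the table from the console before plotting. This is a minimal sketch that uses only the mpg data and standard dplyr/base R helpers:

```r
library(ggplot2)
library(dplyr)

# Number of rows (observations) and columns (variables)
dim(mpg)

# One line per column: name, type, and the first few values
glimpse(mpg)

# Quick numeric summary of the two columns we plot first
summary(select(mpg, displ, hwy))
```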

Creating a ggplot

Every graphic you make in ggplot2 will have at least one aesthetic and at least one geom (layer). The aesthetic maps your data to your geometry (layer). Your geometry specifies the type of plot you’re making (point, bar, etc.).

To plot mpg, run this code to put displ on the x-axis and hwy on the y-axis:

ggplot(data = mpg, aes(x = displ, y = hwy)) + 
  geom_point()

You’ll frequently see people use positional arguments and write this in a shorter form:

ggplot(mpg, aes(displ, hwy)) + 
  geom_point()

That’s okay to do and produces the same plot, but if in doubt, explicitly setting arguments can help avoid confusion.
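Aesthetic mappings do not have to live in the ggplot() call. As a small sketch (the object names p_global and p_local are just illustrative), the same scatter plot can be written with the mapping attached to the geom instead, which becomes useful once a plot has several layers that need different mappings:

```r
library(ggplot2)

# Mapping supplied once in ggplot(): inherited by every layer
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Mapping supplied to the layer itself: applies to that geom only
p_local <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

p_global
p_local
```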

What makes ggplot really powerful is how quickly we can make this plot visualize more aspects of our data. Coloring each point by class (compact, van, pickup, etc.) is just a quick extra bit of code:

ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

Aside: How did I know to write color = class? aes will pass its arguments on to any geoms you use. We can find out which aesthetic mappings geom_point takes with ?geom_point (see the section “Aesthetics”).

Challenge: Find another aesthetic mapping geom_point can take and use it to visualize the data in the drv column.

What if we just wanted the color of the points to be blue? Maybe we’d try this:

ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) +
  geom_point()

Well that’s weird – why are the points red?

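What happened here? This is the difference between setting and mapping in ggplot. The aes function only takes mappings from our data onto our geom, so it treated “blue” as a new variable with a single value and gave it the first color of the default palette. If we want to make all the points blue, we need to set the color inside the geom, outside of aes: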

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")
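To see the difference between mapping and setting without relying on your eyes, you can inspect the values ggplot2 computed for the layer. This sketch uses ggplot2’s layer_data() helper; the object names are illustrative:

```r
library(ggplot2)

p_mapped <- ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) +
  geom_point()
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# Mapped: "blue" is treated as a one-value variable and is assigned
# a colour from the default palette
unique(layer_data(p_mapped)$colour)

# Set: the literal colour is used as-is
unique(layer_data(p_set)$colour)
```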

Challenge: Using the aesthetic you discovered and tried above, set another aesthetic onto our points.

Sizing each point by the number of cylinders is easy:

ggplot(mpg, aes(x = displ, y = hwy, color = class, size = cyl)) +
  geom_point()

So it’s clear we can make scatter and bubble plots. What other kinds of plots can we make? (Hint: Tons)

Let’s make a histogram:

ggplot(mpg, aes(x = hwy)) + 
  geom_histogram()

You’ll see a message (red text):

stat_bin() using bins = 30. Pick better value with binwidth.

ggplot2 can calculate statistics on our data such as frequencies and, in this case, it’s doing that on our hwy column with the stat_bin function. Binning data requires choosing a bin size, and the choice of bin size can completely change our histogram (often resulting in misleading conclusions). In this case we should set the binwidth argument to 1 because we don’t want to hide any of our frequencies:

ggplot(mpg, aes(x = hwy)) + 
  geom_histogram(binwidth = 1)
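geom_histogram() accepts either bins (how many bars) or binwidth (how wide each bar is, in data units). A rough sketch for comparing the two choices by looking at the computed bars rather than the picture:

```r
library(ggplot2)

p_bins  <- ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 30)
p_width <- ggplot(mpg, aes(x = hwy)) + geom_histogram(binwidth = 1)

# One row per bar in the computed layer data
nrow(layer_data(p_bins))
nrow(layer_data(p_width))

# Whatever the binning, the bar counts add up to the number of rows in mpg
sum(layer_data(p_width)$count)
```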

Challenge: Find an aesthetic geom_histogram supports and try it out.

I’m a big fan of box plots and ggplot2 can plot these too:

ggplot(mpg, aes(x = cyl, y = hwy)) + 
  geom_boxplot()

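Oops, we got a warning:

Warning message: Continuous x aesthetic – did you forget aes(group=…)?

That’s because cyl is numeric, so ggplot2 treats the x-axis as continuous and does not know how to split the data into separate boxes. We need to convert cyl to a factor: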

ggplot(mpg, aes(x = factor(cyl), y = hwy)) + 
  geom_boxplot()
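The warning itself hints at a second option: keep cyl numeric and add a group aesthetic. A short sketch of both variants side by side (object names are illustrative):

```r
library(ggplot2)

# Keep cyl numeric on the x-axis, but draw one box per distinct value
p_group <- ggplot(mpg, aes(x = cyl, y = hwy, group = cyl)) +
  geom_boxplot()

# Treat cyl as categorical instead: evenly spaced boxes on a discrete axis
p_factor <- ggplot(mpg, aes(x = factor(cyl), y = hwy)) +
  geom_boxplot()

p_group
p_factor
```

With group = cyl the boxes sit at their numeric positions, so gaps on the axis reflect gaps in the cylinder counts; with factor(cyl) every category gets the same spacing.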

Another type of visualization I use a lot for seeing my distributions is the violin plot:

ggplot(mpg, aes(x = factor(class), y = hwy)) + 
  geom_violin()

How could we make this more visually appealing? How could we color the violins? One option would be to set an aesthetic on the violins themselves:

ggplot(mpg, aes(x = factor(class), y = hwy)) + 
  geom_violin(fill = "cornflowerblue")

We could also give each class its own color by mapping class to the fill aesthetic:

ggplot(mpg, aes(x = factor(class), y = hwy, fill = factor(class))) + 
  geom_violin()

So far we’ve made really simple plots: One geometry per plot. Let’s layer multiple geometries on top of one another to show the raw points on top of the violins:

ggplot(mpg, aes(x = factor(class), y = hwy)) + 
  geom_violin() +
  geom_point(shape = 1)

Our points got added, but it’s impossible to see the “pileup” of points because they’re all plotted on top of each other. Adding some “jitter” can help with that.

ggplot(mpg, aes(x = factor(class), y = hwy)) + 
  geom_violin() +
  geom_point(shape = 1, position = "jitter")
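By default, position = "jitter" nudges points both horizontally and vertically. If you would rather keep the hwy values exact and only spread points sideways, geom_jitter() exposes width and height arguments. A minimal sketch (the 0.2 width is an arbitrary choice, not a recommendation):

```r
library(ggplot2)

set.seed(42)  # jitter is random; a seed makes the plot reproducible

p_jitter <- ggplot(mpg, aes(x = factor(class), y = hwy)) +
  geom_violin() +
  # width: horizontal spread; height = 0 leaves the hwy values untouched
  geom_jitter(shape = 1, width = 0.2, height = 0)

p_jitter
```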

Some geoms can do even more than just show us our data. ggplot2 also helps us do some quick-and-dirty modeling:

ggplot(mpg, aes(x = cty, y = hwy)) + 
  geom_point() +
  geom_smooth()

Notice the message in red text:

geom_smooth() using method = ‘loess’

geom_smooth defaulted here to using a LOESS smoother. But geom_smooth() is pretty configurable. Here we set the method to lm instead of the default loess:

ggplot(mpg, aes(cty, hwy)) + 
  geom_point() +
  geom_smooth(method = "lm")
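The straight line drawn with method = "lm" is an ordinary linear regression of y on x. As a sketch, you can fit the same model yourself with lm() to read off the numbers behind the line; se = FALSE is an optional extra that hides the confidence band:

```r
library(ggplot2)

# The same model that geom_smooth(method = "lm") fits for this plot
fit <- lm(hwy ~ cty, data = mpg)
coef(fit)  # intercept and slope of the fitted line

# Spelling out the formula also silences the message about defaults
p_lm <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_lm
```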

More on geoms built in to ggplot here: http://ggplot2.tidyverse.org/reference/index.html#section-layer-geoms and more that you can add by installing packages here: https://exts.ggplot2.tidyverse.org/gallery/

Setting plot limits

Plot limits can be controlled in one of three ways:

Filter the data (because limits are auto-calculated from the data ranges)
Set the limits argument on one or both scales
Set the xlim and ylim arguments in coord_cartesian()

Let’s show this with an example plot:

ggplot(economics, aes(date, unemploy)) + 
  geom_line()

Since we’re plotting data where the zero point on the vertical axis means something, maybe we want to start the vertical axis at 0:

ggplot(economics, aes(date, unemploy)) + 
  geom_line() +
  scale_y_continuous(limits = c(0, max(economics$unemploy)))

Or, for an easier-to-remember shortcut, use ylim:

ggplot(economics, aes(x = date, y = unemploy)) + 
  geom_line() +
  ylim(0,20000)

Or maybe we want to zoom in on just the 2000s and beyond:

ggplot(economics, aes(date, unemploy)) + 
  geom_line() +
  scale_y_continuous(limits = c(0, max(economics$unemploy))) +
  scale_x_date(limits = c(as.Date("2000-01-01"), as.Date(Sys.time())))

Note the warning message we received:

Warning message: Removed 390 rows containing missing values (geom_path).

That’s normal when data in your input data.frame fall outside the scale limits: ggplot2 drops those rows before drawing.

Let’s use coord_cartesian instead to change the x and y limits:

ggplot(economics, aes(date, unemploy)) + 
  geom_line() +
  coord_cartesian(xlim = c(as.Date("2000-01-01"), as.Date(Sys.time())),
                  ylim = c(0, max(economics$unemploy)))

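Note the slight difference when using coord_cartesian: there is no warning about removed rows, and the line now runs right up to the left edge of the panel instead of starting after a small gap. That’s because coord_cartesian zooms in on the plot without dropping any data, whereas scale limits discard values outside their range. Sometimes we want this and sometimes we don’t, and it’s good to know this difference.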
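The distinction matters most when a layer computes statistics. The sketch below swaps in a box plot of mpg (not the economics data above) purely to illustrate this: limits set on the scale change the medians, while limits set on the coordinate system do not.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = factor(cyl), y = hwy)) + geom_boxplot()

# Scale limits: values outside 20-30 are discarded before the boxes are computed
clipped <- suppressWarnings(layer_data(base + ylim(20, 30)))

# Coordinate limits: all data are used; only the visible window changes
zoomed <- layer_data(base + coord_cartesian(ylim = c(20, 30)))

clipped$middle  # medians of the values that survived the limits
zoomed$middle   # medians of the full data
```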

Scales

Scales control how data values are mapped onto aesthetics. The usual use cases are changing scale limits or changing the way our data are mapped onto our geom. We’ll use scales in ggplot2 very often!

For example, how do we override the default colors ggplot2 uses here?

ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

Tip: Most scales follow the format scale_{aesthetic}_{method}, where aesthetic is one of our aesthetic mappings (such as color, fill, or shape) and method is how the colors, fill colors, or shapes are chosen.

ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point() + 
  scale_color_manual(values = c("red", "orange", "yellow", "green", "blue", "purple", "violet")) # ROYGBIV

I’m sure that was a ton of fun to type out but we can make things easier on ourselves:

ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point() + 
  scale_color_hue(h = c(270, 360)) # blue to red
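The scale_{aesthetic}_{method} naming pattern makes it easy to guess other scales. As a sketch, here are two more color scales that ship with ggplot2; the palette name "Dark2" is just one of the ColorBrewer options:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

# aesthetic = color, method = brewer: a ready-made ColorBrewer palette
p_brewer <- p + scale_color_brewer(palette = "Dark2")

# aesthetic = color, method = viridis_d: a discrete viridis palette
p_viridis <- p + scale_color_viridis_d()

p_brewer
p_viridis
```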

Above we were using scales to scale the color aesthetic. We can also use scales to rescale our data. Here’s some census data, unscaled:

ggplot(midwest, aes(x = area, y = poptotal)) + 
  geom_point()

And scaled (log10):

ggplot(midwest, aes(area, poptotal)) + 
  geom_point() + 
  scale_y_log10()

We could have also log-scaled the y-axis like this:

ggplot(midwest, aes(x = area, y = log10(poptotal))) + 
    geom_point()

What’s the difference? (look closely!)
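If you want to check your answer, one way is to compare what the two versions actually draw. This sketch uses layer_data() to pull out the plotted positions; the object names are illustrative:

```r
library(ggplot2)

p_scale <- ggplot(midwest, aes(x = area, y = poptotal)) +
  geom_point() +
  scale_y_log10()

p_transformed <- ggplot(midwest, aes(x = area, y = log10(poptotal))) +
  geom_point()

# Compare the point positions in the two versions
all.equal(sort(layer_data(p_scale)$y), sort(layer_data(p_transformed)$y))

# The points land in the same places, so the difference is on the axis:
# scale_y_log10() labels the ticks in the original units (people), while
# log10(poptotal) labels them in log units
```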

Scales can also be used to change our axes. For example, we can override the labels:

hwybyclass <- summarize(group_by(mpg, class), maxhwy = max(hwy))
ggplot(hwybyclass, aes(x = class, y = maxhwy)) +
  geom_col() +
  scale_x_discrete(labels = paste("group", 1:7))

Or change the breaks:

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +
  scale_y_continuous(breaks = seq(min(mpg$hwy), max(mpg$hwy)))

Facets

Facets allow us to create a powerful visualization called a small multiple:

http://www.latimes.com/local/lanow/la-me-g-california-drought-map-htmlstory.html

I use small multiples all the time when I have a variable like a site or year and I want to quickly compare across years. Let’s compare highway fuel economy versus engine displacement across the two model years in our data:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ year)

Or fuel economy versus engine displacement across manufacturer:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ manufacturer)
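facet_wrap() has a few arguments worth knowing, and facet_grid() lays panels out by two variables at once. A brief sketch using the drv and year columns of mpg (the choice of columns is only for illustration):

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# One column of panels, one per drive type; axes are shared across panels
p_wrap <- p + facet_wrap(~ drv, ncol = 1)

# Let each panel pick its own y-axis range
p_free <- p + facet_wrap(~ drv, ncol = 1, scales = "free_y")

# Rows by drive type, columns by model year
p_grid <- p + facet_grid(drv ~ year)

p_wrap
p_grid
```

Shared axes make panels directly comparable; free scales show more detail within each panel at the cost of that comparability.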

Plot customization

ggplot2 offers a very high level of customizability through the theme function and pre-set themes, in a way that I think is fairly easy to discover and remember.

ggplot2 comes with a set of themes, which are a quick way to give your plots a different look. Let’s use a theme other than the default:

ggplot(mpg, aes(displ, hwy, color = class)) + 
  geom_point() +
  theme_classic()

Challenge: Find another theme and use it instead.

The legend in ggplot2 is a thematic element. Let’s change the way the legend displays:

ggplot(mpg, aes(displ, hwy, color = class)) + 
  geom_point() +
  theme_classic() +
  theme(legend.position = "bottom",
        legend.background = element_rect(fill = "#EEEEEE", color = "black"),
        legend.title = element_blank(),
        axis.title = element_text(size = 16))

Let’s adjust our axis labels and title:

ggplot(mpg, aes(displ, hwy, color = class)) + 
  geom_point() +
  theme_classic() +
  theme(legend.position = c(1, 1),
        legend.justification = c(1,1),
        legend.direction = "horizontal",
        legend.title = element_blank()) +
  xlab("Engine Displacement") +
  ylab("Highway Fuel Economy (miles / gallon)") +
  ggtitle("Highway fuel economy versus engine displacement",
          "or why do you need that big truck again?")
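As an alternative to separate xlab(), ylab(), and ggtitle() calls, ggplot2’s labs() function sets all of these labels in one place. A sketch of the same plot with a simpler legend placement (the name p_labelled is illustrative):

```r
library(ggplot2)

p_labelled <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  theme_classic() +
  theme(legend.position = "bottom",
        legend.title = element_blank()) +
  # labs() sets axis labels, title, and subtitle in one call
  labs(x = "Engine Displacement",
       y = "Highway Fuel Economy (miles / gallon)",
       title = "Highway fuel economy versus engine displacement",
       subtitle = "or why do you need that big truck again?")

p_labelled
```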

Challenge: Look at the help for ?theme and try changing something else about the above plot.

More themes are available in a user-contributed package called ggthemes.

Saving plots

Let’s save that great plot we just made. Saving plots in ggplot is done with the ggsave() function:

ggsave("hwy_vs_displ.png")

By default, ggsave saves the last plot you displayed. It automatically chooses the format based on your file extension and guesses a default image size. We can customize the size with the width and height arguments:

ggsave("hwy_vs_displ.png", width = 6, height = 6)
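When a script builds several plots, it is safer to store the plot in a variable and pass it to ggsave() explicitly. A minimal sketch: the output path under tempdir() is a hypothetical location chosen so the example does not write into your project folder, and the .pdf extension shows that the same call can produce a vector file.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Hypothetical output location for this sketch
out_file <- file.path(tempdir(), "hwy_vs_displ.pdf")

# plot = p saves this specific plot rather than the last one displayed;
# units states what width and height are measured in
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")

file.exists(out_file)
```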

Resources

Multiple graphs in a single graph:
Book on ggplot: https://www.amazon.com/dp/331924275X/ref=cm_sw_su_dp
- Source code: https://github.com/hadley/ggplot2-book