RStudio training ggplot2 tutorial


This lesson, “An introduction to ggplot2”, was adapted from The Carpentries licenced under CC BY 4.0.


Objectives:

  • To be able to use ggplot2 to generate publication quality graphics.
  • To understand the basic grammar of graphics, including the aesthetics and geometry layers, transforming scales, and coloring or panelling by groups.

Introduction to ggplot2

Plotting our data is one of the best ways to quickly explore it and the various relationships between variables.

Today we’ll be learning about the ggplot2 package, because it is the most effective for creating publication quality graphics.

ggplot2 is built on the grammar of graphics, the idea that any plot can be expressed from the same set of components: a data set, a coordinate system, and a set of geoms–the visual representation of data points.

The key to understanding ggplot2 is thinking about a figure in layers. This idea may be familiar to you if you have used image editing programs like Photoshop, Illustrator, or Inkscape.

Let’s load ggplot2 and the gapminder dataset and create our first plot:

# First we'll load the gapminder data and the ggplot2 library
library(gapminder)
library(ggplot2)
# library(tidyverse) contains ggplot2 too!
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

So the first thing we do is call the ggplot function. This function lets R know that we’re creating a new plot, and any of the arguments we give the ggplot function are the global options for the plot: they apply to all layers on the plot.

We’ve passed in two arguments to ggplot. First, we tell ggplot what data we want to show on our figure, in this example the gapminder data we read in earlier. For the second argument we passed in the aes function, which tells ggplot how variables in the data map to aesthetic properties of the figure, in this case the x and y locations. Here we told ggplot we want to plot the “gdpPercap” column of the gapminder data frame on the x-axis, and the “lifeExp” column on the y-axis. Notice that we didn’t need to explicitly pass aes these columns (e.g. x = gapminder[, "gdpPercap"]), this is because ggplot is smart enough to know to look in the data for that column!

By itself, the call to ggplot isn’t enough to draw a figure:

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp))

We need to tell ggplot how we want to visually represent the data, which we do by adding a new geom layer. In our example, we used geom_point, which tells ggplot we want to visually represent the relationship between x and y as a scatter plot of points:

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()+
  scale_x_log10()

There are also built-in axis transformation methods, for example:

  • scale_y_sqrt()
  • scale_y_reverse()

Not all transformations are built (eg. log2() transformation), in which case we can also do this:

ggplot(data = gapminder, aes(x = log2(gdpPercap), y = lifeExp)) +
  geom_point()


Check-in 1

Fill in the blank to create a scatter plot showing changes in life expectancy over time.

ggplot(___ , ___( ___, ___)) +
  ___()



Additional Aesthetics

We can also colour points based on different factors in the dataset, such as continent

ggplot(data = gapminder, aes(x = log(gdpPercap), y = lifeExp, colour = continent)) +
  geom_point()

Linear Regression

Let’s add a linear regression line to this plot.

ggplot(data = gapminder, aes(x = log(gdpPercap), y = lifeExp, colour = continent)) +
  geom_point()+
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Here we see a linear regression line for each continent, but what is the over all global trend? Let’s redraw that regression line to represent the global trend.

ggplot(data = gapminder, aes(x = log(gdpPercap), y = lifeExp,colour = continent)) +
  geom_point()+
  geom_smooth(method = "lm", colour="black")
## `geom_smooth()` using formula 'y ~ x'

Check-in 2

Discuss with you neighbour:

  1. Based on your observation in the previous two examples, what do you think is the difference between colour that is defined inside or outside aes()? (2 mins)
  2. What is the difference between defining aes() in ggplot() vs in geom_point() such as the one below? (2 mins)
ggplot(data = gapminder, aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(aes(colour = continent))+
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data = gapminder, aes(x = log(gdpPercap), y = lifeExp,colour = continent)) +
  geom_point()+
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'



Geometries

Sometimes, using a scatter plot probably isn’t the best for visualizing change over time, especially when there are multiple countries in a continent.

Instead, let’s tell ggplot to visualize the data as a line plot:

ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country, color=continent)) +
  geom_line()

There are many geom_ options depending on how many variables to plot.

Single variable plots include:

  • geom_histogram()
  • geom_bar()
  • geom_col()
  • geom_density

Two variable plots include:

  • geom_line()
  • geom_point()
  • geom_boxplot()
  • geom_violin()

Here are some additional resources you can refer to for inspiration:

Challenge 1

Without running the following code, which of the following will generate a boxplot showing log2 transformed gdp per capita for each continent? Each continent should be filled in with unique colours.

ggplot(aes(x= continent, y = gdpPercap, fill = continent))+
  geom_boxplot()
ggplot(gapminder, aes(x= continent, y = log2(gdpPercap)), fill = continent)+
  geom_boxplot()
ggplot(gapminder, aes(x= continent, y = log2(gdpPercap), fill = continent))+
  geom_boxplot()
ggplot(gapminder, aes(x= continent, y = gdpPercap))+
  geom_boxplot(fill = continent)+
  scale_x_log2()



Changing and creating labels

Let’s create a violin plot with labels

ggplot(data = gapminder, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_violin()+
  labs(title = "Life expentancy of continents",
      x = "Continents",
      y = "Life expectancy",
      fill = "Continents colours",
      tag = "A")

Let’s use a different background for this plot, there are a few built-in layers we can use:

ggplot(data = gapminder, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_violin()+
  theme_bw() 

We can also use:

  • theme_grey
  • theme_dark
  • theme_light

Challenge 1

Without running the code below, which of the following code will produce violin plot overlayed with a boxplot showing log10-transformed population for each continent? With only the boxplot filled with colour?

ggplot(data = gapminder, aes(x = continent, y = pop, colour = continent)) +
  geom_violin() 
  geom_boxplot(width = 0.3)+
  labs(title = Population by continents,
      x = Continents,
      y = log-transformed life expectancy)
ggplot(data = gapminder, aes(x = continent, y = pop)) +
  geom_violin()+
  geom_boxplot(width = 0.3, aes(fill = continent))+
  scale_y_log10()+
  labs(title = "Population by continents",
      x = "Continents",
      y = "log-transformed life expectancy")
ggplot(data = gapminder, aes(x = continent, y = pop, fill = continent)) +
  geom_violin()+
  geom_boxplot(width = 0.3)+
  scale_y_log10()+
  labs(title = "Population by continents",
      x = "Continents",
      y = "log-transformed life expectancy")
ggplot(data = gapminder, x = continent, y = pop, colour = continent) +
  geom_violin()+
  geom_boxplot(width = 0.3, colour = continent)+
  scale_y_log10()+
  labs(title = "Population by continents",
      x = "Continents",
      y = "log-transformed life expectancy")



Saving plots

Finally, we need to save the plots we created. We can save plots directly from the “Plots” pannel in RStudio by clicking Export. Alternatively, we can encode this into our script as well.

For example, to save a plot as in .pdf we need to open a graphic device to draw our plots into like so:

pdf("My_first_ggplot.pdf", height = 3, width = 5)
ggplot(data = gapminder, aes(x = continent, y = pop)) +
  geom_violin()+
  geom_boxplot(width = 0.3, aes(fill = continent))+
  scale_y_log10()+
  labs(title = "Population by continents",
      x = "Continents",
      y = "log-transformed life expectancy")
dev.off()

It is important here that we remember to close our graphic device when we are finished using dev.off().

There are many options for file formats each with different units for dimensions, which can be found in their help page using ? (eg. ?pdf).

  • bmp()
  • jpeg()
  • png()
  • tiff()



Resources

This is a taste of what you can do with ggplot2. R Studio provides useful cheat sheets of the different layers, and more extensive documentation is available on the ggplot2 website. Finally, I cannot stress enough Googole is your best friend and search will usually send you to a relevant question and answer on Stack Overflow with reusable code to modify!



Related