Author: Software Carpentry, with edits by Jessica Cooperstone. Date: 2024-08-19
In this episode, we will learn about how to create publication-quality
graphics in R using the tidyverse package
ggplot2
. Plotting data is a very
quick and easy way to understand the relationship between your
variables.
The package ggplot2
applies a
framework for plotting such that any plot can be built from the same
basic building blocks. This very popular package is based on a system
called the Grammar of Graphics by Leland Wilkinson which aims to create
a grammatical rules for the development of graphics. It is part of a
larger group of packages called “the tidyverse.”
The “gg” in ggplot stands for “grammar of graphics” and all plots share a common template. This is fundamentally different than plotting using a program like Excel, where you first pick your plot type, and then you add your data. With ggplot, you start with data, add a coordinate system, and then add “geoms,” which indicate what type of plot you want. A cool thing about ggplot is that you can add and layer different geoms together, to create a fully customized plot that is exactly what you want. If this sounds nebulous right now, that’s okay, we are going to talk more about this.
Simplified, we will provide to R:
- our data
- mapping aesthetics - what connects the data to the graphics
- layers - determine which type of plot we are going to make, what cordinate system we will use, what scales we want, and other important aspects of our plot
You can think about a ggplot as being composed of layers. You start with your data, and continue to add layers until you get the plot that you want. This might sound a bit abstract so I am going to talk through this with an example.
Before we plot, let’s look at the data we will use for this session. We
are going to use the same gapminder
data from one hour ago.
In case you don’t have it up from before, let’s load both the
tidyverse
and gapminder
using the function library()
so they are
active.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gapminder)
We can use the function View()
to look at our data. This opens our
data sort of like how we might view it in Excel.
View(gapminder)
We can also use the function str()
or glimpse()
to look at the
structure of our data.
str(gapminder)
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
$ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num [1:1704] 779 821 853 836 740 ...
gapminder
is a tibble (very similar to a data frame), comprised of
1,704 rows and 6 columns. The columns are:
- country (a factor)
- continent (a factor)
- year (an integer)
- lifeExp (a number)
- pop (an integer)
We are going to use the function ggplot()
when we want to make a
ggplot.
Note, the function is called ggplot and the package is called ggplot2
If we want to learn how the function works, we can use the ?
to find
out more.
?ggplot()
We can start by providing the data as the first argument to ggplot()
.
ggplot(data = gapminder)
We have a blank plot! We haven’t given enough information for R to know what we want to plot - we have simply told R what dataframe we will be using. We are getting the first “base” layer of the plot.
Instead of providing the dataframe as the first argument to ggplot()
,
we can use the pipe %>%
(or the newer pipe |>
) that Jelmer taught us
about. This “sends” the data into the next function. I will use this
syntax for the rest of the workshop.
gapminder %>%
ggplot()
Note that both give you the same output.
If we want to make a scatterplot to understand the relationship between
GDP per capita and life expectancy, we can do so by setting x
and y
respectively within aes()
, or our aesthetic mappings.
gapminder %>%
ggplot(mapping = aes(x = gdpPercap, y = lifeExp))
Ok! We don’t have a plot, per se, but we have more than we had before.
We can now see that gdpPercap
is on the x-axis (along with some
numbers reflecting the range of our data), and lifeExp
is on the
y-axis (along with some numbers reflecting the range of our data).
Note that the “mapping” part is actually not necessary.
gapminder %>%
ggplot(aes(x = gdpPercap, y = lifeExp))
Now we need to tell R what geometry, or “geom” want to use. Let’s make a scatterplot here.
gapminder %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point()
All of the “geoms” start with
geom_*()
,
and we can see what they all are by starting to type geom and pressing
tab.
Modify the plot we’ve made so that you can see the relationship between life expectancy and year.
Click for the solution
Map x = year
and y = lifeExp
, and use geom_point()
since we want a
scatterplot.
gapminder %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_point()
Modify to color your points by continent.
Need a hint?
Try using the argument color
within your aesthetic mappings.
Need another hint?
Try setting color = continent
within your aesthetic mappings.
Click for the solution
gapminder %>%
ggplot(aes(x = year, y = lifeExp, color = continent)) +
geom_point()
Instead of making a scatter plot, we might want to make a line plot.
Using ggplot
this is as easy as changing out your geom.
gapminder %>%
ggplot(aes(x = year, y = lifeExp, color = continent)) +
geom_line()
The plot is jumping around a lot since for each year, we have life expectancy data for each country. Our plot here is not a summary of that data, but instead all of that data together, on top of itself.
We might want to have one line for each country, we can do this by
specifying group = country
within our aesthetic mappings.
gapminder %>%
ggplot(aes(x = year, y = lifeExp, color = continent, group = country)) +
geom_line()
A nice thing about ggplot
is that you don’t actually need to decide if
you want to have lines or points, you can have both!
gapminder %>%
ggplot(aes(x = year, y = lifeExp, color = continent, group = country)) +
geom_line() +
geom_point()
Now we see a point for each observation and each point and line is colored based on continent.
You can also set your mappings globally (within ggplot()
) or locally
(within a specific geom). Let’s see what the difference is. Let’s see
what happens when we move color = continent
into
geom_point(aes(color = continent))
.
gapminder %>%
ggplot(aes(x = year, y = lifeExp, group = country)) +
geom_point(aes(color = continent)) +
geom_line()
We can see that now only the points are colored by continent, and the lines are all black (the default color for one group).
Note, the layers are added in the order you indicate, so if you change the order, your plot will change.
Change the order of the point and line layers - what happens?
Click for the solution
Points then lines (points are on the bottom)
gapminder %>%
ggplot(aes(x = year, y = lifeExp, group = country)) +
geom_point() +
geom_line(aes(color = continent))
Lines then points (lines are on the bottom)
gapminder %>%
ggplot(aes(x = year, y = lifeExp, group = country)) +
geom_line(aes(color = continent)) +
geom_point()
Sometimes we might want to apply some kind of transformation to our data while plotting so we can better see relationships between our variables.
Let’s start with a base plot to see the relationship between GDP per
capita (gdpPercap
) and life expectancy (lifeExp
).
gapminder %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point()
The presence of some outliers for gdpPercap
make it hard to see this
relationship. We can try log base 10 transforming the x-axis to see if
this helps using a
scale_*()
function.
gapminder %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
scale_x_log10() # log 10 scales the x axis
I can also make our points a little bit transparent to ease our
overplotting problem (where too many point are on top of each other,
making each point hard to see) by setting alpha =
. Alpha ranges from 0
(totally transparent) to 1 (totally opaque). Note that I did not put
alpha = 0.5
within an aes()
- we are not mapping alpha to some
variable, we are simply setting what alpha should be.
gapminder %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point(alpha = 0.5) + # not inside the aes
scale_x_log10()
Now we can better see the parts of the plot that are very dark are where there are a lot of data points.
We can also add a smoothed line of fit to our data by setting
method = "lm"
within geom_smooth()
to fit a linear model.
gapminder %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point(alpha = 0.5) + # not inside the aes
scale_x_log10() +
geom_smooth(method = "lm") # smooth with a linear model ie "lm"
`geom_smooth()` using formula = 'y ~ x'
We can adjust the thickness of the line by setting linewidth
within
geom_smooth()
.
gapminder %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point(alpha = 0.5) + # not inside the aes
scale_x_log10() +
geom_smooth(method = "lm", linewidth = 3) # smooth with a linear model ie "lm"
`geom_smooth()` using formula = 'y ~ x'
That’s a ridiculously thick line.
Modify the color and the size of the points in the previous example. You’ll also probably want to make the linewidth less ridiculous.
Need a hint?
Don’t put color
and size
inside aes()
.
Want another hint?
The equivalent of linewidth
for points is size
.
Click for the solution
gapminder %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point(alpha = 0.5, color = "purple", size = 0.5) + # not inside the aes
scale_x_log10() +
geom_smooth(method = "lm", linewidth = 1)
`geom_smooth()` using formula = 'y ~ x'
Modify the plot from 4A so points are a different shape and colored by continent with new trendlines
Need a hint?
Don’t put color
and size
inside aes()
.
Want another hint?
The equivalent of linewidth
for points is size
.
Click for the solution
All points are now triangles.
gapminder %>%
ggplot(aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(shape = 17, alpha = 0.5) +
scale_x_log10() +
geom_smooth(method = "lm", linewidth = 1) # smooth with a linear model ie "lm"
`geom_smooth()` using formula = 'y ~ x'
All points are now open triangles, we are setting continent to fill
and making the outline black.
gapminder %>%
ggplot(aes(x = gdpPercap, y = lifeExp, fill = continent)) +
geom_point(shape = 24, alpha = 0.5, color = "black") +
scale_x_log10() +
geom_smooth(method = "lm", linewidth = 1) # smooth with a linear model ie "lm"
`geom_smooth()` using formula = 'y ~ x'
Mapping shape
to continent
gapminder %>%
ggplot(aes(x = gdpPercap, y = lifeExp, color = continent)) +
geom_point(aes(shape = continent), alpha = 0.5) +
scale_x_log10() +
geom_smooth(method = "lm", linewidth = 1) # smooth with a linear model ie "lm"
`geom_smooth()` using formula = 'y ~ x'
[Small multiples](https://en.wikipedia.org/wiki/Small_multiple#:~:text=A%20small%20multiple%20(sometimes%20called,was%20popularized%20by%20Edward%20Tufte.) are a useful way to look at data across the same scale to understand patterns.
Let’s say we want to understand how life expectancy changes over time
throughout the Americas? Instead of making an individual plot for each
country, we can use facet_wrap()
to have ggplot
make our plots all
at once.
First let’s use what Jelmer taught
us
to filter
for only the observations from the Americas.
gapminder_americas <- gapminder %>%
filter(continent == "Americas")
Then, this new data frame gapminder_americas
can be the data for our
next plot. Let’s look first without faceting.
gapminder_americas %>%
ggplot(aes(x = year, y = lifeExp, group = country)) +
geom_line()
Faceting allows us to better see each country on its own.
gapminder_americas %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line() +
facet_wrap(vars(country)) + # make facets by country
theme(axis.text.x = element_text(angle = 45)) # years on the x on a 45 deg angle
The plots we’ve made so far could really benefit from some better
labels. We can set what we want or plot labels to be as arguments in
labs().
gapminder_americas %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line() +
facet_wrap(vars(country)) + # make facets by country
theme(axis.text.x = element_text(angle = 45)) + # years on the x on a 45 deg angle
labs(x = "Year", # x axis title
y = "Life expectancy", # y axis title
title = "Figure 1. Life expectancy in the Americas from 1952-2007", # main title of figure
)
Often we want to take our plot we have made using R and save it for use someplace else. You can export using the Export button in the Plots pane (bottom right) but you are limited on the parameters for the resulting figure.
We can do this with more control using the function
ggsave()
.
First we will save our plot as an object using the assignment operator
<-
.
life_exp_americas_plot <- gapminder_americas %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_line() +
facet_wrap(vars(country)) + # make facets by country
theme(axis.text.x = element_text(angle = 45)) + # years on the x on a 45 deg angle
labs(x = "Year", # x axis title
y = "Life expectancy", # y axis title
title = "Figure 1. Life expectancy in the Americas from 1952-2007", # main title of figure
)
Then we can save it. I am indicating here to save the plot in a folder
called results
in my working directory, as a file called
lifeExp.png
. If you want your file to go within a folder, you have to
first create that folder.
ggsave(filename = "results/lifeExp.png", # file path and name
plot = life_exp_americas_plot, # what to save
width = 18,
height = 12,
dpi = 300, # dots per inch, ie resolution
units = "cm") # units for width and height
Note to learn more about the arguments in
ggsave()
you can always run?ggsave()
.
Create some box plots that compare life expectancy between the continents over the time period provided. Try and make your plot look nice, add labels and stuff!
Need a hint?
The geom for making a boxplot is geom_boxplot()
.
Want another hint?
Set labels within labs().
Adjust theming with theme()
. Check out the
complete themes.
Click for the solution
gapminder %>%
ggplot(aes(x = continent, y = lifeExp, fill = continent)) +
geom_boxplot() +
facet_wrap(vars(year)) +
theme_classic() + # my favorite complete theme
theme(axis.title.x = element_blank(), # remove x-axis title
axis.text.x = element_blank(), # remove x-axis labels
axis.ticks.x = element_blank()) + # remove x-axis ticks
labs(y = "Life Expectancy (years)",
fill = "Continent") # change the label on top of the legend