3_dplyr.qmd

---
title: "Data Frame Manipulation with _dplyr_"
author: "Author: Software Carpentry, with edits by Jelmer Poelstra. Date:"
date: 2024-08-19
format: gfm
execute:
  keep-md: true
editor_options: 
  chunk_output_type: console
---

<br>

## The _dplyr_ package and the tidyverse

The [_dplyr_](https://cran.r-project.org/package=dplyr) package provides a number
of very useful functions for **manipulating data frames**.

Here we're going to cover some of the most commonly used functions,
and will also use pipes (`%>%`) to combine them.

- `select()` to pick columns (variables)
- `filter()` to pick rows (observations)
- `rename()` to change column names
- `arrange()` to change the order of rows (i.e., to sort a data frame)
- `mutate()` to modify values in columns and create new columns
- `summarize()` (with `group_by()`) to compute summaries across rows 

Importantly, all these functions take a data frame as the _input_,
and return a new, modified data frame as the output.

_dplyr_ belongs to a broader family of R packages designed for "dataframe-centric"
data science called the "Tidyverse".
The other tidyverse package that we'll cover in today's workshop is _ggplot2_ for making plots.
You can find more information about the Tidyverse here:
[https://www.tidyverse.org/](https://www.tidyverse.org/).

All of the tidyverse's packages can be loaded at once as follows:

```{r}
library(tidyverse)
```

That will print quote some output which tells you which packages have been loaded
as part of the tidyverse, and which tidyverse functions "conflict with"
(in the sense of having the same name as)
R functions that were already in your environment.

<br>

## The `gapminder` data set

In this section and in the afternoon, we will work with the `gapminder` data set.
This data set is available in a package
(while most packages are built around functions so as to extend R's functionality,
others, like this one, merely contain data sets),
and we can load it as follows:

```{r}
library(gapminder)
```

Let's take a look at the dataset,
which is stored in a data frame also called `gapminder`:

```{r}
gapminder
```

The `gapminder` data frame is a so-called "tibble",
which is the tidyverse version of a data frame.
The main difference is the nicer default printing behavior of tibbles:
e.g. the data types of columns are shown,
and only a limited number of rows will be printed (hence `1,694 more rows`).

As for the dataset itself, note that each row contains some data
for a single country in a specific year
(across 5-year intervals between 1952 and 2007):

- `lifeExp` is the life expectancy in years
- `pop` is the population size
- `gdpPercap` is the per-capita GDP

<br>

## `select()` to pick columns (variables)

To subset a data frame by keeping or removing certain columns,
we can use the `select()` function.

By default, this function will only **keep the columns that you specify**,
which you typically do simply by listing those columns by name:

```{r}
select(.data = gapminder, year, country, gdpPercap)
```

In the command above, the first argument was the data frame,
whereas the other arguments were the (unquoted!) names of columns we wanted to keep.

The order of the columns in the output data frame is exactly as you list them in
`select()`, and doesn't need to be the same as in the input data frame.
In other words, `select()` is also one way to reorder columns.
In the example above, we moved year to come before country, for example.

We can also **specify columns that should be removed**,
by prefacing their name with a `!` (or a `-`):

```{r}
select(.data = gapminder, !continent)
```

There are also ways to select _ranges_ of columns,
and to match columns by their _partial names_,
but that is beyond the scope of this short workshop
(check the `select()` help by typing `?select` to learn more about this).

<br>

## `rename()` to change column names, and the pipe (`%>%`)

Our next _dplyr_ function is one of the simplest:
`rename()` to change column names.

The syntax to specify the new and old name within the function is `new_name = old_name`.
For example, building on the column selection we did above,
we may want to rename the `gdpPercap` column:

```{r}
gapminder_sel <- select(.data = gapminder, year, country, gdpPercap)

rename(.data = gapminder_sel, gdp_per_capita = gdpPercap)
```

It is common to use several (_dplyr_) functions in succession to "wrangle" a dataframe
into a format, and with the data, that we want.
To do so, we could go on like we did above,
successively assigning new data frames and moving on to the next step.

But there is a nicer way of dong this,
using so-called "piping" with a pipe operator:
we will use the `%>%` pipe operator.

Let's start by seeing pipes into action with a reformulation of the code
we used above to first select 3 columns and then rename 1 of them:

```{r}
gapminder %>%
  select(year, country, gdpPercap) %>%
  rename(gdp_per_capita = gdpPercap)
```

What happened here is that we took the `gapminder` data frame,
pushed (or "piped") it into the `select()` function,
whose output was in turn piped into the `rename()` function.

You can think of the pipe as **"then"**:
take `gapminder`, _then_ select, _then_ rename.

When using pipes, piped input replaces our previous way of specifying the input
with the `.data` argument.

Under the hood, when you pipe something into a function,
this will by default be passed to **the first argument of the function**.
_dplyr_ (and other tidyverse) functions are quite consistent
with the first argument always being a data frame,
which makes them "pipe-friendly".

Using pipes is slightly less typing and more readable than successive assignments.
(It is also faster and uses less computer memory.)

<br>

## `filter()` to pick rows (observations)

The `filter()` function can be used to keep only rows that match a condition.
Whereas column-selection often simply happens by column name as we've seen
with `select()`, row-selection tends to be a more intricate and interesting topic.

Let's start with the following example,
where we want to keep observations (remember, these are countries in a given year)
that have a life expectancy (column `lifeExp`) greater than 80 years:

```{r}
gapminder %>%
  filter(lifeExp > 80)
```

So, we specify a _condition_ based on the values in at least one column,
and only the rows that satisfy this condition will be kept.

These conditions don't have to be based on numeric comparisons -- for example:

```{r}
gapminder %>%
  filter(continent == "Europe")
```

(Remember to use _two_ equals signs `==` to test for equality!)

### Filter based on multiple conditions

It's also possible to filter based on multiple conditions:

```{r}
gapminder %>%
  filter(continent == "Asia", year == 2007, lifeExp > 80)
```

By default, multiple conditions are combined in an _AND_ fashion ---
in other words, in a given row, _each_ condition needs to be met for that column
to be kept.

If you want to combine conditions in an _OR_ fashion,
you should use a `|` as follows:

```{r}
gapminder %>%
  filter(lifeExp > 80 | gdpPercap > 10000)
```

Finally, let's practice a bit more with pipelines that use multiple _dplyr_ verbs:

```{r}
gapminder %>%
  filter(continent == "Americas") %>%
  select(year, country, gdpPercap) %>%
  rename(gdp_per_capita = gdpPercap)
```

<br>
<br>

### Challenge 1

Write a single command (which can span multiple lines and include pipes)
that will produce a data frame that has `lifeExp`, `country`, and `year`
for Africa but not for other continents. 
How many rows does your data frame have?

<details><summary>Click for the solution</summary>
```{r}
gapminder %>%
  filter(continent == "Africa") %>%
  select(year, country, lifeExp)
```

It has 624 rows.

</details>

<br>
<br>

## `arrange()` to sort data frames

The `arrange()` function is like the sort function in Excel:
it changes the order of the rows based on the values in one or more columns.
For example, our data set `gapminder` is currently sorted alphabetically by `country`
and then by `year`, but we may instead want to sort observations by population size:

```{r}
gapminder %>%
  arrange(pop)
```

Sorting can be useful to see the observations with the smallest or largest values
for a certain column: above we see that the country and year with the smallest
population size is Sao Tome and Principe in 1952.

Default sorting is from small to large, but of course,
we may also want to sort in the reverse order.
You can do this using the `desc()` (descending, large-to-small) helper function:

```{r}
gapminder %>%
  arrange(desc(pop))
```
 
Finally, it is common to want to sort by multiple columns,
where ties in the first column are broken by a second column (and so on) --
to do so, simply list the columns in the appropriate order:

```{r}
gapminder %>%
  arrange(continent, country)
```

The above example sorts first by continent and _then_ by country.

<br>

## `mutate()` to modify values in columns and create new columns

So far, we've focused on functions that "merely" subset and reorganize data frames.
We've also seen how we can modify column names.
But we haven't seen how we can _change the data_ or _compute derived data_
in data frames.

We can do this with the `mutate()` function.
For example, say that we want to create a new column that has the population size
in millions rather than in individuals:

called `pop_million` that is the result
of dividing the values in the `pop` column by a million.

```{r}
gapminder %>%
  mutate(pop_million = pop / 10^6)
```

So, the code above created a new column called `pop_million` that is the result
of dividing the values in the `pop` column by a million. 

To modify a column rather than adding a new one,
simply assign back to the same name:

```{r}
gapminder %>%
  mutate(pop = pop / 10^6)
```

<br>
<br>

------

### Challenge 2

**A:** Create a new column called `gdp_billion` that has the absolute GDP
       (i.e., not relative to population size) in units of billions.

<details><summary>Click for the solution</summary>

```{r}
gapminder %>%
    mutate(gdp_billion = gdpPercap * pop / 10^9)
```
</details>

**B:** (Bonus) Create a new column `planet` that has the value `earth` in every row.

<details><summary>Click for the solution</summary>

If you simply provide a value, this will be repeated in every row:

```{r}
gapminder %>%
  mutate(planet = "earth")
```

</details>

------

<br>
<br>

## `summarize()` to compute (per-group) summary statistics

In combination with `group_by()`,
the `summarize()` function can compute data summaries across groups of rows of a
data frame.

First, let's see what `summarize()` does when used by itself:

```{r}
gapminder %>%
  summarize(mean_gdpPercap = mean(gdpPercap),
            mean_lifeExp = mean(lifeExp))
```

Above, we computed the mean for two columns, across all rows.
This is already useful,
but in combination with the helper function `group_by()`,
`summarize()` becomes really powerful.

For example, let's compute the mean GDP and mean life expectancy separately
for each continent:

```{r}
gapminder %>%
  group_by(continent) %>%
  summarize(mean_gdpPercap = mean(gdpPercap),
            mean_lifeExp = mean(lifeExp))
```

`group_by()` implicitly splits a data frame into groups of rows:
here, one group for observations from each continent.
After that, operations like in `summarize()` will happen separately for each group,
which is how we ended up with per-continent means.

Finally, another powerful feature is that we can _group by multiple variables_ --
for example, by `year` _and_ `continent`:

```{r}
gapminder %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap),
            mean_lifeExp = mean(lifeExp))
```

<br>
<br>

-----

### Challenge 3

Calculate the average life expectancy per country.
Which has the longest average life expectancy and which has the shortest
average life expectancy?

<details><summary>Click for the solution</summary>

First, let's create a dataframe with the mean life expectancy by country:

```{r}
lifeExp_bycountry <- gapminder %>%
   group_by(country) %>%
   summarize(mean_lifeExp = mean(lifeExp))
```

Then, arrange that dataframe in two directions to see the countries with the
longest and shortest life expectance --
piping into `head()` as a bonus to only see the top n, here top 1:

```{r}
lifeExp_bycountry %>%
   arrange(mean_lifeExp) %>%
   head(n = 1)

lifeExp_bycountry %>%
   arrange(desc(mean_lifeExp)) %>%
   head(n = 1)
```

</details>

-----

<br>
<br>

## Bonus: `count()` and `n()`

A very common operation is to count the number of observations for each group.
The _dplyr_ package comes with two related functions that help with this.

For instance, if we wanted to check the number of countries included in the
dataset for the year 2002, we can use the `count()` function.
It takes the name of one or more columns that contain the groups we are interested in,
and we can optionally sort the results in descending order by adding `sort = TRUE`:

```{r}
gapminder %>%
    filter(year == 2002) %>%
    count(continent, sort = TRUE)
```

If we need to use the number of observations in calculations,
the `n()` function is useful.
It will return the total number of observations in the current group rather than
counting the number of observations in each group within a specific column.
For instance, if we wanted to get the standard error of the life expectancy per continent:

```{r}
gapminder %>%
  group_by(continent) %>%
  summarize(se_life = sd(lifeExp) / sqrt(n()))
```

<br>

## Learn more

This material was adapted from
[this Carpentries lesson episode](https://swcarpentry.github.io/r-novice-gapminder/instructor/12-dplyr.html).

In your journey to become a skilled data frame wrangler in R,
here are some additional topics that are very useful but beyond the scope of this
workshop:

- Joining/merging -- combining multiple dataframes based on one or more shared
  columns. This can be done with _dplyr_'s `join_*()` functions.

- Pivoting/reshaping -- moving between 'wide' and 'long' data formats
  with `pivot_wider()` and `pivot_longer()` -- this is covered in
  [episode 13 of our focal Carpentries lesson](https://swcarpentry.github.io/r-novice-gapminder/instructor/13-tidyr.html).