-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path3_dplyr.qmd
489 lines (355 loc) · 14 KB
/
3_dplyr.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
---
title: "Data Frame Manipulation with _dplyr_"
author: "Author: Software Carpentry, with edits by Jelmer Poelstra. Date:"
date: 2024-08-19
format: gfm
execute:
keep-md: true
editor_options:
chunk_output_type: console
---
<br>
## The _dplyr_ package and the tidyverse
The [_dplyr_](https://cran.r-project.org/package=dplyr) package provides a number
of very useful functions for **manipulating data frames**.
Here we're going to cover some of the most commonly used functions,
and will also use pipes (`%>%`) to combine them.
- `select()` to pick columns (variables)
- `filter()` to pick rows (observations)
- `rename()` to change column names
- `arrange()` to change the order of rows (i.e., to sort a data frame)
- `mutate()` to modify values in columns and create new columns
- `summarize()` (with `group_by()`) to compute summaries across rows
Importantly, all these functions take a data frame as the _input_,
and return a new, modified data frame as the output.
_dplyr_ belongs to a broader family of R packages designed for "dataframe-centric"
data science called the "Tidyverse".
The other tidyverse package that we'll cover in today's workshop is _ggplot2_ for making plots.
You can find more information about the Tidyverse here:
[https://www.tidyverse.org/](https://www.tidyverse.org/).
All of the tidyverse's packages can be loaded at once as follows:
```{r}
library(tidyverse)
```
That will print quote some output which tells you which packages have been loaded
as part of the tidyverse, and which tidyverse functions "conflict with"
(in the sense of having the same name as)
R functions that were already in your environment.
<br>
## The `gapminder` data set
In this section and in the afternoon, we will work with the `gapminder` data set.
This data set is available in a package
(while most packages are built around functions so as to extend R's functionality,
others, like this one, merely contain data sets),
and we can load it as follows:
```{r}
library(gapminder)
```
Let's take a look at the dataset,
which is stored in a data frame also called `gapminder`:
```{r}
gapminder
```
The `gapminder` data frame is a so-called "tibble",
which is the tidyverse version of a data frame.
The main difference is the nicer default printing behavior of tibbles:
e.g. the data types of columns are shown,
and only a limited number of rows will be printed (hence `1,694 more rows`).
As for the dataset itself, note that each row contains some data
for a single country in a specific year
(across 5-year intervals between 1952 and 2007):
- `lifeExp` is the life expectancy in years
- `pop` is the population size
- `gdpPercap` is the per-capita GDP
<br>
## `select()` to pick columns (variables)
To subset a data frame by keeping or removing certain columns,
we can use the `select()` function.
By default, this function will only **keep the columns that you specify**,
which you typically do simply by listing those columns by name:
```{r}
select(.data = gapminder, year, country, gdpPercap)
```
In the command above, the first argument was the data frame,
whereas the other arguments were the (unquoted!) names of columns we wanted to keep.
The order of the columns in the output data frame is exactly as you list them in
`select()`, and doesn't need to be the same as in the input data frame.
In other words, `select()` is also one way to reorder columns.
In the example above, we moved year to come before country, for example.
We can also **specify columns that should be removed**,
by prefacing their name with a `!` (or a `-`):
```{r}
select(.data = gapminder, !continent)
```
There are also ways to select _ranges_ of columns,
and to match columns by their _partial names_,
but that is beyond the scope of this short workshop
(check the `select()` help by typing `?select` to learn more about this).
<br>
## `rename()` to change column names, and the pipe (`%>%`)
Our next _dplyr_ function is one of the simplest:
`rename()` to change column names.
The syntax to specify the new and old name within the function is `new_name = old_name`.
For example, building on the column selection we did above,
we may want to rename the `gdpPercap` column:
```{r}
gapminder_sel <- select(.data = gapminder, year, country, gdpPercap)
rename(.data = gapminder_sel, gdp_per_capita = gdpPercap)
```
It is common to use several (_dplyr_) functions in succession to "wrangle" a dataframe
into a format, and with the data, that we want.
To do so, we could go on like we did above,
successively assigning new data frames and moving on to the next step.
But there is a nicer way of dong this,
using so-called "piping" with a pipe operator:
we will use the `%>%` pipe operator.
Let's start by seeing pipes into action with a reformulation of the code
we used above to first select 3 columns and then rename 1 of them:
```{r}
gapminder %>%
select(year, country, gdpPercap) %>%
rename(gdp_per_capita = gdpPercap)
```
What happened here is that we took the `gapminder` data frame,
pushed (or "piped") it into the `select()` function,
whose output was in turn piped into the `rename()` function.
You can think of the pipe as **"then"**:
take `gapminder`, _then_ select, _then_ rename.
When using pipes, piped input replaces our previous way of specifying the input
with the `.data` argument.
Under the hood, when you pipe something into a function,
this will by default be passed to **the first argument of the function**.
_dplyr_ (and other tidyverse) functions are quite consistent
with the first argument always being a data frame,
which makes them "pipe-friendly".
Using pipes is slightly less typing and more readable than successive assignments.
(It is also faster and uses less computer memory.)
<br>
## `filter()` to pick rows (observations)
The `filter()` function can be used to keep only rows that match a condition.
Whereas column-selection often simply happens by column name as we've seen
with `select()`, row-selection tends to be a more intricate and interesting topic.
Let's start with the following example,
where we want to keep observations (remember, these are countries in a given year)
that have a life expectancy (column `lifeExp`) greater than 80 years:
```{r}
gapminder %>%
filter(lifeExp > 80)
```
So, we specify a _condition_ based on the values in at least one column,
and only the rows that satisfy this condition will be kept.
These conditions don't have to be based on numeric comparisons -- for example:
```{r}
gapminder %>%
filter(continent == "Europe")
```
(Remember to use _two_ equals signs `==` to test for equality!)
### Filter based on multiple conditions
It's also possible to filter based on multiple conditions:
```{r}
gapminder %>%
filter(continent == "Asia", year == 2007, lifeExp > 80)
```
By default, multiple conditions are combined in an _AND_ fashion ---
in other words, in a given row, _each_ condition needs to be met for that column
to be kept.
If you want to combine conditions in an _OR_ fashion,
you should use a `|` as follows:
```{r}
gapminder %>%
filter(lifeExp > 80 | gdpPercap > 10000)
```
Finally, let's practice a bit more with pipelines that use multiple _dplyr_ verbs:
```{r}
gapminder %>%
filter(continent == "Americas") %>%
select(year, country, gdpPercap) %>%
rename(gdp_per_capita = gdpPercap)
```
<br>
<br>
### Challenge 1
Write a single command (which can span multiple lines and include pipes)
that will produce a data frame that has `lifeExp`, `country`, and `year`
for Africa but not for other continents.
How many rows does your data frame have?
<details><summary>Click for the solution</summary>
```{r}
gapminder %>%
filter(continent == "Africa") %>%
select(year, country, lifeExp)
```
It has 624 rows.
</details>
<br>
<br>
## `arrange()` to sort data frames
The `arrange()` function is like the sort function in Excel:
it changes the order of the rows based on the values in one or more columns.
For example, our data set `gapminder` is currently sorted alphabetically by `country`
and then by `year`, but we may instead want to sort observations by population size:
```{r}
gapminder %>%
arrange(pop)
```
Sorting can be useful to see the observations with the smallest or largest values
for a certain column: above we see that the country and year with the smallest
population size is Sao Tome and Principe in 1952.
Default sorting is from small to large, but of course,
we may also want to sort in the reverse order.
You can do this using the `desc()` (descending, large-to-small) helper function:
```{r}
gapminder %>%
arrange(desc(pop))
```
Finally, it is common to want to sort by multiple columns,
where ties in the first column are broken by a second column (and so on) --
to do so, simply list the columns in the appropriate order:
```{r}
gapminder %>%
arrange(continent, country)
```
The above example sorts first by continent and _then_ by country.
<br>
## `mutate()` to modify values in columns and create new columns
So far, we've focused on functions that "merely" subset and reorganize data frames.
We've also seen how we can modify column names.
But we haven't seen how we can _change the data_ or _compute derived data_
in data frames.
We can do this with the `mutate()` function.
For example, say that we want to create a new column that has the population size
in millions rather than in individuals:
called `pop_million` that is the result
of dividing the values in the `pop` column by a million.
```{r}
gapminder %>%
mutate(pop_million = pop / 10^6)
```
So, the code above created a new column called `pop_million` that is the result
of dividing the values in the `pop` column by a million.
To modify a column rather than adding a new one,
simply assign back to the same name:
```{r}
gapminder %>%
mutate(pop = pop / 10^6)
```
<br>
<br>
------
### Challenge 2
**A:** Create a new column called `gdp_billion` that has the absolute GDP
(i.e., not relative to population size) in units of billions.
<details><summary>Click for the solution</summary>
```{r}
gapminder %>%
mutate(gdp_billion = gdpPercap * pop / 10^9)
```
</details>
**B:** (Bonus) Create a new column `planet` that has the value `earth` in every row.
<details><summary>Click for the solution</summary>
If you simply provide a value, this will be repeated in every row:
```{r}
gapminder %>%
mutate(planet = "earth")
```
</details>
------
<br>
<br>
## `summarize()` to compute (per-group) summary statistics
In combination with `group_by()`,
the `summarize()` function can compute data summaries across groups of rows of a
data frame.
First, let's see what `summarize()` does when used by itself:
```{r}
gapminder %>%
summarize(mean_gdpPercap = mean(gdpPercap),
mean_lifeExp = mean(lifeExp))
```
Above, we computed the mean for two columns, across all rows.
This is already useful,
but in combination with the helper function `group_by()`,
`summarize()` becomes really powerful.
For example, let's compute the mean GDP and mean life expectancy separately
for each continent:
```{r}
gapminder %>%
group_by(continent) %>%
summarize(mean_gdpPercap = mean(gdpPercap),
mean_lifeExp = mean(lifeExp))
```
`group_by()` implicitly splits a data frame into groups of rows:
here, one group for observations from each continent.
After that, operations like in `summarize()` will happen separately for each group,
which is how we ended up with per-continent means.
Finally, another powerful feature is that we can _group by multiple variables_ --
for example, by `year` _and_ `continent`:
```{r}
gapminder %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap),
mean_lifeExp = mean(lifeExp))
```
<br>
<br>
-----
### Challenge 3
Calculate the average life expectancy per country.
Which has the longest average life expectancy and which has the shortest
average life expectancy?
<details><summary>Click for the solution</summary>
First, let's create a dataframe with the mean life expectancy by country:
```{r}
lifeExp_bycountry <- gapminder %>%
group_by(country) %>%
summarize(mean_lifeExp = mean(lifeExp))
```
Then, arrange that dataframe in two directions to see the countries with the
longest and shortest life expectance --
piping into `head()` as a bonus to only see the top n, here top 1:
```{r}
lifeExp_bycountry %>%
arrange(mean_lifeExp) %>%
head(n = 1)
lifeExp_bycountry %>%
arrange(desc(mean_lifeExp)) %>%
head(n = 1)
```
</details>
-----
<br>
<br>
## Bonus: `count()` and `n()`
A very common operation is to count the number of observations for each group.
The _dplyr_ package comes with two related functions that help with this.
For instance, if we wanted to check the number of countries included in the
dataset for the year 2002, we can use the `count()` function.
It takes the name of one or more columns that contain the groups we are interested in,
and we can optionally sort the results in descending order by adding `sort = TRUE`:
```{r}
gapminder %>%
filter(year == 2002) %>%
count(continent, sort = TRUE)
```
If we need to use the number of observations in calculations,
the `n()` function is useful.
It will return the total number of observations in the current group rather than
counting the number of observations in each group within a specific column.
For instance, if we wanted to get the standard error of the life expectancy per continent:
```{r}
gapminder %>%
group_by(continent) %>%
summarize(se_life = sd(lifeExp) / sqrt(n()))
```
<br>
## Learn more
This material was adapted from
[this Carpentries lesson episode](https://swcarpentry.github.io/r-novice-gapminder/instructor/12-dplyr.html).
In your journey to become a skilled data frame wrangler in R,
here are some additional topics that are very useful but beyond the scope of this
workshop:
- Joining/merging -- combining multiple dataframes based on one or more shared
columns. This can be done with _dplyr_'s `join_*()` functions.
- Pivoting/reshaping -- moving between 'wide' and 'long' data formats
with `pivot_wider()` and `pivot_longer()` -- this is covered in
[episode 13 of our focal Carpentries lesson](https://swcarpentry.github.io/r-novice-gapminder/instructor/13-tidyr.html).