-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path3_dplyr.qmd
736 lines (531 loc) · 22.2 KB
/
3_dplyr.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
---
title: "Data frame manipulation with _dplyr_"
author: "Software Carpentry / Jelmer Poelstra"
date: 2025-02-11
editor_options:
chunk_output_type: console
---
-------
<br>
## Introduction
### The _dplyr_ package and the tidyverse
The [_dplyr_](https://cran.r-project.org/package=dplyr) package provides a number
of very useful functions for **manipulating data frames**.
In this session, we'll cover some of the most commonly used functions:
- `select()` to pick columns (variables)
- `filter()` to pick rows (observations)
- `rename()` to change column names
- `arrange()` to change the order of rows (i.e., to sort a data frame)
- `mutate()` to modify values in columns and create new columns
- `summarize()` to compute across-row summaries
All these functions take a
**data frame as the _input_, and output a new, modified data frame**.
_dplyr_ belongs to a family of R packages designed for "dataframe-centric"
data science called the ["tidyverse"](https://www.tidyverse.org).
The other tidyverse package we'll cover in today's workshop is _ggplot2_ for making plots.
### Setting up
Let's start a new script for this session:
1. _Open a new R script_ (Click the **`+`** symbol in toolbar at the top, then click `R Script`)^[
Or Click `File` => `New file` => `R Script`.].
2. _Save the script_ straight away as `data-structures.R` --
you can save it anywhere you like, though it is probably best to save it in a
folder specifically for this workshop.
3. If you want the section headers as comments in your script,
as in the script I am showing you now,
then copy-and-paste the following into your script:
<details><summary>Section headers for your script _(Click to expand)_</summary>
```{r}
# 1 - Introduction -------------------------------------------------------------
# 1.3 - Loading the tidyverse
# 1.4 - The gapminder dataset
# 2 - select() to pick columns (variables) -------------------------------------
# 3 - rename() to change column names ------------------------------------------
# 4 - The pipe |> --------------------------------------------------------------
# 5 - filter() to pick rows (observations) -------------------------------------
# Challenge 1
# Write a single command (which can span multiple lines and include pipes) that
# will produce a data frame that has lifeExp, country, and year for Africa but not
# for other continents.
# How many rows does your data frame have?
# 6 - arrange() to sort data frames --------------------------------------------
# 7 - mutate() to modify values in columns and create new columns --------------
# Challenge 2
# A: Use mutate() to create a new column called gdp_billion that has the absolute
# GDP (i.e., not relative to population size) and does so in units of billions
# (i.e., 2.3 would mean 2.3 billion).
# B: (Bonus) Use mutate() to create a new column planet that has the value earth
# in every row. We didn’t cover this kind of scenario but go ahead and test
# your intuition!
# 8 - summarize() to compute group-wise summary stats --------------------------
# Challenge 3
# Calculate the average life expectancy per country. Which has the longest average
# life expectancy and which has the shortest average life expectancy?
```
</details>
<hr style="height:1pt; visibility:hidden;" />
### Loading the tidyverse
All core tidyverse packages can be installed and loaded with a single command.
Since you should already have _installed_ the tidyverse^[
If not: run `install.packages("tidyverse")` now.
], you only need to _load_ it, which you do as follows:
```{r}
library(tidyverse)
```
The output tells you which packages have been loaded as part of the tidyverse.
::: {.callout-note collapse="true"}
#### Learn more: Function name "conflicts" _(Click to expand)_
Also printed above is a "Conflicts" section that may seem ominous.
What this means is that two tidyverse functions, `filter()` and `lag()`,
have the same names as two functions from the _stats_ package that were already
in your R environment.
(Those stats package functions are part of what is often referred to as "base R":
core R functionality that is always available (loaded) when you start R.)
Due to this function name conflict/collision, for example,
the `filter()` function from _dplyr_ "masks" the `filter()` function from _stats_:
that is, if you write a command with `filter()`, it will use the dplyr function
and not the stats function.
You _can_ still use a "masked" function, but you will need to preface it with its
package name as follows: `stats::filter()`.
:::
<hr style="height:1pt; visibility:hidden;" />
### The `gapminder` data set
In this session and the next one on data visualization,
we will work with the `gapminder` data set.
This data set is available in a package of the same name
(while most packages are built around functions so as to extend R's functionality,
others merely contain data sets).
This package, too, you should have already installed^[
If not: run `install.packages("gapminder")` now.
], so you only need to load it right now:
```{r}
library(gapminder)
```
Let's take a look at the dataset:
```{r}
gapminder
```
We can also use the `View()` function to look at the data frame ---
this will open a new tab in our editor pane with a spreadsheet-like look and feel:
```{r, eval=FALSE}
View(gapminder)
# (Should display the dataset in an editor pane tab)
```
The `gapminder` data frame is a so-called
**"tibble", which is the tidyverse variant of a data frame.**
The main difference is the nicer default printing behavior of tibbles:
e.g. the data types of columns are shown,
and only a limited number of rows are printed.
As for the dataset itself, note that each row contains some data
for a single country in a specific year
(across 5-year intervals between 1952 and 2007), with the following columns:
- `country` (stored as a `factor`)
- `continent` (stored as a `factor`)
- `year` (stored as an `integer`)
- `lifeExp` (stored as an `integer`): the **life expectancy** in years
- `pop` (stored as an `integer`): the **population size**
- `gdpPercap` (stored as a `double`): the **per-capita GDP**
<br>
## `select()` to pick columns (variables)
To subset a data frame by keeping or removing certain columns,
we can use the `select()` function.
By default, this function will only **keep the columns that you specify**,
which you typically do simply by listing those columns by name:
```{r}
select(.data = gapminder, year, country, gdpPercap)
```
In the command above, the first argument was the data frame,
whereas the other arguments were the (unquoted!) names of columns we wanted to keep.
The **order of the columns** in the output data frame is exactly as you list them in
`select()`, and doesn't need to be the same as in the input data frame.
In other words, `select()` is also one way to reorder columns.
In the example above, we made `year` appear before `country`, for example.
We can also **specify columns that should be removed**,
by prefacing their name with a `!` (or a `-`):
```{r}
select(.data = gapminder, !continent)
```
<hr style="height:1pt; visibility:hidden;" />
::: {.callout-tip appearance='minimal'}
There are also ways to e.g. select _ranges_ of columns,
but that is beyond the scope of this short workshop.
Check the `select()` help by typing `?select` to learn more.
:::
<br>
## `rename()` to change column names
The next _dplyr_ function is one of the simplest:
`rename()` to change column names.
The syntax to specify the new and old name within the function is `new_name = old_name`
--- for example, we may want to rename the `gdpPercap` column:
```{r}
rename(.data = gapminder, gdp_per_capita = gdpPercap)
```
We can also rename multiple columns at once:
```{r}
rename(.data = gapminder, gdp_per_capita = gdpPercap, life_exp = lifeExp)
```
<br>
## The pipe (`|>`)
So far, our examples applied a single _dplyr_ function to a data frame,
and simply printed the output (a new data frame) to screen.
But in practice, it is common to use several functions in succession to "wrangle"
a dataframe into the format we want.
For example, if we want to first `select()` one or more columns,
and then modify the output further by `rename()`-ing a column, we could:
```{r}
gapminder_sel <- select(.data = gapminder, year, country, gdpPercap)
rename(.data = gapminder_sel, gdp_per_capita = gdpPercap)
```
For more elaborate editing, we could go on like we did above,
successively assigning new data frames and moving on to the next step.
But there is a nicer way of dong this,
using so-called "piping" with a **pipe operator** (`|>`).
Let's start by seeing a reformulation of the code above with pipes:
```{r}
gapminder |>
select(year, country, gdpPercap) |>
rename(gdp_per_capita = gdpPercap)
```
What happened here is that we took the `gapminder` data frame,
pushed (or "piped") it into the `select()` function,
whose output was in turn piped into the `rename()` function.
You can think of the pipe as **"then"**:
take `gapminder`, _then_ select, _then_ rename.
Using pipes is slightly less typing and considerably more readable than using
successive assignments like we did before.^[
Using pipes is also faster and uses less computer memory.
]
When using pipes, we no longer specify the input data frame with the `.data`
argument, because the input is "already" being piped into the function
(by default, the input goes to the function's first argument).
::: {.callout-tip appearance='minimal'}
For code readability, it is good practice to always start a new line after a pipe `|>`
:::
<br>
## `filter()` to pick rows (observations)
The `filter()` function keeps only those rows that satisfy one or more conditions.
It is similar to Filter functionality in spreadsheets ---
except that those only change what you _display_, while `filter()` will _remove_ rows.
But if that sounds scary, recall what we mentioned earlier:
> All these functions take a data frame as the _input_,
> and output a new, modified data frame.
:::{.callout-warning appearance="minimal"}
Therefore, you would _only_ modify existing data if you would assign the output
back to the input object, like in the example below with the `select()` function:
```{r, eval=FALSE}
cats <- cats |> select(name, coat)
```
:::
Let's start with the following example,
where we want to keep observations (remember, these are countries in a given year)
with a life expectancy exceeding 80 years:
```{r}
gapminder |>
filter(lifeExp > 80)
```
So, we specify a _condition_ based on the values in at least one column
to only **keep rows satisfying this condition**.
These conditions don't have to be based on numeric comparisons -- for example:
```{r}
gapminder |>
filter(continent == "Europe")
```
:::callout-warning
### Remember to use _two_ equals signs `==` to test for equality!
:::
<hr style="height:1pt; visibility:hidden;" />
#### Filter based on multiple conditions
It's also possible to filter based on multiple conditions -- for example:
```{r}
gapminder |>
filter(continent == "Asia", year == 2007, lifeExp > 80)
```
By default, multiple conditions are combined in an _AND_ fashion ---
in other words, in a given row, _each_ condition needs to be met for that column
to be kept.
If you want to combine conditions in an _OR_ fashion,
you should use a `|` as follows:
```{r}
gapminder |>
filter(lifeExp > 80 | gdpPercap > 10000)
```
<hr style="height:1pt; visibility:hidden;" />
#### Pipeline practice
Finally, let's practice a bit more with "pipelines" that use multiple _dplyr_ verbs:
```{r}
gapminder |>
filter(continent == "Americas") |>
select(year, country, gdpPercap) |>
rename(gdp_per_capita = gdpPercap)
```
<br>
:::exercise
### {{< fa user-edit >}} Challenge 1 {-}
Write a single command (which can span multiple lines and include pipes)
that will produce a data frame that has `lifeExp`, `country`, and `year`
for Africa but not for other continents.
How many rows does your data frame have?
<details><summary>Click for the solution</summary>
```{r}
gapminder |>
filter(continent == "Africa") |>
select(year, country, lifeExp)
```
It has 624 rows.
</details>
:::
<br>
## `arrange()` to sort data frames
The `arrange()` function is like sorting functionality in Excel:
it changes the order of rows based on the values in one or more columns.
For example, `gapminder` is currently sorted alphabetically by `country`
and then by `year`, but we may instead want to sort by population size:
```{r}
gapminder |>
arrange(pop)
```
Sorting can be useful to see the observations with the smallest or largest values
for a certain column: above we see that the country and year with the smallest
population size is _Sao Tome and Principe_ in 1952.
Default sorting is from small to large, but of course,
we may also want to sort in the reverse order.
You can do this using the `desc()` (descending, large-to-small) helper function:
```{r}
gapminder |>
arrange(desc(pop))
```
Finally, it is common to want to sort by multiple columns,
where ties in the first column are broken by a second column (and so on) --
to do so, simply list the columns in the appropriate order:
```{r}
gapminder |>
arrange(continent, country)
```
The above example sorts first by continent and _then_ by country.
<br>
## `mutate()` to modify values in columns and create new columns
So far, we've focused on functions that "merely" subset and reorganize data frames.
We've also seen how we can modify column names.
But we haven't seen how we can _change the data_ or _compute derived data_
in data frames.
We can do this with the `mutate()` function.
For example, say that we want to create a new column that has population sizes
in millions rather than in individuals:
```{r}
gapminder |>
mutate(pop_million = pop / 10^6)
```
The code above created a new column called `pop_million` that is the result
of dividing the values in the `pop` column by a million.
To modify a column rather than adding a new one,
simply "assign back to the same name":
```{r}
gapminder |>
mutate(pop = pop / 10^6)
```
<br>
:::exercise
### {{< fa user-edit >}} Challenge 2 {-}
**A:** Use `mutate()` to create a new column called `gdp_billion` that has the
**absolute** GDP (i.e., not relative to population size) and does so
in **units of billions** (i.e., `2.3` would mean 2.3 billion).
<details><summary>Click for the solution</summary>
```{r}
gapminder |>
mutate(gdp_billion = gdpPercap * pop / 10^9)
```
</details>
**B:** _(Bonus)_ Use `mutate()` to create a new column `planet` that has the value
`earth` in every row. We didn't cover this kind of scenario, but go ahead and test
your intuition!
<details><summary>Click for the solution</summary>
If you simply provide a value, this will be repeated in every row:
```{r}
gapminder |>
mutate(planet = "earth")
```
</details>
:::
<br>
## `summarize()` to compute (groupwise) summary stats
In combination with `group_by()`,
the `summarize()` function can compute data summaries across groups of rows of a
data frame.
First, let's see what `summarize()` does when used by itself:
```{r}
gapminder |>
summarize(mean_gdpPercap = mean(gdpPercap),
mean_lifeExp = mean(lifeExp))
```
Above, we computed the mean for two columns, across **all** rows.
This is already useful,
but in combination with the helper function `group_by()`,
`summarize()` becomes really powerful by allowing us to compute **groupwise** stats.
For example, let's compute the mean GDP and mean life expectancy separately
for each continent:
```{r}
gapminder |>
group_by(continent) |>
summarize(mean_gdpPercap = mean(gdpPercap),
mean_lifeExp = mean(lifeExp))
```
`group_by()` implicitly splits a data frame into groups of rows:
here, one group for observations from each continent.
After that, operations like in `summarize()` will happen separately for each group,
which is how we ended up with per-continent means.
Finally, another powerful feature is that we can _group by multiple variables_ --
for example, by `year` _and_ `continent`:
```{r}
gapminder |>
group_by(continent, year) |>
summarize(mean_gdpPercap = mean(gdpPercap),
mean_lifeExp = mean(lifeExp))
```
<br>
:::exercise
### {{< fa user-edit >}} Challenge 3 {-}
Calculate the average life expectancy per country.
Which has the longest average life expectancy and which has the shortest
average life expectancy?
<details><summary>Click for the solution</summary>
First, let's create a dataframe with the mean life expectancy by country:
```{r}
lifeExp_bycountry <- gapminder |>
group_by(country) |>
summarize(mean_lifeExp = mean(lifeExp))
```
Then, arrange that dataframe in two directions to see the countries with the
longest and shortest life expectance --
piping into `head()` as a bonus to only see the top n, here top 1:
```{r}
lifeExp_bycountry |>
arrange(mean_lifeExp) |>
head(n = 1)
lifeExp_bycountry |>
arrange(desc(mean_lifeExp)) |>
head(n = 1)
```
</details>
:::
<br>
------
------
<br>
## Bonus material for self-study
### Writing and reading tabular data to and from files
When working with your own data in R,
it is common to want to read data from files into your R environment,
and write data that is in your R environment to files.
While it's possible to have R interact with Excel spreadsheet files^[
Using the [readxl](https://readxl.tidyverse.org/) package.
This is installed as part of the tidyverse but is not a _core_ tidyverse package
and therefore needs to be loaded separately.
], keeping your data in **plain-text files** generally benefits reproducibility.
Tabular plain text files can be stored using:
- A _Tab_ as the column delimiter
(often called TSV files, and stored with a `.tsv` extension)
- A _comma_ as the column delimiter
(often called CSV files, and stored with a `.csv` extension).
We will use functions from the _readr_ package,
which is part of the core tidyverse and therefore already loaded into your environment,
to practice writing and reading data to and from TSV files.
#### Writing files
First, let's write the `gapminder` data frame to file using the `write_tsv()`
function with arguments `x` for the R object and `file` for the file path:
```{r}
write_tsv(x = gapminder, file = "gapminder.tsv")
```
This will have written to a file called `gapminder.tsv` in your R working directory
(you can see where that is at the top of the Console, or by running `getwd()`).
::: callout-tip
### Checking out the file
In RStudio's **Files pane**,
find your new file and click on it.
That way, the file will open in the editor.
:::
::: {.callout-note collapse="true"}
### Want to write to a different folder on your computer? _(Click to expand)_
The `file` argument to `write_tsv()` takes a file _path_,
meaning that you can specify any location on your computer for it it in addition
to its name.
For example, if you had a folder called `results` in your current working directory
(directory is just another word for folder), you could put the file in there:
```{r, eval=FALSE}
write_tsv(x = gapminder, file = "results/gapminder.tsv")
```
Note that a forward slash `/` as a folder delimiter will work regardless of your
operating system
(even though Windows natively delimits folder by a backslash `\`).
Finally, if you want to store the file in a totally different place than where
you are now, note that you can also use so-called absolute (or "full") paths like:
```{r, eval=FALSE}
write_tsv(x = gapminder, file = "/Users/poelstra.1/Desktop/gapminder.tsv")
```
:::
<hr style="height:1pt; visibility:hidden;" />
#### Reading from files
To practice reading data from a file,
we'll use the `read_tsv()` function on the file we just created:
```{r}
gapminder_reread <- read_tsv(file = "gapminder.tsv")
```
Note that the function is rather chatty by default,
telling us how many rows and columns it read in, and what their data types are.
Let's check the resulting object:
```{r}
gapminder_reread
```
This looks good!
Do note that the column's data types are not identical to what they were
(`year` and `pop` are saved as `double` rather than `integer`,
and `country` and `continent` as `character` rather than `factor`).
This is largely expected because that kind of metadata is not stored in a plain-text
TSV, so `read_tsv()` will by default simply make best guesses as to the types.
Alternatively, we could tell read_tsv what the column types should be using
abbreviations (`f` for `factor`, `i` for `integer`, `d` for `double` ---
run `?read_tsv` for more info):
```{r}
read_tsv(file = "gapminder.tsv", col_types = "ffidi")
```
::: {.callout-tip appearance='minimal'}
To read and write CSV files instead, use `read_csv()` / `write_csv()` in the same way.
:::
<hr style="height:1pt; visibility:hidden;" />
### `count()` and `n()`
A common operation is to count the number of observations for each group.
The _dplyr_ package comes with two related functions that help with this.
For instance, if we wanted to check the number of countries included in the
dataset for the year 2002, we can use the `count()` function.
It takes the name of one or more columns that contain the groups we are interested in,
and we can optionally sort the results in descending order by adding `sort = TRUE`:
```{r}
gapminder |>
filter(year == 2002) |>
count(continent, sort = TRUE)
```
If we need to use the number of observations in calculations like by `summarize()`,
the `n()` function is useful.
It will return the total number of observations in the "_current group_"
as applicable ---
for instance,
if we wanted to get the standard error of the life expectancy per continent:
```{r}
gapminder |>
group_by(continent) |>
summarize(se_life = sd(lifeExp) / sqrt(n()))
```
<hr style="height:1pt; visibility:hidden;" />
### Learn more
This material was adapted from
[this Carpentries lesson episode](https://swcarpentry.github.io/r-novice-gapminder/instructor/12-dplyr.html).
In your journey to become a skilled data frame wrangler in R,
here are some additional topics that are very useful but beyond the scope of this
workshop:
- Joining/merging -- combining multiple dataframes based on one or more shared
columns. This can be done with _dplyr_'s `join_*()` functions.
- Pivoting/reshaping -- moving between 'wide' and 'long' data formats
with `pivot_wider()` and `pivot_longer()` -- this is covered in
[episode 13 of our focal Carpentries lesson](https://swcarpentry.github.io/r-novice-gapminder/instructor/13-tidyr.html).
<br>
```{r, include=FALSE, echo=FALSE}
file.remove("gapminder.tsv")
```