# Introduction to R, RStudio and R Markdown
```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, eval = TRUE)
```
In this lesson we will get a general introduction to coding in RStudio, using R Markdown, some R fundamentals such as data types and indexing, and touch on a range of coding topics that we will dive into deeper throughout the course.
## Getting to know RStudio
When you first open RStudio, it is split into 3 panels:
- **The Console** (left), where you can directly type and run code (by hitting Enter)
- **The Environment/History pane** (upper-right), where you can view the objects you currently have stored in your environment and a history of the code you've run
- **The Files/Plots/Packages/Help pane** (lower-right), where you can search for files, view and save your plots, view and manage what packages are loaded in your library and session, and get R help.
![Image Credit: Software Carpentry](https://swcarpentry.github.io/r-novice-gapminder/fig/01-rstudio.png){width="100%"}
<br>
To write and save code you use .R scripts (or R Markdown, which we will learn shortly). You can open a new script with File -\> New File or by clicking the icon with the green plus sign in the upper left corner. When you open a script, RStudio opens a fourth **'Source' panel** in the upper-left where you write and save your code. You can also send code from a script directly to the console to execute it by highlighting the **entire** code line/chunk (or placing your cursor at the end of the code chunk) and hitting CTRL+ENTER on a PC or CMD+ENTER on a Mac.
![Image Credit: Software Carpentry](https://swcarpentry.github.io/r-novice-gapminder/fig/01-rstudio-script.png)
It is good practice to add comments/notes throughout your scripts to document what the code is doing. To do this, start a line with a `#`. R ignores everything after a `#` on that line, so you can write whatever you want there. Note that R reads line by line, so if you want a comment to span multiple lines you need a `#` at the start of every line.
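As a minimal sketch of the comment rules described above (the variable name `x` is only for illustration):

```{r}
# This whole line is a comment, so R ignores it
x <- 5 + 5  # a comment can also follow code on the same line

# A longer note needs a `#`
# at the start of every line
x
```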
<hr>
## R Projects
As a first step whenever you start a new project, workflow, analysis, etc., it is good practice to set up an R project. R Projects are RStudio's way of bundling together all your files for a specific project, such as data, scripts, results, figures. Your project directory also becomes your working directory, so everything is self-contained and easily portable.
We recommend using a single R Project (i.e., contained in a single folder) for this course, so let's create one now.
You can start an R project in an existing directory or in a new one. To create a project go to File -\> New Project:
![](images/project-start.png)
Let's choose 'New Directory' then 'New Project'. Now choose a directory name. This will be both the folder name and the project name, so use proper naming conventions (**no spaces!**). We recommend naming it something course specific, like 'WR-696-2023', or even more generic 'Intro-R-Fall23'. Choose where on your local file system you want to save this new folder/project (somewhere you can find it easily), then click 'Create Project'.
Now you can see your RStudio session is working in the R project you just created. You can see the working directory printed at the top of your console is now the project directory, and in the 'Files' tab in RStudio you can see there is an .Rproj file with the same name as the R project, which will open up this R project in RStudio whenever you come back to it.
**Test out how this .Rproj file works**. Close out of your R session, navigate to the project folder on your computer, and double-click the .Rproj file.
::: {.alert .alert-info}
***What is a working directory?*** A working directory is the default location on your computer that R reads files from and saves files to. Since everyone's computer is unique, everyone's full file paths will be different. This is an advantage of working in R Projects: the working directory defaults to wherever the .Rproj file is saved on your computer, so you can use *relative* file paths and don't need to specify the full, unique path to read and write files within the project directory.
:::
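To make the difference between full and relative paths concrete, here is a small sketch. The `data/penguins.csv` file is hypothetical and does not need to exist for the code to run; it only builds the path strings.

```{r}
# the folder R currently reads from and writes to
wd <- getwd()

# a relative path: "data/penguins.csv" is a hypothetical file inside the project
rel_path <- file.path("data", "penguins.csv")

# the same location written as a full path, which differs from computer to computer
full_path <- file.path(wd, rel_path)

rel_path
full_path
```

Inside an R Project, the short relative path is all you would need to pass to a function that reads or writes the file.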
<hr>
## Write a set-up script
Let's start coding!
The first thing you do in a fresh R session is set up your environment, which mostly includes installing and loading necessary libraries and reading in required data sets. Let's open a fresh R script and save it in our root (project) directory. Call this script 'setup.R'.
### Functions
Before creating a set up script, it might be helpful to understand the use of functions in R if you are new to this programming language. R has many built-in functions to perform various tasks. To run these functions you type the function name followed by parentheses. Within the parentheses you put in your specific arguments needed to run the function.
```{r}
# mathematical functions with numbers
log(10)
# average a range of numbers
mean(1:5)
# nested functions for a string of numbers, using the concatenate function 'c'
mean(c(1,2,3,4,5))
# functions with characters
print("Hello World")
paste("Hello", "World", sep = "-")
```
### Packages
R Packages include reusable functions that are not built-in with R. To use these functions, you must install the package to your local system with the `install.packages()` function. Once a package is installed on your computer you don't need to install it again (*you will likely need to update it at some point though*). Anytime you want to use the package in a new R session you load it with the `library()` function.
**When do I use** `::` ?
If you have a package installed, you don't necessarily have to load it with `library()` to use it in your R session. Instead, you can type the package name followed by `::` and use any function in that package. This may be useful for one-off uses of a function from a specific package; however, if you will be using a package a lot throughout your workflow you will want to load it into your session. You should also use `::` in cases where you have multiple packages loaded that may have conflicting functions (e.g., `plot()` in Base R vs. `plot()` in the {terra} package).
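A short sketch of `::` in action, using two packages that ship with R so nothing needs to be installed (the object names `m` and `ext` are just for illustration):

```{r}
# {stats} ships with R; `::` calls its median() by its full name
m <- stats::median(c(1, 5, 9))
m    # 5

# {tools} is installed with R but not loaded by default,
# so `::` lets us use one of its functions without library(tools)
ext <- tools::file_ext("setup.R")
ext  # "R"
```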
#### Base R vs. The Tidyverse
You may hear us use the terms 'Base R' and 'Tidyverse' a lot throughout this course. Base R includes functions that are installed with the R software and do not require the installation of additional packages to use them. The Tidyverse is a collection of R packages designed for data manipulation, exploration, and visualization that you are likely to use in everyday data analysis, and all use the same design philosophy, grammar, and data structures. When you install the Tidyverse, it installs all of these packages, and you can then load all of them in your R session with `library(tidyverse)`. Base R and the Tidyverse have many similar functions, but many prefer the style, efficiency and functionality of the Tidyverse packages, and we will mostly be sticking to Tidyverse functions for this course.
#### Package load function
To make code reproducible (meaning anyone can run your code on their own machine), we can write a function that checks whether each necessary package is installed. If a package is missing, the function installs it and then loads it; if it is already installed, the function only loads it and does not re-install it. This function looks like:
```{r}
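packageLoad <-
  function(x) {
    # loop over each package name in the character vector x
    for (i in seq_along(x)) {
      # install only if the name is missing from the installed package list
      if (!x[i] %in% rownames(installed.packages())) {
        install.packages(x[i])
      }
      # load the package; character.only = TRUE lets us pass the name as a string
      library(x[i], character.only = TRUE)
    }
  }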
```
For each package name given ('x') it checks if it is already installed, if not installs it, and then loads that package into the session. In future lessons we will learn more about writing custom functions, and iterating with for loops, but for now you can copy/paste this function and put it at the top of your set up script. When you execute this chunk of code, you won't see anything printed in the console, however you should now see `packageLoad()` in your Environment under 'Functions'. You can now use this function as many times as you want. Test it out, and use it to install the Tidyverse package(s).
```{r warning=FALSE, results='hide'}
packageLoad('tidyverse')
```
You can also give this function a vector of package names. Let's install all the packages we will need for the first week; if you already followed the set up instructions, this will just load the packages into your session since they are already installed.
```{r}
# create a character vector of package names
packages <- c('tidyverse',
              'palmerpenguins',
              'rmarkdown')
# use the packageLoad function we created on those packages
packageLoad(packages)
```
Since this is code you will be re-using throughout your workflows, we will save it as its own script and run it at the beginning of other scripts/documents using the `source()` function as a part of our reproducible workflows.
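If you are curious which packages would actually be installed before running anything, the check can be tried on its own. This is only a sketch: {stats} and {utils} ship with every R installation and stand in here for the packages you really need, and the object names are made up for illustration.

```{r}
# stand-in package names (an assumption for illustration)
wanted <- c("stats", "utils")

# package names are the row names of the matrix returned by installed.packages()
is_installed <- wanted %in% rownames(installed.packages())

# names that would still need install.packages()
missing_pkgs <- wanted[!is_installed]
missing_pkgs
```

Because both stand-in packages come with R, `missing_pkgs` is empty here; with your own list it would hold the names still to be installed.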
<hr>
## R Markdown
Throughout this course you will be working mostly in R Markdown documents. R Markdown is a notebook style interface integrating text and code, allowing you to create fully reproducible documents and render them to various elegantly formatted static or dynamic outputs (which is how you will be submitting your assignments).
You can learn more at the R Markdown website, which has really informative lessons on the [Getting Started](https://rmarkdown.rstudio.com/lesson-1.html) page and you can see the range of outputs you can create at the [Gallery](https://rmarkdown.rstudio.com/gallery.html) page.
### What About Quarto?
Some of you may have heard of Quarto, which is essentially an extension of R Markdown but it lives as its own software to allow its use in other languages such as Python, Julia and Observable. You can install the Quarto CLI on its own and RStudio will detect it so you can create documents within the IDE, or alternatively with newer versions of RStudio a version of Quarto is built-in and you can enable Quarto through the R Markdown tab in Global Options. R Markdown isn't going anywhere, however many in the data science realm are switching to Quarto. Quarto documents are very similar to R Markdown, in fact Quarto can even render R Markdown documents, so after learning R Markdown in this course you should have some of the fundamental skills to easily switch to Quarto if you want to. You can read more about Quarto [here](https://quarto.org/).
### Getting started with R Markdown
Let's create a new document by going to File -\> New File -\> R Markdown. You will be prompted to add information like title and author, fill those in (let's call it "Intro to R and R Markdown") and keep the output as HTML for now. Click OK to create the document.
This creates an outline of an R Markdown document. The title, author and date you entered in the prompt appear at the top of the document in what is called the YAML header.
Notice that the file contains three types of content:
- An (optional) YAML header surrounded by `---`s
- R code chunks surrounded by ```` ``` ````s
- text mixed with simple text formatting
Since this is a notebook style document, you run the code chunks by clicking the green play button in the top right corner of each code chunk, and then the output is returned directly below the chunk.
::: {.alert .alert-info}
If you'd rather have the code chunk output go to the console instead of directly below the chunk in your R Markdown document, go to Tools -\> Global Options -\> R Markdown and uncheck "Show output inline for all R Markdown documents"
:::
When you want to create a report from your notebook, you render it by hitting the 'knit' button at the top of the Source pane (with the ball of yarn next to it), and it will render to the format you have specified in the YAML header. In order to do so though, you need to have the {rmarkdown} package installed.
You can delete the rest of the code/text below the YAML header, and insert a new code chunk at the top. **You can insert code chunks by clicking the green C with the '+' sign at the top of the source editor, or with the keyboard shortcut (Ctrl+Alt+I for Windows, Option+Command+I for Macs)**. For the rest of the lesson (and course) you will be writing and executing code through code chunks, and you can type any notes in the main body of the document.
The first chunk is almost always your set up code, where you read in libraries and any necessary data sets. Here we will execute our set up script to install and load all the libraries we need:
```{r eval=FALSE}
source("setup.R")
```
<hr>
## Explore
Normally when working with a new data set, the first thing we do is explore the data to better understand what we're working with. To do so, you also need to understand the fundamental data types and structures you can work with in R.
### The `penguins` data
For this intro lesson, we are going to use the Palmer Penguins data set (which is loaded with the {palmerpenguins} package you installed in your set up script). This data was collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pallter.marine.rutgers.edu/), a member of the [Long Term Ecological Research Network](https://lternet.edu/).
Load the `penguins` data set.
```{r eval=TRUE}
data("penguins")
```
You now see it in the Environment pane. Print it to the console to see a snapshot of the data:
```{r}
penguins
```
<hr>
### Data Types
This data is structured as a data frame, probably the most common data structure and the one you are most familiar with. Data frames are like Excel spreadsheets: tabular data organized by rows and columns. However, we see at the top that this is called a `tibble`, which is just a fancy kind of data frame specific to the Tidyverse.
At the top we can see the data type of each column. There are five main data types:
- **character**: `"a"`, `"swc"`
- **numeric**: `2`, `15.5`
- **integer**: `2L` (the `L` tells R to store this as an integer)
- **logical**: `TRUE`, `FALSE`
- **complex**: `1+4i` (complex numbers with real and imaginary parts)
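You can check each of these types yourself with `class()`. A quick sketch using the example values from the list above:

```{r}
class("swc")   # "character"
class(15.5)    # "numeric"
class(2L)      # "integer"
class(TRUE)    # "logical"
class(1+4i)    # "complex"

# without the L, a whole number is still stored as numeric
class(2)       # "numeric"
```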
Data types are combined to form data structures. R's basic data structures include:
- atomic vector
- list
- matrix
- data frame
- factors
You can see the data type or structure of an object using the `class()` function, and get more specific details using the `str()` function. (Note that 'tbl' stands for tibble).
```{r}
class(penguins)
str(penguins)
```
```{r}
class(penguins$species)
str(penguins$species)
```
When we pull one column from a data frame like we just did above using the `$` operator, that returns a vector. Vectors are 1-dimensional, and must contain data of a single data type (i.e., you cannot have a vector of both numbers and characters).
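To see the single-type rule in action, try combining different types in one vector. In this sketch (the name `mixed` is just for illustration), R does not throw an error; it quietly converts every element to a common type:

```{r}
# a number, a character and a logical in one vector
mixed <- c(1, "apple", TRUE)

mixed         # "1" "apple" "TRUE"
class(mixed)  # "character"
```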
If you want a 1-dimensional object that holds mixed data types and structures, that would be a list. You can put together pretty much anything in a list.
```{r}
myList <- list("apple", 1993, FALSE, penguins)
str(myList)
```
You can even nest lists within lists:
```{r}
list(myList, list("more stuff here", list("and more")))
```
You can use the `names()` function to retrieve or assign names to list and vector elements:
```{r}
names(myList) <- c("fruit", "year", "logic", "data")
names(myList)
```
### Indexing
Indexing is an extremely important aspect of data exploration and manipulation. In fact you already started indexing when we looked at the data type of individual columns with `penguins$species`. How you index is dependent on the data structure.
Index lists:
```{r}
# for lists we use double brackets [[]]
myList[[1]] # select the first stored object in the list
myList[["data"]] # select the object in the list named "data" (a data frame)
```
Index vectors:
```{r}
# for vectors we use single brackets []
myVector <- c("apple", "banana", "pear")
myVector[2]
```
Index data frames:
```{r eval=FALSE}
# dataframe[row(s), column(s)]
penguins[1:5, 2]
penguins[1:5, "island"]
penguins[1, 1:5]
penguins[1:5, c("species","sex")]
penguins[penguins$sex=='female',]
# $ for a single column
penguins$species
```
::: {.alert .alert-info}
To extract a single element of a list, use double brackets `[[ ]]` (single brackets on a list return a smaller list); to index vectors and data frames, use single brackets `[ ]`.
:::
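The difference between the two bracket styles is easiest to see side by side. A small sketch with a made-up list (`exampleList` and the other object names are for illustration only):

```{r}
exampleList <- list(fruit = "apple", year = 1993)

# single brackets keep the list wrapper: the result is a list of length 1
asList <- exampleList["fruit"]
class(asList)     # "list"

# double brackets pull out the stored object itself
asElement <- exampleList[["fruit"]]
class(asElement)  # "character"

# for named elements, $ behaves like double brackets
exampleList$year  # 1993
```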
#### Exercises
(*not required, but work through them if you want!*)
1. Why don't the following lines of code work? Tweak each one so the code runs
```{r eval=FALSE}
myList["Fruit"]
```
```{r eval=FALSE}
penguins$flipper_lenght_mm
```
```{r eval=FALSE}
penguins[island=='Dream',]
```
2. How many species are in the `penguins` data set? On which islands were the data collected? (Note: the `unique()` function might help)
3. Use indexing to create a new data frame that has only 3 columns: species, island and flipper length columns, and subset all rows for just the 'Dream' island.
4. Use indexing and the `mean()` function to find the average flipper length for the *Adelie* species on Dream island. (Note: explore the `mean()` function and how to deal with NA values).
<hr>
### The {dplyr} package
So far the code you've been writing has consisted of Base R functionality. Now let's dive into the Tidyverse with the {dplyr} package.
{dplyr} is a Tidyverse package to handle most of your data exploration and manipulation tasks. Now that you have learned indexing, the first two {dplyr} functions you are going to learn will look familiar: `filter()` and `select()` act as indexing functions by subsetting rows and columns based on specified names and/or conditions.
**Subset rows with [`filter()`](https://dplyr.tidyverse.org/reference/filter.html)**
You can filter data in many ways using comparison operators (`>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal)), the logical operators AND (`&`), OR (`|`), and NOT (`!`), and other operations such as `%in%`, which returns everything that matches at least one of the values in a given vector, and `is.na()` and `!is.na()` to return all missing or all non-missing data.
```{r eval=FALSE}
filter(penguins, species == "Adelie")
filter(penguins, species != "Adelie")
filter(penguins, island %in% c("Dream", "Torgersen") & !is.na(bill_length_mm))
```
**Note: Tidyverse package functions take in column names *without* quotation marks.**
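It can help to see what `%in%` and `is.na()` return on their own, outside of `filter()`. In this sketch on a plain vector (the vector `islands` and the name `keep` are made up for illustration), each operator returns one `TRUE`/`FALSE` per element, and `filter()` keeps the rows where the result is `TRUE`:

```{r}
islands <- c("Dream", "Biscoe", NA, "Torgersen")

# TRUE where the element matches any value in the given vector
islands %in% c("Dream", "Torgersen")  # TRUE FALSE FALSE TRUE

# TRUE where the element is missing, and the reverse with !
is.na(islands)                        # FALSE FALSE TRUE FALSE
!is.na(islands)                       # TRUE TRUE FALSE TRUE

# combine conditions with & and use the result to subset
keep <- islands %in% c("Dream", "Torgersen") & !is.na(islands)
islands[keep]                         # "Dream" "Torgersen"
```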
::: {.alert .alert-info}
Using {dplyr} functions will not manipulate the original data, so if you want to save the returned object you need to assign it to a new variable.
:::
**Select columns with [`select()`](https://dplyr.tidyverse.org/reference/select.html){style="font-size: 13pt;"}**
`select()` has many helper functions you can use with it, such as `starts_with()`, `ends_with()`, `contains()` and many more that are very useful when dealing with large data sets. See `?select` for more details.
::: {.alert .alert-info}
Writing out `?` ahead of any function from a package will open a description of that function in the "Help" pane.
:::
```{r eval=FALSE}
# Select two specific variables
select(penguins, species, sex)
# Select a range of variables
select(penguins, species:flipper_length_mm)
# Rename columns within select
select(penguins, genus = species, island)
# Select column variables that are recorded in mm
select(penguins, contains("mm"))
```
**Create new variables with [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.html){style="font-size: 13pt;"}**
```{r eval = FALSE}
# New variable that calculates bill length in cm
mutate(penguins, bill_length_cm = bill_length_mm/10)
# mutate based on conditional statements
mutate(penguins, species_sex = if_else(sex == 'male', paste0(species,"_m"), paste0(species, "_f")))
```
*Notice the use of `paste0()` here, and when we briefly used a similar function `paste()` in the 'Functions' section above. Explore the difference between these two. They are both very useful functions for pasting strings together.*
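As a starting point for that exploration, here is a minimal sketch comparing the two on the same inputs (the object names are just for illustration):

```{r}
# paste() puts a space between its inputs by default
with_space <- paste("Adelie", "m")
with_space       # "Adelie m"

# paste0() joins its inputs with nothing in between
no_separator <- paste0("Adelie", "_m")
no_separator     # "Adelie_m"

# paste() with an explicit separator
with_underscore <- paste("Adelie", "m", sep = "_")
with_underscore  # "Adelie_m"
```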
[**`group_by()`**](https://dplyr.tidyverse.org/reference/group_by.html) **and [`summarise()`](https://dplyr.tidyverse.org/reference/summarise.html)**
These can all be used in conjunction with `group_by()` which changes the scope of each function from operating on the entire data set to operating on it group-by-group. `group_by()` becomes even more powerful when used along with `summarise()` to calculate some specified summary statistic for each group. However before we start using multiple operations in conjunction with one another, we need to talk about the pipe operator `%>%`.
#### The pipe `%>%`
The pipe, `%>%`, comes from the **magrittr** package by Stefan Milton Bache. Packages in the Tidyverse load `%>%` for you automatically, so you don't usually load {magrittr} explicitly. Pipes are a powerful tool for clearly expressing a sequence of multiple operations.
For example, the pipe operator can take this sequence of operations:
```{r eval=FALSE}
df1 <- filter(penguins, island == "Dream")
df2 <- mutate(df1, flipper_length_cm = flipper_length_mm/10)
df3 <- select(df2, species, year, flipper_length_cm)
print(df3)
```
And turn it into this, removing the need to create intermediate variables:
```{r eval = FALSE}
penguins %>%
  filter(island == "Dream") %>%
  mutate(flipper_length_cm = flipper_length_mm/10) %>%
  select(species, year, flipper_length_cm)
```
You can read it as a series of imperative statements: filter, then mutate, then select. A good way to pronounce `%>%` when reading code is "and then". It takes the output of the operation to the left of `%>%` and feeds it into the next function as the input.
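To see that the pipe only changes how the code is written and not what it computes, here is a small sketch on a plain numeric vector. It loads {magrittr} directly (an assumption that it is installed, which it is whenever the Tidyverse is installed), and the object names are made up for illustration.

```{r}
library(magrittr)

values <- c(1.5, 2.5, 4.1)

# nested version: read from the inside out
nested <- round(mean(values), 1)

# piped version: read left to right, "take values, and then mean, and then round"
piped <- values %>% mean() %>% round(1)

identical(nested, piped)  # TRUE
```

In the piped version the left-hand side becomes the first argument of the next function, so `round(1)` is run as `round(<the mean>, 1)`.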
Say you want to summarize data by some specified group, for example you want to find the average body mass for each species, this is where the `group_by()` function comes into play.
```{r eval = FALSE}
penguins %>%
  group_by(species) %>%
  summarise(body_mass_avg = mean(body_mass_g, na.rm = TRUE))
```
Or get a count of how many individuals were observed for each species each year:
```{r eval=FALSE}
penguins %>%
  group_by(species, year) %>%
  summarise(n_observations = n())
```
You can even shorten the above operation by using `count()` instead of `group_by()` and `summarise()`.
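A sketch of that shortcut on a tiny, made-up data frame (`sightings` is hypothetical and is used instead of `penguins` so the counts are easy to check by eye):

```{r}
library(dplyr)

# hypothetical observations: one row per individual seen
sightings <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Adelie", "Gentoo"),
  year    = c(2007, 2007, 2007, 2008, 2008)
)

# the long way: group, then count rows in each group
by_summarise <- sightings %>%
  group_by(species, year) %>%
  summarise(n = n(), .groups = "drop")

# the short way: count() does the grouping and counting in one step
by_count <- sightings %>%
  count(species, year)

by_count
```

Both versions return one row per species and year combination, with the number of rows in each group stored in a column called `n`.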
#### Exercises
(*not required, but useful if you want to work through them!*)
1. Reorder the variables in `penguins` so that `year` is the first column followed by the rest (Hint: look into the use of `everything()`).
2. Create a new column called 'size_group' where individuals with body mass greater than the overall average are called 'large' and those smaller are called 'small'.
3. Find out which year for each species had the largest average body mass.
4. You want to filter data for years that are *not* in a vector of given years, but this code doesn't work. Tweak it so that it does. (Yes, you could just filter year to equal 2007 in this case but there is a troubleshooting lesson here).
```{r eval = FALSE}
penguins %>%
filter(year !%in% c(2008, 2009))
```
<hr>
## Read and Write Data
We used an R data package today to read in our data frame, but that probably isn't how you will normally read in your data.
There are many ways to read and write data in R. To read in .csv files, you can use `read_csv()`, which is included in the Tidyverse with the {readr} package, and to save .csv files use `write_csv()`. The {readxl} package is great for reading in Excel files; it is installed along with the Tidyverse, but `library(tidyverse)` does not load it, so you need to load it separately with `library(readxl)`.
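As a minimal sketch of a write-then-read round trip with {readr}: the small data frame and the temporary file below are made up for illustration, and in your own project you would use a relative path inside the project folder instead of `tempfile()`.

```{r}
library(readr)

# a small, hypothetical data frame to save
small <- data.frame(
  species     = c("Adelie", "Gentoo"),
  body_mass_g = c(3750, 5000)
)

# write to a temporary .csv file
csv_path <- tempfile(fileext = ".csv")
write_csv(small, csv_path)

# read it back in; show_col_types = FALSE hides the column type message
small_again <- read_csv(csv_path, show_col_types = FALSE)
small_again
```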