-
Notifications
You must be signed in to change notification settings - Fork 129
/
Copy pathR_Intro_Lesson.Rmd
953 lines (681 loc) · 36.6 KB
/
R_Intro_Lesson.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
---
title: "Introduction to R using Yeast RNAseq Metadata"
author: ANGUS
date: June 2019
output:
md_document:
variant: markdown_github
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
# This will set the default knitr options
knitr::opts_chunk$set(eval = FALSE, echo = TRUE, cache = FALSE, include = TRUE,
warning = FALSE, collapse = FALSE, message = FALSE, error = FALSE)
```
# Getting started
Start up a Jetstream m1.medium or larger
[as per Jetstream startup instructions](jetstream/boot.html).
---
### Using RStudio on Jetstream
We will be using R and RStudio on our cloud instances. We have pre-installed RStudio on the workshop image. Start up a Jetstream m1.medium or larger using the ANGUS 2019 image
[as per Jetstream startup instructions](jetstream/boot.html). Connect to your instance.
We will need a password to connect to RStudio. We can set the password to our instance by running the following commands in bash (the password will not be visible on the screen):
```{bash eval = FALSE}
sudo passwd $USER
```
We will also need our instance username to connect to RStudio. To determine what your username is, run:
```{bash eval=FALSE}
echo My username is $USER
```
Lastly, find the RStudio server interface Web address (by default in port 8787) by running the following command:
```{bash eval = FALSE}
echo http://(hostname):8787/
```
Now go to the web address printed to your terminal in your Web browser and log in with the username and password that you set above. You should see something that looks like:
![](_static/RStudio.png)
where you type your username and the password that you have set.
### Installing on your own computer:
We will be using R and RStudio in the cloud. To use RStudio on your computer, you can download R and RStudio. If you do not have R and RStudio installed:
- Download R from CRAN:https://cran.r-project.org/
- Go to RStudio download page and install the version that is compatible with your operating system: https://www.rstudio.com/products/rstudio/download/#download
This [link](https://datacarpentry.org/R-ecology-lesson/) provides futher instructions on
installing and setting up R and RStudio on your computer.
## Introduction to R and RStudio
R is a programming language and free software environment for statistical computing and
graphics. Many bioinformatics programs are implemented in R. RStudio is a graphical
integrated development environment (IDE) that displays and organizes more information
about R. R is the underlying programming language. Once you start RStudio, you see this
layout (it will be similar either on your instance or on your computer):
![](_static/RStudio_first.png)
Scripts are like a digital lab book. They record everything your run, as well as notes to
yourself about how and why you ran it. To start a new script you can click on the icon
that looks like a page with a '+' underneath File and select -->R script. This generates a
`.R` file. `.R` files are plain text files that use the file extension `.R` so that humans
can remember that the file contains R code.
![](_static/RStudio_create_R_script.png)
Although scripts are a record of our code, they also include comments about our code. These comments are very important as they remind your future self (and whoever you pass your script) why you did what you did. Think of comments like the notes you take in your lab book when you do experiments.
![](_static/RStudio_type_script.png)
When you execute a line of your script, it is sent to the console. While the script is a
permanent record of everything you ran, the console is the engine that runs those things.
You can run code from a script or the console, but it is only recorded in scripts.
There are two additional panes in RStudio:
1. **Environment/History/Connections pane**
- **Environment Pane**: This pane records the objects that are currently loaded in our RStudio environment. This includes any data object that we created and the basic information about them.
- **History Pane**: This has output similar to the `history` command that we used in the shell. It records all of the commands that we have executed in the R console.
2. **Files/Plots/Packages/Help/Viewer** pane provides
a space for us to interact with plots and files we create and with built-in help files in R.
- **Packages Pane**: A place to organize the packages within RStudio. Allows easy installation or loading (`library()`) of the packages we have.
And now we're ready to start the fun part!
## Basic concepts
### Objects
An R object is something that you create and assign a value or function to. You do this using the *assignment operator* which is `<-` (NOTE: shortcut Alt + -). For example you can assign a *numeric* value to an object by typing in your script (remember to comment what you are doing):
```{r}
a <- 34 # assign the number 34 to the object a
```
You can also assign a *character* value to an object:
```{r}
b <- 'mouse' # assign the word 'character' to the object b
```
To run a line of code, put your cursor on the line that you want to run and press `Enter+Cmd` or `Enter+Ctrl`. Or select the line and click the button that says `Run` on the right top corner of your script.
![](_static/first_lines.png)
Now check your Console and Environment tabs, you will see that the commands have run and that you have created new objects!
You can do all sorts of stuff with your objects: operations, combinations, use them in functions. Below we show how to perform multiplication using an object. Because `a` stores the value `34`, we can multiply it by 2:
```{r}
a2 <- a*2 # multiply a by 2
```
### Functions and Arguments
Functions are “canned scripts” that automate more complicated sets of commands including operations assignments, etc. Many functions are predefined, or can be made available by importing R packages (more on that later). A function usually takes one or more inputs called arguments. Functions often (but not always) return a value. A typical example would be the function `sqrt()`. The input (the argument) must be a number, and the return value (in fact, the output) is the square root of that number. Executing a function (‘running it’) is called calling the function. An example of a function call is:
```{r}
sqrt(a) ## calculate the square root of a
```
Here, the value of `a` is given to the `sqrt()` function, the `sqrt()` function calculates the square root, and returns the value which is then assigned to the object `b`. This function is very simple, because it takes just one argument.
The return ‘value’ of a function need not be numerical (like that of `sqrt()`), and it also does not need to be a single item: it can be a set of things, or even a dataset. We’ll see that when we read data files into R.
Similar to the shell, **arguments** can be anything, not only numbers or filenames, but also other objects. Exactly what each argument means differs per function, and must be looked up in the documentation (see below). Some functions take arguments which may either be specified by the user, or, if left out, take on a default value: these are called options. **Options** are typically used to alter the way the function operates, such as whether it ignores ‘bad values’, or what symbol to use in a plot. However, if you want something specific, you can specify a value of your choice which will be used instead of the default.
Let’s try a function that can take multiple arguments: `round()`.
```{r}
round(3.141592) ## round number
```
Arguments are the 'conditions' of a function. For example, the function `round()` has an argument to determine the number of decimals, and its default in R is 0. Here, we’ve called round() with just one argument, 3.14159, and it has returned the value 3. That’s because the default is to round to the nearest whole number. If we want more digits we can see how to do that by getting information about the round function. We can use args(round) to find what arguments it takes, or look at the help for this function using ?round.
```{r}
args(round) ## show the arguments of the function 'round'
?round ## shows the information about 'round'
```
We can change this to two decimals by changing the `digits` argument:
```{r eval = FALSE}
round(3.141592, digits = 2)
```
Note that we do not need to specify the name of the argument as R detects this by the position within the function:
```{r eval = FALSE}
round(3.141592, 2)
```
Though, it is preferablt to show the argument names. Plus, we have the added benefit of having the ability to change the order of the arguments if the arguments are named:
```{r eval = FALSE}
round(digits = 2, x= 3.141592)
```
It's best practice to put the non-optional arguments (the data that you want the function
to work on) first, and then specify the names of all optional arguments. This helps other
to interpret your code more easily.
### Vectors
A vector is the most common and basic data type in R, and is pretty much the workhorse of R. A vector is composed by a series of values, which can be either numbers or characters. We can assign a series of values to a vector using the c() function.
```{r}
yeast_strain <- c('WT','snf2')
concentration <- c(43.903, 47.871)
```
```{r eval = TRUE, echo = FALSE}
# To be able to load challenge question
yeast_strain <- c('WT','snf2')
concentration <- c(43.903, 47.871)
```
There are many functions to inspect a vector and get information about its contents:
```{r eval = FALSE}
length(yeast_strain) # number of elements in a vector
class(concentration) # type of object b
str(yeast_strain) # inspect the structure of a
```
### Data types
An **atomic vector** is the simplest R data type, and it is a vector of a single type. The 6 main atomic vector types that R uses are:
- **`'character'`** letters or words
- **`'numeric'`** (or `double`) for numbers
- **`'logical'`** for `TRUE` and `FALSE` (booleans)
- **`'integer'`** for integer numbers (e.g., `2L`, the `L` indicates to R that it’s an integer)
- **`'complex'`** to represent complex numbers with real and imaginary parts, for example `2 + 4i`
- **`'raw'`** for bitstreams that we will not use or discuss further
- **`factor`** represents categorical data (more below!)
You can check the type of vecor you are using with the `typeof()` function.
```{r}
typeof(concentration)
```
**Challenge**: Atomic vectors can be a character, numeric (or double), integer, and logical. But what happens if we try to mix these types in a single vector?
<details>
<summary><strong><font color="#6B33FF">Solution</font></strong></summary>
```{r challenge, eval = TRUE}
yeast_strain_concentration <- c(yeast_strain, concentration) # combine two vectors
yeast_strain_concentration
# Solution: All indices are "coerced" to the same "type", e.g. character.
```
</details>
As we saw in our challenge problem, objects of different types get converted into a single type. This is called **coercion** and it follows a hierarchy:
logical → numeric → character ← logical
R uses many **data structures**, the main ones are:
- **`vector`** contain elements of the same type in one dimension
- **`list`** can contain elements of different types and multiple data structures
- **`data.frame`** contains rows and columns. Columns can contain different types from other columns.
- **`matrix`** with rows and columns that are of the same type
- **`array`** stores data in more than 2 dimensions
We will use `vectors`, `matrices`, `data.frames`, and `factors` today.
### Subsetting
We can extract values from a vector based on position or conditions: this is called **subsetting**. One way to subset in R is to use `[]`. Indices start at 1.
```{r eval = FALSE}
strains <- c('WT', 'snf2', 'snf1', 'snf1_snf2')
strains[2] # subsets the second element
strains[c(2,3)] # subsets the second and third elements
more_strains <- strains[c(1, 2, 3, 1, 4, 3)] # to create an object with more elements than tha original one
more_strains
```
#### Conditional subsetting
We can also subset based on **conditions**, using a logical vector with `TRUE` and `FALSE` or with *operators*:
```{r eval = FALSE}
strains[c(TRUE, FALSE, FALSE, TRUE)]
concentration >= 43
concentration > 40
concentration < 47
concentration <= 47
concentration[ concentration < 45 | concentration > 47] # | is OR
concentration[ concentration < 45 & concentration == 47] # & is AND , == is mathematical 'equal'
more_strains[more_strains %in% strains]
```
### Missing data
As R was designed to analyze datasets, it includes the concept of missing data (which is
uncommon in other programming languages). Missing data are represented in vectors as NA.
There are in-built functions and arguments to deal with missing data. For example, this
vector contains missing data:
```{r eval = FALSE}
concentration <- c(43.903, 47.871, NA, 48.456, 53.435)
```
We can use the following functions to remove or ignore it:
```{r}
mean(concentration, na.rm = TRUE) # mean of concentration removing your NA
concentration[!is.na(concentration)] # extract all elements that are NOT NA
na.omit(concentration) # returns a 'numeric' atomic vector with incomplete cases removed
concentration[complete.cases(concentration)] # returns a 'numeric' atomic vector with elements that are complete cases
```
Note, if your data includes missing values, you can code them as blanks (""), or as "NA"
before reading them into R.
### Factors
Factors represent categorical data. They are stored as integers associated with labels and they can be ordered or unordered. While factors look (and often behave) like character vectors, they are actually treated as integer vectors by R. So you need to be very careful when treating them as strings.
```{r eval = FALSE}
strains <- factor(c('WT', 'snf2', 'snf1', 'snf1_snf2')) # We make our strains object a factor
class(strains)
nlevels(strains)
strains # R orders alphabetically
```
Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts `levels` in alphabetical order. We can reorder them using the
`levels` argument.
```{r eval=FALSE}
strains <- factor(strains, levels = c('WT', 'snf1', 'snf2', 'snf1_snf2')) # we reorder them as we want them
strains
```
We convert objects using `as.XXX` functions:
```{r eval = FALSE}
strains <- as.character(strains)
class(strains)
strains <- as.factor(strains)
class(strains)
```
And we can also **rename factors** based on position:
```{r eval = FALSE}
levels(strains)[4] <- 'wild_type'
```
## Starting with tabular data
This tutorial is a continuation from our *Salmon* lesson -- we are using sequencing
data European Nucleotide Archive dataset PRJEB5348. This is an RNA-seq experiment using
comparing two yeast strains, *SFF2* and WT. Although we're only working with 6 samples,
this was a much larger study. They sequenced 96 samples in 7 different lanes, with 45 wt
and 45 mut strains. Thus, there are 45 biological replicates and 7 technical replicates
present, with a total of 672 samples! The datasets were compared to test the
underlying statistical distribution of read counts in RNA-seq data, and to test which
differential expression programs work best.
We will get the information about the RNA quality and quantity that the authors measured
for all of their samples, and we will explore how the authors made their data avaialble
and 'linkable' between technical and biological replicates.
We are going to explore those datasets using RStudio, combining a both base R and an
umbrella package called *tidyverse*.
```{r echo = TRUE}
library('tidyverse')
```
```{r eval = TRUE, echo = FALSE}
library('tidyverse')
```
First we are going to read the files from a url and assign that to experiment_info,
using `read_tsv()`:
```{r}
experiment_info <- read_tsv(file = 'https://osf.io/pzs7g/download/')
```
Let's investigate your new data using some exploratory functions:
```{r eval = FALSE}
class(experiment_info) # type of object that sample_details is
dim(experiment_info) # dimensions
summary(experiment_info) # summary stats
head(experiment_info) # shows the first 6 rows and all columns
tail(experiment_info) # last 6 rows and all columns
str(experiment_info) # structure
```
What do you notice? The last 4 columns are empty! We can get rid of them by subsetting, and having a look that we have done things correctly:
```{r eval = FALSE}
sum(is.na(experiment_info$X10)) # checks how many NAs the column X10 has so that you know all your rows are NAs (or not!)
experiment_info <- experiment_info[ , 1:9] # subset all rows and columns 1 to 9
```
**Challenge:** How many columns are in the `experiment_info` data frame now?
<details>
<summary><strong><font color="#6B33FF">Solution</font></strong></summary>
```{r}
dim(experiment_info) # dimensions
```
</details>
The 9th column has no name, but we can change this using `rename()`:
```{r eval = FALSE}
experiment_info <- rename(experiment_info, units = X9)
```
### Manipulating data with **dplyr**
We can now explore this table and reshape it using `dplyr`. We will cover in this lesson:
- ` %>% `: pipes, they allow 'concatenating' functions
- `select()`: subsets columns
- `filter()`: subsets rows on conditions
- `mutate()`: create new columns using information from other columns
- `group_by()` and `summarize()`: create summary statistics on grouped data
- `arrange()`: sort results
- `count()`: counts discrete values
### Selecting columns and filtering rows
We have now a table with many columns -- but what if we don't need all of the columns?
We can select columns using `select`, which requires the names of the colums that you want to keep as *arguments*:
```{r eval = FALSE}
# Let's confirm the arguments by looking at the help page
?select # Notice how the help page is different from the shell manual pages?
# Now lets select specific columns
select(experiment_info, Sample, Yeast Strain, Nucleic Acid Conc., Sample, 260/280, Total RNA)
```
What happened? R cannot parse the spaces in the column names properly. We can add back commas around the `name of the column`. These tell R that `these words belong together`:
```{r}
select(experiment_info, Sample, `Yeast Strain`, `Nucleic Acid Conc.`, 260/280, `Total RNA`)
```
Also, R gets confused when column names start with a number. So, we will also need to specify where the numbers start and stop:
```{r eval = FALSE}
select(experiment_info, Sample, `Yeast Strain`, `Nucleic Acid Conc.`, `260/280`, `Total RNA`)
```
So let's store that in an object named `experiment_info`:
```{r eval = FALSE}
experiment_info <- select(experiment_info, Sample, `Yeast Strain`, `Nucleic Acid Conc.`, A260, A280, `260/280`, `Total RNA`)
```
**NOTE**: When choosing names from columns, make sure that you are consistent and choose something simple, all lowercase, all uppercase, or CamelCase. Having spaces in column makes it hard for R to deal with, and can lead to unexpected results or silent errors.
We select all columns *except for* by using `-` before the name of the columns we do not want to include, for example all columns except for A260 and A280:
```{r eval = FALSE}
# Remove two columns: A260 and A280
select(experiment_info, -A260, -A280)
```
The `filter` function works on rows, so we can also `filter` the data based on conditions in a column:
```{r eval = FALSE}
filter(experiment_info, `Nucleic Acid Conc.` > 1500)
```
```{r eval = TRUE, echo = FALSE}
# To allow ggplots below to match that of the learners for the exercises
experiment_info <- read_tsv(file = 'https://osf.io/pzs7g/download/')
experiment_info <- experiment_info[ , 1:9] # subset all rows and columns 1 to 9
experiment_info <- rename(experiment_info, units = X9)
experiment_info <- select(experiment_info, Sample, `Yeast Strain`, `Nucleic Acid Conc.`, A260, A280, `260/280`, `Total RNA`)
```
### Exercise 1
>Select the columns `Sample`, `Yeast Strain`, `A260` and `A280` and assign that to a new object called `experiment_data`.
<details>
<summary><strong><font color="#6B33FF">Solution-#1</font></strong></summary>
```{r exercise1, eval = TRUE, foldcode=TRUE}
experiment_data <- select(experiment_info, Sample, `Yeast Strain`, A260, A280)
head(experiment_data)
```
</details>
### Exercise 2
>Filter rows for only WT strains that had more than 1500 ng/uL concentration and make a new tibble called wild_type.
<details>
<summary><strong><font color="#6B33FF">Solution-#2</font></strong></summary>
```{r exercise2, eval = TRUE, foldcode=FALSE}
wild_type <- filter(experiment_info, `Yeast Strain` %in% 'WT' & `Nucleic Acid Conc.` > 1500)
head(wild_type)
```
</details>
### Pipes
What if we want to `select` and `filter` at the same time? We use a "pipe" in R, which is represented by `%>%`. **Note:** This is the R version of the shell command `|`!
Pipes are available via the `magrittr` package, installed automatically with `dplyr`. If you use RStudio, you can type the pipe with `Ctrl + Shift + M` if you have a PC or `Cmd + Shift + M` if you have a Mac as a shortcut to get to produce a pipe.
We can combine the two subsetting activities from Exercise 1 and 2 using pipes `%>%` :
```{r eval = FALSE}
experiment_info_wt <- experiment_info %>%
filter(`Yeast Strain` == 'WT' & `Nucleic Acid Conc.` > 1500) %>%
select(Sample, `Yeast Strain`, A260, A280)
```
### Exercise 3
>Using pipes, subset experiment_info to include all the samples that had a concentration less than 1500 and where total RNA was more or equal than 40 ug and retain only the columns that tell you the sample, yeast strain, nucleic acid concentration and total RNA. Assign that to a new table called samples_sequenced.
<details>
<summary><strong><font color="#6B33FF">Solution-#3</font></strong></summary>
```{r exercise3, eval = TRUE}
samples_sequenced <- experiment_info %>%
filter(`Nucleic Acid Conc.` < 1500 & `Total RNA` >= 40) %>%
select(Sample, `Yeast Strain`,`Nucleic Acid Conc.`,`Total RNA`)
samples_sequenced
```
</details>
### Mutate
What if I want to create a new column? I use the function `mutate` for this. Let's convert our units into micrograms/microliter:
```{r eval = FALSE}
experiment_info %>%
mutate(concentration_ug_uL = `Nucleic Acid Conc.` / 1000)
```
Or create more than one column! And pipe that into head so that we have a peep:
```{r eval = FALSE}
experiment_info %>%
mutate(concentration_ug_uL = `Nucleic Acid Conc.` / 1000,
half_concentration = concentration_ug_uL / 1000) %>%
head()
```
### Exercise 4
>Create a new table called library_start that includes the columns sample, yeast strain and two new columns called RNA_100 with the calculation of microliters to have 100ng of RNA and another column called water that says how many microliters of water we need to add to that to reach 50uL.
<details>
<summary><strong><font color="#6B33FF">Solution-#4</font></strong></summary>
```{r exercise4, eval = TRUE}
library_start <- experiment_info %>%
mutate(RNA_100 = 100/ `Nucleic Acid Conc.`,
water = 50 - RNA_100) %>%
select(Sample, `Yeast Strain`, RNA_100, water)
head(library_start)
```
</details>
### Exercise 5
>Pretty difficult to pipette! Can redo the table considering a 1:10 dilution of your samples? Include the columns of A260 and A280 values in addition to sample, yeast strain, RNA_100 and water.
<details>
<summary><strong><font color="#6B33FF">Solution-#5</font></strong></summary>
```{r exercise5, eval = TRUE}
library_start_diluted <- experiment_info %>%
mutate(diluted_RNA = 10/ `Nucleic Acid Conc.`,
water = 50 - diluted_RNA) %>%
select(Sample, `Yeast Strain`, `A260`, `A280`, diluted_RNA, water)
head(library_start_diluted)
```
</details>
### Exercise 6
>Based on the tibble library_start_diluted, create a new tibble called `seq_samples` that includes only samples with a A260/A280 ratio of 2.2 or more and that only contains the columns for sample, yeast strain, A260280 ratio, diluted RNA and water.
<details>
<summary><strong><font color="#6B33FF">Solution-#6</font></strong></summary>
```{r exercise6, eval = TRUE}
seq_samples <- library_start_diluted %>%
mutate(A260_A280 = `A260`/`A280`) %>%
filter(A260_A280 >= 2.2) %>%
select(Sample, `Yeast Strain`, A260_A280, diluted_RNA, water)
head(seq_samples)
```
</details>
### Split-apply-combine
The approach of split-apply-combine allows you to *split* the data into groups, *apply* a function/analysis and then *combine* the results. We can do this usign `group_by()` and `summarize()`.
`group_by()` takes as *argument* the column names that contain categorical variables that we want to calculate summary stats for. For example, to determine the average RNA concetration per strain:
```{r eval = FALSE}
experiment_info %>%
group_by(`Yeast Strain`) %>%
summarize(mean_concentration = mean(`Nucleic Acid Conc.`))
```
or summarize using more than one column:
```{r eval = FALSE}
experiment_info %>%
group_by(`Yeast Strain`) %>%
summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
mean_total_RNA = mean(`Total RNA`))
```
**NOTE**: Our table does not have missing values. However, we have built-in functions in
R that can deal with this. We can filter values that are *not* NA:
```{r eval = FALSE}
experiment_info %>%
group_by(`Yeast Strain`) %>%
summarize(mean_concentration = mean(`Nucleic Acid Conc.`, na.rm = TRUE), # remove NA values
mean_total_RNA = mean(`Total RNA`))
experiment_info %>%
filter(!is.na(`Nucleic Acid Conc.`)) %>% # filter out the values that are not NAs
group_by(`Yeast Strain`) %>%
summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
mean_total_RNA = mean(`Total RNA`))
```
We can now arrange the data in ascending order (default of `arrange()`) or descending using `desc()`:
```{r eval = FALSE}
experiment_info %>%
filter(!is.na(`Nucleic Acid Conc.`)) %>% # filter out the values that are not NAs
group_by(`Yeast Strain`) %>%
summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
mean_total_RNA = mean(`Total RNA`)) %>%
arrange(mean_concentration) # arrange new table in ascending mean concentrations
experiment_info %>%
filter(!is.na(`Nucleic Acid Conc.`)) %>% # filter out the values that are not NAs
group_by(`Yeast Strain`) %>%
summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
mean_total_RNA = mean(`Total RNA`)) %>%
arrange(desc(mean_concentration) # arrange new table in descending mean concentrations
```
Another useful function is `count()` to count categorical values:
```{r eval = FALSE}
experiment_info %>%
count(`Yeast Strain`)
```
### Combining Datasets using **join**
On many ocassions we will have more than one table that we need to extract information from. We can easily do this in R using the family of `join` functions.
We are going to first download extra information about the samples that we are are investigating and assigning them to an object called *ena_info*:
```{r}
ena_info <- read_tsv(file = 'https://osf.io/6s4cv/download')
sample_mapping <- read_tsv(file = 'https://osf.io/uva3r/download')
```
```{r echo = FALSE, eval = TRUE}
ena_info <- read_tsv(file = 'https://osf.io/6s4cv/download')
```
### Exercise 7
>Explore your new dataset ena_info and sample_mapping. What commands can you use?
<details>
<summary><strong><font color="#6B33FF">Solution-#7</font></strong></summary>
```{r exercise7, eval = TRUE}
head(ena_info)
tail(ena_info)
str(ena_info)
class(ena_info)
dim(ena_info)
summary(ena_info)
# and the same with sample_mapping!
```
</details>
The `join` functions allow you to merge tables on columns/rows characteristics, so that you can do cool stuff such as (taken from R cheatsheets!):
![](_static/join.png)
We have run accession numbers in both our tibbles `sample_mapping` and `ena_info`. We want to merge both datasets to link the information present in both tables in one big table that contains everything. We need a column that has the same name in both tables.
**Challenge**: Which function would you use if the column names were not the same?
<details>
<summary><strong><font color="#6B33FF">Solution</font></strong></summary>
```{r}
sample_mapping <- rename(sample_mapping, run_accession = RunAccession) # would rename a column called Sample into sample_number to match with the column sample_number in ena_info
yeast_metadata_right <- right_join(ena_info, sample_mapping, by = "run_accession") # this will join both tables matching rows from experiment_info in ena_info
```
</details>
```{r eval = TRUE, echo = FALSE}
sample_mapping <- read_tsv(file = 'https://osf.io/uva3r/download')
sample_mapping <- rename(sample_mapping, run_accession = RunAccession) # would rename a column called Sample into sample_number to match with the column sample_number in ena_info
```
That is a big table!
### Exercise 8
>How would you merge both tables so that only the rows that are common between both tables are preserved?
<details>
<summary><strong><font color="#6B33FF">Solution-#8</font></strong></summary>
```{r exercise8, eval = TRUE}
yeast_metadata_inner <- inner_join(ena_info, sample_mapping, by = "run_accession")
head(yeast_metadata_inner)
yeast_metadata_left <- left_join(ena_info, sample_mapping, by = "run_accession")
head(yeast_metadata_left)
```
</details>
We will work from now onwards with the tibble `yeast_metadata_inner`.
### Exercise 9
>We do not want all the columns; we want to create a new tibble called `yeast_metadata` that contains only run accession, experiment alias, read count, fastq bytes and md5, lane, yeast strain and biological replicate. Rename the column names so that all of the column names are lower_case (lowercase followed by underscore). And include only data from lane number 1. Use pipes!
<details>
<summary><strong><font color="#6B33FF">Solution #9</font></strong></summary>
```{r exercise9, eval = TRUE}
yeast_metadata <- yeast_metadata_inner %>%
rename(yeast_strain = Sample, lane = Lane, biol_rep = BiolRep) %>%
filter(lane == 1) %>%
select(run_accession, experiment_alias, read_count, fastq_bytes, fastq_md5, lane, yeast_strain, biol_rep)
head(yeast_metadata)
```
</details>
We also want to create a table called `salmon_samples` that will include our 6 samples that we used in `salmon`, namely samples 'ERR458493','ERR458494','ERR458495','ERR458500','ERR458501'and 'ERR458502'.
```{r eval = FALSE}
samples <- c('ERR458493', 'ERR458494', 'ERR458495', 'ERR458500', 'ERR458501', 'ERR458502') # create a vector that includes the samples that we want to subset
salmon_samples <- yeast_metadata_inner %>%
filter(run_accession %in% samples) # filter rows based on these sample names
```
## Making plots with ggplot2
We have learned how to manipulate datasets and combine them. We are now going to the
basics of constructing a plot in `ggplot2` which is part of `tidyverse`.
The basic *syntax* of `ggplot2` is:
```{r eval = FALSE}
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
<GEOM_FUNCTION>()
```
We are going to explore our `yeast_metadata` tibble.
We build plots in 'layers':
```{r eval = TRUE}
ggplot(data = yeast_metadata)
```
Specifying `data` binds the plot to a specific data frame.
```{r eval = TRUE}
ggplot(data = yeast_metadata, mapping = aes(x = read_count, y = fastq_bytes))
```
defines the mapping using `aes` (aesthetics of the plot) by selecting the variables to be plotted and how they will be plotted (e.g. x/y, size, shape, color...)
```{r eval = TRUE}
ggplot(data = yeast_metadata, mapping = aes(x = read_count, y = fastq_bytes)) +
geom_point()
```
which sets what type of plot we want to have (boxplot, lines, bars):
- `geom_point()` for scatter plots, dot plots, etc.;
- `geom_boxplot()` for boxplots;
- `geom_line()` for trend lines, time series, etc.
We can modify plots to extract more information. We can add transparency:
```{r eval = TRUE}
ggplot(data = yeast_metadata, mapping = aes(x = read_count, y = fastq_bytes)) +
geom_point(alpha = 0.1)
```
or change the color of the points:
```{r eval = TRUE}
ggplot(data = yeast_metadata, mapping = aes(x = read_count, y = fastq_bytes)) +
geom_point(alpha = 0.1, color = "red")
```
Or color each strain differently:
```{r eval = TRUE}
ggplot(data = yeast_metadata, mapping = aes(x = read_count, y = fastq_bytes)) +
geom_point(alpha = 1, aes(color = yeast_strain))
```
We can also specify the color inside the mapping:
```{r eval = TRUE}
ggplot(data = yeast_metadata, mapping = aes(x = read_count, y = fastq_bytes, color = yeast_strain)) +
geom_point(alpha = 1)
```
or try different geom layers:
```{r eval = TRUE}
ggplot(data = yeast_metadata, mapping = aes(x = read_count, y = fastq_bytes, color = yeast_strain)) +
geom_jitter(alpha = 1)
```
We can use boxplots to see the distribution of reads within strains:
```{r eval = TRUE}
ggplot(data = yeast_metadata, mapping = aes(x = yeast_strain, y = read_count)) +
geom_boxplot()
```
This is useful, but if we add dots to the boxplots we will have a better idea of the number of measurements and their distribution:
```{r eval = TRUE}
ggplot(data = yeast_metadata, mapping = aes(x = yeast_strain, y = read_count)) +
geom_boxplot(alpha = 0.1) +
geom_jitter(alpha = 1, color = "tomato")
```
### Exercise 10
>Replace the boxplot with a violin plot.
<details>
<summary><strong><font color="#6B33FF">Solution-#10</font></strong></summary>
```{r exercise10, eval = TRUE}
ggplot(data = yeast_metadata, mapping = aes(x = yeast_strain, y = read_count)) +
geom_violin(alpha = 0.1) +
geom_jitter(alpha = 1, color = "tomato")
```
</details>
### Exercise 11
>Represent the read_count on log10 scale. Check out `scale_y_log10()`.
<details>
<summary><strong><font color="#6B33FF">Solution-#11</font></strong></summary>
```{r exercise11, eval = TRUE}
ggplot(data = yeast_metadata, mapping = aes(x = yeast_strain, y = read_count)) +
scale_y_log10() +
geom_boxplot(alpha = 0.1) +
geom_jitter(alpha = 1, color = "tomato")
```
</details>
### Exercise 12
>Try to make a histogram plot for the read counts, coloring each yeast strain.
<details>
<summary><strong><font color="#6B33FF">Solution-#12</font></strong></summary>
```{r exercise12, eval = TRUE}
# Basic histogram
ggplot(data = yeast_metadata, aes(x=read_count)) +
geom_histogram()
# Change colors
p<-ggplot(data = yeast_metadata, aes(x=read_count)) +
geom_histogram(color="black", fill="white")
p
# Change colors based on yeast strain
ggplot(data = yeast_metadata, aes(x=read_count, fill = yeast_strain)) +
geom_histogram(color="black")
# Facet based on yeast strain
ggplot(data = yeast_metadata, aes(x=read_count, fill = yeast_strain)) +
geom_histogram(color="black") +
facet_grid(yeast_strain~.)
# Change to custom colors that are color blind friend
ggplot(data = yeast_metadata, aes(x=read_count, fill = yeast_strain)) +
geom_histogram(color="black") +
facet_grid(yeast_strain~.) +
# Add blue and yellow colors that are more colorblind friendly for plotting
scale_fill_manual(values = c("cornflowerblue", "goldenrod2"))
# Density based on yeast strain
ggplot(data = yeast_metadata, aes(x=read_count, fill = yeast_strain)) +
facet_grid(yeast_strain~.) +
geom_density(alpha = 0.5) +
scale_fill_manual(values = c("cornflowerblue", "goldenrod2"))
# A white background might be preferable for the yellow color
ggplot(data = yeast_metadata, aes(x=read_count, fill = yeast_strain)) +
facet_grid(yeast_strain~.) +
geom_density(alpha = 0.5) +
scale_fill_manual(values = c("cornflowerblue", "goldenrod2")) +
theme_bw()
```
</details>
### Exercise 13
>What if we want to add the mean read counts in a vertical line?
<details>
<summary><strong><font color="#6B33FF">Solution #13</font></strong></summary>
```{r exercise13, eval = TRUE}
# To do so, we need to calculate the mean_read_count in a new data frame
mean_yeast_data <- yeast_metadata %>%
group_by(yeast_strain) %>%
summarise(mean_read_count = mean(read_count))
head(mean_yeast_data)
# Add the mean read_count with vline
ggplot(data = yeast_metadata, aes(x=read_count, fill = yeast_strain)) +
geom_density(alpha = 0.5) +
facet_grid(yeast_strain~.) +
geom_vline(data = mean_yeast_data, mapping = aes(xintercept = mean_read_count),
linetype="dashed", size=0.5) +
theme_bw() +
scale_fill_manual(values = c("cornflowerblue", "goldenrod2"))
```
</details>
## Acknowledgements
This lesson was inspired by and adapted from the [Data Carpentry Ecology R lessons](https://datacarpentry.org/R-ecology-lesson/index.html)
# More material
There are many amazing resources and cheat sheets to continue learning R, including:
- [R Cookbook](http://www.cookbook-r.com/)
- [Data Wrangling Cheat Sheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
- [ggplot](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf)
- [R Colors](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf)
- [Software Carpentry R Lesson](http://swcarpentry.github.io/r-novice-gapminder/)