-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path03-method.Rmd
430 lines (328 loc) · 30.1 KB
/
03-method.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
# Somwehere between basic and useful
## Addressing
### Vectors
When you deal with variables you often will want to use only a part of it in your work. Other times you will want to get rid of some values which follow certain criteria. In order to do it you need to know an 'address' of particular value in a *vector*, *data frame*, *list* etc. From now on, I will use some functions while describing it *at hoc*, since I believe the best way to learn them, is to use them. Lets create a *vector* and see what we can do with it.
```{r echo = T, eval = T}
addressVec <- seq(from = 1, to = 20, by = 2)
addressVec
```
Instead of typing all the numbers by hand, we can use `seq()` function, to generate it automatically. This function takes two arguments `from` and `to`, however additionally we can use option `by` which defines increment of the sequence. As your knowledge is growing I can tell you a secret. Often, there is no need to name all the arguments and options in functions body. In the beginning you should use the names, but the more experienced you get, you will notice that you omit them often. Actually we nearly always omit so called *default* options of a function. Use `?seq` to check the help page for this function. You will see that it can take more options than `from`, `to` and `by`, but since they are predefined we don't need to bother and type them (as long as we are OK with *default* settings). So, lets retype our *variable* definition:
```{r echo = T, eval =T}
addressVec <- seq(1, 20, 2)
addressVec
```
The results are identical. Lets get back to addressing issues. Suppose you want to check the fifth element of our `addressVec` variable. To do it we use variable name followed by element number in brackets (or square brackets, if you will).
```{r echo = T, eval =T}
addressVec[5]
```
If you want to get rid of fifth element just precede it with `-`.
```{r echo = T, eval =T}
addressVec[-5]
```
Easy. The thing that is worth to mention right now is that **R Core Team** have reason and dignity of human beings so they start numeration of elements with value 1. In some other languages, because of no sensible reasons, numeration starts with 0 -- which is horribly annoying.
What to do if we want to extract values of more than one element?
```{r echo = T, eval =T, error = T}
addressVec[1,5]
addressVec[-1,-5,-6]
```
Error occurs with warning that there is incorrect number of dimensions. That's because *vectors* have only one dimension -- length. Within square brackets comma separates dimensions. So when we used command `[1,5]` we told R to look for value that address is number 1 in first dimension and number 5 in second dimension. To correct the mistake, we need to provide a vector of numbers from first dimension. We can use `c()` function to create *ad hoc vector* inside square brackets.
```{r echo = T, eval =T, error = T}
addressVec[c(1,5)]
addressVec[-c(1,5,6)]
```
Now something more complicated. Try to figure out what code below does, and what is the result of `which()` function:
```{r echo = T, eval =T, error = T}
addressVec[which(addressVec >= 6)] <- 'o..O'
addressVec
```
If you have problems, you can think of this example as a composition of actions. First there is expression `addressVec >=6`, than we use `which()` function with argument from previous expression. Next the result is passed as an address. Last thing is assignment of new value to...
### Data frames (and matrices)
When dealing with *data frames*, addressing is possible in two ways. First is similar to *vector* addressing. However, as *data frames* have two dimensions, you will need to express them like `df[1, 2]` -- which tells **R** to look up the cell in row 1, column 2. But what if you want to check all the cells in particular row or column? Omit the number, but not the coma -- like this: `df[, 3]`. It will display all row values from column 3. And of course you can still use *vectors* when addressing. This way also works for *matrices*.
```{r echo = T, eval =T, error = T}
addressDF <- data.frame(C1 = seq(1,10), C2 = seq(11,20), C3 = seq(21,30))
addressDF[5, 2]
addressDF[6, ]
addressDF[c(6, 7), 1:3]
```
The second option is to use `$` sign followed with column name (and it works only for *data frames* and *lists*). This way, we tell **R** to look in whole column. If we want to display values from particular cells, we put them in brackets after column name, the same way as we do for *vectors*:
```{r echo = T, eval =T, error = T}
addressDF$C2[5]
addressDF$C3[c(6:9)]
```
When working with *matrices* you will find a lot of functions that apply to them (check help page for `base` package). The one you should familiarize with, if you plan on working with *matrix* operations are:
* `dim()` -- returns dimensions of *matrix*
* `rownames()` and `colnames()` -- to specify row and column names
* `cbind()` and `rbind()` -- to add columns or rows to a *matrix* (respectively)
* `lower.tri()`, `upper.tri()` -- lower and upper triangular part of *matrix*
* `t()` -- *matrix* transposition
* `outer()` -- outer products of *matrix*
* `rowsums()`, `rowmeans()`. `colsums()` and `colmeans()` -- tool to calculate sum and mean of rows or columns
* `%*%` -- *matrix* multiplication operator (`matrixOne %*% matrix2`)
* `diag()` -- diagonals of a *matrix*
You can also check *help* for `Matrix` package which contains dozens of functions to deal with *matrix* algebra and *matrix* transformations.
### Lists
As mentioned before each *list* element can have different structure. Thus, addressing is kind of a mixture of all above. First you need to address element of your *list*. You do it with double brackets containing element number following name of variable. You can also use `$` sign with name of the element. Than you use addressing the same way you address *data frames*, *vectors* or *matrices*.
```{r echo = T, eval =T, error = T}
addressList <- list(x1 = c('a', 'b'), x2 = 1:4, x3 = matrix(c(1:6), nrow = 2))
addressList[[2]]
addressList$x2[3]
addressList[[3]]
addressList[[3]][, 3]
```
OK, now simple task for you. There is a function called `colnames()` which allow us to change names in *matrix* variable. To use it you need to put name of *matrix* variable in the parentheses (like this: `colnames(x)`), and assign value to it (e.g. `colnames(x) <- c('one', 'two')). Now change names of of columns in *matrix* that is element of list above to `I`, 'Love', 'R'.
## Control flow
Controlling flow of function, script or whole program is the essence of programming. In general we can divide procedures in two big categories: **conditional control** and **iteration control**. In the first category we will find `if...else` statements. They change execution of code depending on state of input or previous results. In the second category we will find tools to repeatedly do execute some code. Typically tools to control repeating of code are `for` and `while` loops. However, in R there are special functions, which are basically `for` loops but have slightly better performance, and are shorter to write. We call them `apply` functions family.
### `if...else` and `ifelse` statments
The simplest way to control how the code is executed are conditions. In an abstract form it is something like telling your machine: if the state is red make it blue, else make it green. The most general form it is written like below:
```{r echo = TRUE, eval = FALSE}
if (x) { #writting just X without any condtition is evaluated as x == TRUE
function(x)
}
####rest of the code####
#Or...
if (x) {
function(x)
} else { #in this example this block will be evaluated if x == FALSE
function(y)
}
```
As you see, writing conditionals in their basic form is quite easy. If our script can be in just on of two different states while it start evaluating conditions both of above examples will be suitable. In practice it happens quite often that we need to plan different actions because of dichotomy in the results of previous actions. If our code in this place is going to make some simple things you can consider using `ifelse()` shortcut. It works like below:
```{r}
coinToss <- sample(c(0,1), 3, replace = T)
coinToss
ifelse(coinToss > 0, 'Win!', 'Looser!')
```
But what if your script can have more than two states when you start evaluating conditions? You will need to use `else if()` structure. A simple example is below:
```{r}
coinToss <- sample(c(0,1), 3, replace = T)
sumOfCoins <- sum(coinToss)# we can have four states here 0,1,2,3
sumOfCoins
if (sumOfCoins == 0) {
'Looser'
} else if (sumOfCoins == 1) {
'Barely alive'
} else if (sumOfCoins >= 2) {
'Big Winner!!!'
}
#Or because the functions realting to particular conditions are simple, you
#can make nested ifelse
ifelse(sum(sumOfCoins) == 0, 'Looser',
ifelse(sum(sumOfCoins) == 1, 'Barely alive', 'Big Winner!!!'))
```
### `for` and `while` loops
In many cases we want to use the same function over data stored in different *vectors*, columns or data sets. Other time we need to make sequence of computations in which every result depends on the previous one (like, for instance, in [Fibonacci sequence](https://en.wikipedia.org/wiki/Fibonacci_number)). In yet another situation we need to run function over data set, but only till we reach specific value. Of course, we can do everything manually, by copy-pasting code, taping *ctrl+enter* several times and so on. In every programming language we will find tools that help us automatize this procedures, and by side effect make them less prone to error. Those are called loops. Namely, the most common loops used are `for` and `while`. The difference between them, is that in first one we define number of iterations and in the second one we define condition that must be met so the loops is executed. In other words `for (i in 1:10) {x + i}` means that numbers from 1 to 10 will be added to x in a sequence. `while (x < 5) {print(x <- x - 1)}` on the other hand will subtract value `1` from `x` as long as `x` is greater than `0`. You thing that is worth to mention is that `while` loops are general form and `for` loops are special cases. It means that it is possible to rewrite any `for` loop into `while` loop, but not the other way.
Also when working in loops you should always preallocate memory for your output. In simple words it means, that if you are writing loop, that grows the vector instead of just substitute values, **R** needs to copy your previous results in each iteration. Thus making a *vector* of length of your number of iterations (or a *list*) before you start a loop will save you some computational time.
To illustrate how it works in practice lets look on some simple examples:
```{r}
exampleDF <- data.frame(norm = rnorm(100, 5, 1.5),
chisq = rchisq(100, 3),
pois = rpois(100, 3.33))
distroMeans <- vector('list', length(exampleDF)) #prealocate space
for (i in seq_along(exampleDF)) {
distroMeans[[i]] <- summary(exampleDF[[i]])
}
distroMeans
```
Above example is very simple, and actually can be made without loop. Try to call from your console following command `summary(exampleDF)`. This is called vectorization which will be explained few paragraphs below.
But lets get back to loops and write something that might be useful (yet very simple). Assume that you are investigating animals reproduction. You have a herd that counts 60 animals. The reproduction factor is `1.34`, and survival factor is '0.75'. So the overall growth factor can be calculated as *reproduction factor* \* *survival factor*. In each generation there is some noise in reproduction, i.e. from -3 to 2 animals are born with equal probability. This means, that number of animals in each generation directly depends of the outcome of previous generation. We want to make a chart (Fig. \@ref(fig:popPlot)) to visualize population dynamics in `100` generation, thus we decide to use `while (time < 101)` loop.
```{r popPlot, fig.cap = 'Population dynamics', fig.height = 3.5}
#define initial parameters
N_0 <- 100
rf <- 1.34
sf <- 0.75
gf <- rf*sf
n_gen <- 1:101
#we could repeat this inside loop, however we know the number of iterations
#thus calculating noise vector outside loop is better for performance
noise <- sample(-3:2, 100, replace = T)
t <- 1
geners <- vector('integer', length = length(n_gen))
geners[1] <- N_0
while (t < 101) {
geners[c(t + 1)] <- floor((geners[t] * gf + noise[t]))
t <- t + 1
}
plot(n_gen, geners, type = 'l',
main = 'Population dynamics',
xlab = 'generation', ylab = 'number of animals',
col = 'blue', lty = 3)
```
It appears that we found a way to make a species extinct...
### `apply` function family
For simple loops there is a shortcut. Namely, there is an `apply` function family that performs defined function over vector, list, matrix etc. Probably your need to use it will be reduced to `lapply()` and `sapply()` function in the beginning, thus I will focus on them here. From my experience it is better to first understand and correctly use `for` and `while` loops, than switch to `lapply()` and `sapply()` and than learn how to correctly use functions like `tapply()`, `mapply()`, `vapply()`. `sapply()` generally is the same as `lapply()`, the only difference is that the later function results in *list* of the same length as first argument. Each element of list contains results of function that was applied to the first argument. Simplified version tries to simplify results to *vector*, *matrix* or *array*. One tricky thing is that we we want to use function that takes more arguments than just our arguments of values over which we want to iterate we need to use a bit different syntax. In general, lets say our list is named as `X`, our function is `SomeFunction`, which takes arguments `A` and `B` (`SomeFunction(A, B)). Lets asume that each element of `X` is used as an argument `B` in consequent iterations, while `A` is fixed and equals `10`. The correct for of using lapply (or sapply) will be: `lapply(X, SomeFunction, A = 10)`.
```{r}
testApply <- list(x = 1:10, b = rep(c(4,5), times = 5), c = rpois(5, 25))
testLapply <- lapply(testApply, mean)
testSapply <- sapply(testApply, mean)
#Here we combine previous results with another lapply,
#and use function with multiple arguments.
lapply(testSapply, rpois, n = 15)
head(sapply(testSapply, rpois, n = 15))
```
It is also possible to use so called anonymous functions (simplifying -- function that does not have name and exists temporally) as an argument.
```{r}
testApplyFun <- 1:5
lapply(testApplyFun, function(x) x + 0.5)
sapply(testApplyFun, function(x) x + 0.5)
```
On the basis of previous code that calculated population dynamics, we can write general function and use it with `lappy()` to estimate different population dynamics depending on initial population size. Than, by addition of consequent lines representing populations with different starting points we create plot for all 4 populations (Fig. \@ref(fig:lapplyPopPlot)).
```{r lapplyPopPlot, fig.cap = 'Population dynamics with different starting point', fig.height = 3.5}
popDyn <- function(N_0, rf = 1.34, sf = 0.75, n_gen) {
gf <- rf*sf
noise <- sample(-3:2, 100, replace = T)
t <- 1
geners <- vector('integer', length = n_gen)
geners[1] <- N_0
while (t < 101) {
geners[c(t + 1)] <- floor((geners[t] * gf + noise[t]))
t <- t + 1
}
return(geners)
}
N_0 <- c(50, 100, 150, 200, 250)
popSizes <- lapply(N_0, popDyn, n_gen = 101)
n_gen = 101
plot(1:n_gen, popSizes[[5]], type = 'l',
main = 'Population dynamics',
xlab = 'generation', ylab = 'number of animals',
col = 'blue', lty = 3, ylim = c(-10, max(popSizes[[5]])))
lines(1:n_gen, popSizes[[4]], col = 'green')
lines(1:n_gen, popSizes[[3]], col = 'red')
lines(1:n_gen, popSizes[[2]], col = 'yellow')
lines(1:n_gen, popSizes[[1]], col = 'black')
```
## Operation on Vectors
Old proverb:
> *The power of R is vectorization*
Indeed, vectorization is one of the most useful features of **R** environment. In short words, many functions are constructed in such a manner, that one can avoid using loops (since, they are often very slow). When you begin your journey with programming, you will probably not notice the difference in computation speed between *vectorized way* and *loop way*. Nonetheless, what will you find attractive, is that in many cases writing in vectorized form feels more natural. So how it works? Most basic functions are *written and compiled* in very fast, low level languages like **C** or **FORTRAN**. It makes computation way faster, but still we use easy **R** syntax. Instead of running the *precompiled* function on each of the elements of vector we actually can pass the vector into function -- and **R** will know what to do. The speed is achieved because when we use function on each element, **R** needs to figure out what to do several (or even hundred or thousands of) times. However, when we pass whole vector into function, it needs to find out what to do only once.
Sometimes, we need to use some other functions, which are not meant to process vectors (or you want to use some arbitrary functions on *matrix*, *data frame* or *list*). In those cases you will need to parse your data more than once to function. In **R** you can do it by using loops or by using so called `apply` family functions. When I was learning **R** it was most confusing thing to me, however once you get it, they become fairly easy to use. The most important difference for you as a beginner is that when using loops, you should make some memory allocation -- in human language, before you run loop, you should create vector which will store results. The loops achieve highest speed when your vector can fit all the values from loop. If you do not know how big your vector will be, you need to relay on **R** ability to re-allocate the memory, which is **S L O W**. `apply` family hopefully takes care of this problem, so every time you know how to -- use it. It will save you time and frustration of using loops. I will cover basic use of this functions later on examples.
## Randomization and distribution
Random sampling is quite easy task. There is nice `sample()` function, that allow you to do that. You can sample from *numeric*, *integer* or even *character* *vectors*. You can even assign probabilities to particular elements of vector (we will not need that at the moment). So lets make some fun and make virtual casino. We will take two dice and will roll it 1000 times. If the sum of each result is odd and smaller than `5` we will earn 30EUR. If it is odd and higher than `5`, we will get 50EUR. However if the sum is even number we loose 70EUR. Lets see if we can beat casino...
First lets define our dice and rolls.
```{r echo = T, eval =T, error = T}
niceDice <- seq(1,6)
rollDiceOne <- sample(niceDice, 1000, replace = T)
rollDiceTwo <- sample(niceDice, 1000, replace = T)
```
OK, you heard about power of vectorization already, so lets sum up both vectors:
```{r echo = T, eval =T, error = T}
sumDiceRolls <- rollDiceOne + rollDiceTwo
```
Next lets make some use of knowledge we got and substitute our results with some cash...
```{r echo = T, eval =T, error = T}
cashDiceRolls <- sumDiceRolls
cashDiceRolls[which(cashDiceRolls %% 2 == 1 & cashDiceRolls <= 5)] <- 30
cashDiceRolls[which(cashDiceRolls %% 2 == 1 & cashDiceRolls > 5)] <- 50
cashDiceRolls[which(cashDiceRolls %% 2 == 0)] <- -70
sum(cashDiceRolls)
```
OK. What happened? We put our substitutions in very wrong order... After first two substitutions we have only even numbers in our vector... So how to fix it? Start with assigning the highest absolute value, and after all change it to negative one.
```{r echo = T, eval =T, error = T}
cashDiceRolls <- sumDiceRolls
cashDiceRolls[which(cashDiceRolls %% 2 == 0)] <- 70
cashDiceRolls[which(cashDiceRolls %% 2 == 1 & cashDiceRolls > 5)] <- 50
cashDiceRolls[which(cashDiceRolls %% 2 == 1 & cashDiceRolls <= 5)] <- 30
cashDiceRolls[which(cashDiceRolls == 70)] <- -70
sum(cashDiceRolls)
```
Now, that's not very nice... We lost a lot of cash... But we can also track how our luck changed. Maybe we were above `0` for a while, before things got South? Or maybe we were doomed since the beginning? Lets make a plot (Fig. \@ref(fig:betaDensity-plot)) to evaluate our progress.
```{r echo = FALSE, cassino-plot, fig.cap = 'How to lose cash in casino', out.width='60%', fig.align= 'center'}
plot(seq(1,1000),cumsum(cashDiceRolls), type = 'l')
```
I think that your love to **R** is being deeper since I showed you how it can save a lot of your money!
In nowadays stochastic analyses are more, and more popular. With **R** it is quite easy to generate random numbers, sample and do all the *mumbo jumbo* on data. There are few functions that cover *classical* distributions: normal, Poisson, binomial, uniform (and few others), that are actually build in base **R** distribution (full list you will find [here](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Distributions.html)). Some more sophisticated stuff is usually covered by some packages you will need to download by yourself. You will easily find them by querying Google like this [Pearson distribution in R](https://www.google.com/search?client=ubuntu&channel=fs&q=Pearson+distribution+in+R&ie=utf-8&oe=utf-8&gfe_rd=cr&dcr=0&ei=KTI5Ws_7LfPBXrSBr7gO). By using this technique it is also possible to find other useful packages for dealing with distributions (try to search for package that allows you to sample from trimmed distribution).
All the distribution packages follow the same schema when naming functions. They use letters: `d`, `p`, `q` and `r` followed by abbreviation of distribution name to generate: density, distribution function, quantile function and random numbers -- respectively. For instance, lets generate ten random numbers from *beta distribution*, with parameters `10` and `3`:
```{r echo = T, eval =T, error = T}
rbeta(25, 10, 3)
```
Or we can make this nice plot (Fig. \@ref(fig:betaDensity-plot)) for density of *beta distribution* with parameters `8` and `3`:
```{r echo = FALSE, betaDensity-plot, fig.cap = 'Example of beta distribution', out.width='60%', fig.align= 'center'}
plot(seq(0, 1, length = 300), dbeta(seq(0, 1, length = 300), 8, 3), type = 'l')
```
## `tidyverse` idea and `dplyr` library
The whole idea of tidy data comes from one of most famous **R** developers -- [Hadley Wickham](http://hadley.nz/). In on of his papers [@hadley2014] he described procedures for generating and cleaning data in standardized manner. Many packages right now are designed to work best with data structured according to this publication. Eventually it lead to `tidiverse` -- tools tailored for data science with common syntax and philosophy [@R-tidyverse].
One of the most useful packages that are included in `tidiverse` [@R-tidyverse] are `tidyr` [@R-tidyr] and `dplyr` [@R-dplyr]. First helps us to swap from 'wide' to 'long' table format (and back). The second package contains set of tools to easily manipulate rows, columns or even single cells in *data frame*. It is extremely powerful tool, which speeds up work with data sets so much, that after few times dealing with it, you will leave traditional spread sheet forever.
### Wide vs. long tables
Lets start with changing wide format table into long format table.
```{r echo = TRUE, eval = TRUE}
wideTable <- data.frame(male = 1:10, female = 3:12)
wideTable
```
Than we just make a little *mumbo jumbo* and change it to long format:
```{r echo = TRUE, eval = TRUE, message = FALSE}
library('tidyr')
library('dplyr')
longTable <- gather(wideTable, key = sex, value = number)
longTable[7:12, ]
```
That is how easy it can be done. But what if table is more complicated?
```{r echo = TRUE, eval = TRUE}
wideTable2 <- data.frame(male = 1:10,
female = 3:12,
type = rep(c('bacteria', 'virus'), times = 5),
group = rep(c('a', 'b'), each = 5))
wideTable2
gather(wideTable2, sex, number, -c(3, 4))[7:13, ]
```
In example above, using expression `-(3, 4)` we indicated that we do not want to gather columns 3 and 4.
Even more complicated?
```{r echo = TRUE, eval = TRUE}
wideTable3 <- data.frame(male = rep(1:10, each = 2),
female = rep(3:12, times = 2),
type = rep(c('bacteria','virus'), times = 10),
group = rep(c('a','b'), each = 10),
day = 1:10)
wideTable3[7:13, ]
gather(wideTable3, sex, number, 1:2) %>%
spread(group, number) %>% .[7:13, ]
```
Here our data set had additional column `group` storing values `a` and `b`. But imagine that `group` is not a variable, but it just stores the names of variables - which are a and b. This somehow might be frustrating, to decide if those are really separate variables or not. You might run into problem, that your column named `environmental factor` contains values: *pH*, *conductivity*, and *oxygen concentration*. This would be straightforward as each of this is different variable and you should spread this column into three different variables. On the other hand you might see a column that contains *minimum temperature* and *maximum temperature*. This would not be as straightforward and decision upon spreading this column would strongly depend on the context. Nonetheless, using `spread()` function we were able to transform values from this column as separate columns. We also used *piping operator* `%>%`. It is a shortcut, which allows us to pass the left side as an argument to the function on the right side of operator. In general it means that writing `a %>% funtion(b)` actually is translated into `function(a, b)`. In the beginning this idea might be not very useful for you, but with time it will get helpful. Mainly because your code gets better structure and you can perform multiple operations without storing it in variables (which you do not want, since it consumes memory).
There is also one more common case that I should shortly mention - compound variable. It is a variable that stores multiple values in a single column. E.g. city and district, age and sex, sex and smoking, blood type and RH, etc. With `tidyr` is very easy to deal with it.
```{r echo = TRUE, eval = TRUE}
wideTable4 <- data.frame(type = rep(c('bacteria.a', 'virus.a'), times = 5))
wideTable4
wideTable4 %>% separate(type, c('organism', 'type'), sep = '\\.')
```
### World of `dplyr`
`dplyr` is a library containing several useful functions, designed to ease all kinds of transformations, selections and filtering of your data frame. As the number and possibilities of functions are really huge, here I will concentrate only on some mostly used ones. Of course there is a well described documentation if you want to get deeper.
### `select` columns and `filter` rows
Those are the most basic operations on *data frames*. `select()` function allows you to choose which columns you use. The biggest advantage of using this function instead of simple addressing, is that you can use some special functions inside it: `starts_with(), ends_with(), contains(), matches(), num_range()` which are helpful when using large data sets with numerous columns. There is also similar function `rename()` which can be used to change variable names, but in results (contrary to `select()`) it keeps all the variables.
Lets look how the thing works on some simple examples:
```{r echo = TRUE, eval = TRUE}
head(select(wideTable3, starts_with('gr')), 3)
head(select(wideTable3, -starts_with('gr')), 3)
head(select(wideTable3, contains('ale')), 3)
head(rename(wideTable3, Male = male), 3)
```
Filtering rows is also straightforward. Inside function `filter()` you can use logical operators (`&, |, xor, !`), comparisons (e.g. `==` or `>=`), or functions (`is.na(), between(), near()`).
```{r echo = TRUE, eval = TRUE}
filter(wideTable3, group == 'a')
filter(wideTable3, type != 'bacteria')
filter(wideTable3, near(female, 5))
```
### `mutate` and `transmute`
Both functions are widely used in data manipulation in **R**. Their main purpose is to create new variable (usually from existing ones) in data frame. Lets look on examples:
```{r echo = TRUE, eval = TRUE}
select(wideTable3, male) %>%
mutate(cumulativeMaleSum = cumsum(male)) %>%
head(5)
select(wideTable3, male, female) %>%
mutate(cumulativeMaleSum = cumsum(male),
femaleLog = log(female),
cumSumFemaleLog = cumsum(femaleLog)) %>%
head(5)
```
As you can see in second example, when we use `mutate()` function, the newly created variables (in this example `femaleLog`) are available immediately so we can use them to create another variable (in this case `cumSumFemaleLog`) within one function call. In this example, I also used piping operator, because calculating intermediate steps and storing them as a result, which can be used later is useless, as we are interested only in the final outcome. The biggest advantage of this procedure is that we keep our *Environment* clean and preserve memory -- which is very important in long and memory consuming projects. Last but not least, the difference between `mutate()` and `transmute()` is that the latter do not preserve all variables, only the ones you created.
### `group_by` and `summarise`
`summarise()` is commonly used function on grouped data. It allows to calculate many typical descriptive statistics (like `mean()` or `quantile()`) for particular groups in you data set, as well as it can count number of observations (`n()` function), or number of unique observations (`distinct()`). To see how it works in practice lets look back on our example and calculate mean value for `males` and `female`, as well as day range and number of cases for groups derived from `type`.
```{r echo = TRUE, eval = TRUE}
wideTable3 %>%
group_by(type) %>%
summarise(meanF = mean(female),
meanM = mean(male),
numberOfDays = (range(day)[2]-range(day)[1]),
numberOfCases = n())
```
Easy.
## Is there anything more in `dplyr` library?
Yes. Actually I presented here only very very tiny fracture of `dplyr` possibilities just to familiarize you with syntax and using piping operator. When you go deeper into world of data frames transformation, you will find other commonly used functions like: different kinds of joining data frames (similar to **SQL** joining tables), conditional selecting, filtering, renaming rows and columns, extracting values or arranging your data frame. Thankfully this procedures are so common that even if you won't grasp it immediately from functions description, you will still find hundreds of tutorials on the web, or help in *StackOverflow*.