-
Notifications
You must be signed in to change notification settings - Fork 2
/
randomisation_slides.Rmd
251 lines (175 loc) · 8.11 KB
/
randomisation_slides.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
---
title: "Introduction to randomisation, bootstrap, and Monte Carlo methods in R"
subtitle: https://stirlingcodingclub.github.io/randomisation/randomisation_notes.html
author: "Brad Duthie"
date: "18 September 2019"
output:
beamer_presentation:
theme: "default"
colortheme: "default"
fonttheme: "default"
ioslides_presentation: default
slidy_presentation: default
header-includes:
- \usepackage{hyperref}
- \usepackage{tikz}
- \usepackage{caption}
colorlinks: true
linkcolor: blue
urlcolor: blue
---
```{r, echo = FALSE}
library(knitr);
opts_chunk$set(echo = FALSE);
```
## Randomisation in R for statistical analyses
**Being able to generate random numbers or sample random subsets of a list is very useful for all types of modelling** \pause
1. The R programming languages includes many base functions for random number and list sampling
- `rnorm`, `runif`, `rbinom`, `rpois`
- `sample` \pause
2. Look at each of these functions briefly, then show how they can be used in frequentist statistics
- Hypothesis testing (i.e., getting p-values)
- Confidence intervals
## Random sampling using base functions in R: `rnorm`
```{r, echo = TRUE}
random_norms <- rnorm(n = 10000, mean = 0, sd = 1);
```
\vspace{4 mm}
```{r, echo = FALSE, fig.height = 6}
hist(random_norms, xlab = "Random value (X)", col = "blue",
main = "", cex.lab = 1.5, cex.axis = 1.5);
```
## Random sampling using base functions in R: `runif`
```{r, echo = TRUE}
rand_unifs <- runif(n = 10000, min = 0, max = 1);
```
\vspace{4 mm}
```{r, echo = FALSE, fig.height = 6}
hist(rand_unifs, xlab = "Random value (X)", col = "blue",
main = "", cex.lab = 1.5, cex.axis = 1.5);
```
## Random sampling using base functions in R: `rbinom`
The `rbinom` function is a bit more tricky.
- `n`: Number of observations
- `size`: Number of independent trials *within an observation*
- `prob`: Probability that a trial is successful
If we want to flip one fair coin 1000 times:
```{r, echo = TRUE}
rand_binoms <- rbinom(n = 1, size = 1000, prob = 0.5);
```
If we want to flip 10 fair coins 1000 times:
```{r, echo = TRUE}
rand_binoms <- rbinom(n = 10, size = 1000, prob = 0.5);
```
## Random sampling using base functions in R: `rbinom`
```{r, echo = TRUE}
rand_binoms <- rbinom(n = 10000, size = 1000, prob = 0.5);
```
\vspace{4 mm}
```{r, echo = FALSE, fig.height = 6}
hist(rand_binoms, xlab = "Random value (X)", col = "blue",
main = "", cex.lab = 1.5, cex.axis = 1.5);
```
## Random sampling using base functions in R: `rpois`
```{r, echo = TRUE}
rand_poissons <- rpois(n = 10000, lambda = 4);
```
\vspace{4 mm}
```{r, echo = FALSE, fig.height = 6}
hist(rand_poissons, xlab = "Random value (X)", col = "blue",
main = "", cex.lab = 1.5, cex.axis = 1.5);
```
<!--- Remove this slide for now.
## The availability of randomisation for statistical models
Computational speed greatly increased in the late 20$^{th}$ century
- New methods for approaching statistics became possible
- Programming tools makes custom analyses easier \pause
Coding in R with randomisation \& resampling very useful
- Custom analyses for non-standard statistical questions
- Modelling biological systems accessible (e.g., IBMS)
--->
## The logic of randomisation for frequentist statistics
**Logic of frequentist statistics becomes more intuitive when using randomisation and resampling methods** \pause
- **P-value**: For a particular statistical model, given that the null hypothesis is true, what is the probability of getting a statistic as or more extreme than the one observed? \pause
- **Confidence interval**: An interval within which the parameter mean would be expected to be contained with some confidence (e.g., 95\%) if the population were repeatedly resampled \pause
**By using randomisation and re-sampling, we can do statistical tests using a process that better reflects the logic underlying the frequentist approach.**
## The basic idea: a t-test
```{r, echo = TRUE}
data(iris); # Remove one of the three species below
dat <- iris[iris[,5] != "setosa",];
dat <- dat[,c(1, 5)]; # Remove unneeded columns
```
```{r, echo = FALSE}
library(knitr)
kable(dat[1:5,]);
```
- Two species include *Iris versicolor* and *I. virginica*
- Want to know difference between sepal lengths is significant
## The basic idea: a t-test {.smaller}
```{r, echo = TRUE}
t.test(dat[,1] ~ dat[,2], alternative = "two.sided");
```
## The basic idea: a t-test
**A different way to state the null hypothesis**: *Species identity does not have any effect on the mean difference between sepal lengths*
1. If we randomly assign species to sepal lengths, then calculate the difference between species means, the size of this difference is due to chance
2. If we repeat step 1 many times, we can generate a null distribution of differences between species means (i.e., what magnitudes of differences between sepal lengths are expected by chance)
3. We can compare the *actual observed* difference between species mean sepal lengths to the null distribution generated in step 2.
Proportion of differences that are as extreme or more extreme than the *actual observed* difference in the null distribution **is a p-value**.
## Randomisation to test for a difference between means
We can code the algorithm 1-3 below
```{r, echo = FALSE}
obs_diff <- mean(dat[dat[,2] == "versicolor", 1]) - mean(dat[dat[,2] == "virginica", 1])
```
```{r, echo = TRUE}
iter <- 99999; # Total iterations
diff <- NULL;
N <- dim(dat)[1]; # Total number of sepals
for(i in 1:iter){
sepal_smp <- sample(x = dat[,2], size = N);
versicolor <- which(sepal_smp == "versicolor");
virginica <- which(sepal_smp == "virginica");
mn_samp_ve <- mean(dat[versicolor, 1]);
mn_samp_vi <- mean(dat[virginica, 1]);
diff[i] <- mn_samp_ve - mn_samp_vi;
}
```
We now have a null distribution of mean differences between sepal lengths (`diff`)
## Randomisation to test for a difference between means
```{r, echo = FALSE, fig.height = 7}
hist(diff, xlab = "Null difference between sepal lengths (versicolor - virginica)", col = "blue",
main = "", cex.lab = 1.5, cex.axis = 1.5, xlim = c(-0.7, 0.6));
```
## Randomisation to test for a difference between means
```{r, echo = FALSE, fig.height = 7}
hist(diff, xlab = "Null difference between sepal lengths (versicolor - virginica)", col = "blue",
main = "", cex.lab = 1.5, cex.axis = 1.5, xlim = c(-0.7, 0.6));
arrows(x0 = obs_diff, x1 = obs_diff, y0 = 2000, y1 = 10, lwd = 3, length = 0.1);
```
## Randomisation to test for a difference between means
The actual difference between versicolor and virginica group means is `r obs_diff`.
```{r, echo = TRUE}
more_extreme <- sum(abs(diff) >= abs(obs_diff));
```
There were `r more_extreme` values as or more extreme in the null distribution generated with 99999 values
$$P < \frac{0+1}{99999 + 1} = 0.00001.$$
This is consistent with the value from `t.test`
## Practice problems
1. Generate a histogram of random numbers sampled from a gamma distribution with a shape parameter ($k$) of 2 and a scale parameter ($\theta$) of 3 using the `rgamma` function in base R.
2. Using the base R data set `trees`, calculate 95% confidence intervals for tree Girth, Height, and Volume. Note, to read in the data, type `data("trees")` in the console. You should then have the data table `trees`, with three columns and 31 rows ([appendix](https://stirlingcodingclub.github.io/randomisation/randomisation_notes.html#appendix) in the [notes](https://stirlingcodingclub.github.io/randomisation/randomisation_notes.html) has a bootstrap function).
3. Using the base R data set `mtcars`, use a randomisation test to see if there is a significant correlation between car miles per gallon (`mpg`) and car weight (`wt`)
\vspace{3mm}
<details>
```{r}
data(mtcars);
true_cor <- cor(mtcars[,1], mtcars[,6]);
iters <- 999;
the_cors <- NULL;
for(i in 1:iters){
samp_wt <- sample(mtcars[,6], size = dim(mtcars)[1]);
the_cors[i] <- cor(mtcars[,1], samp_wt);
}
more_extr <- sum(abs(the_cors) >= abs(true_cor));
p_val <- (more_extr + 1) / (iters + 1);
```
\hrule
**Notes**: https://stirlingcodingclub.github.io/randomisation/randomisation_notes.html