---
title: "8.3 Experimental Design II: Randomized Complete Block Designs and Pseudo-replication"
author: "Lieven Clement"
date: "statOmics, Ghent University (https://statomics.github.io)"
---
<a rel="license" href="https://creativecommons.org/licenses/by-nc-sa/4.0"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>
```{r}
library(tidyverse)
```
# Randomized complete block designs
\[\sigma^2= \sigma^2_\text{bio}+\sigma^2_\text{lab} +\sigma^2_\text{extraction} + \sigma^2_\text{run} + \ldots\]
- Biological: fluctuations in protein level between mice, fluctuations in protein level between cells, ...
- Technical: cage effect, lab effect, week effect, plasma extraction, MS-run, ...
# Nature methods: Points of significance - Blocking
[https://www.nature.com/articles/nmeth.3005.pdf](https://www.nature.com/articles/nmeth.3005.pdf)
- One-way ANOVA is a special case of a completely randomized design (CRD):
  - the experimental units are sampled at random from the population,
  - the treatments are randomized to the experimental units,
  - every experimental unit receives one treatment and its response variable is measured once.
- In a block design the experimental units are blocks, which are sampled at random from the population of all possible blocks.
- The RCB restricts the randomization: the treatments are randomized within the blocks.
  - It therefore cannot be analysed using one-way ANOVA.
- The paired design is a special case of an RCB: the block size equals 2.
- Within-block effects can be assessed using the `lm` function in R (see the sketch below).
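As a minimal sketch of such an analysis (simulated data with arbitrary effect sizes, for illustration only; the mouse example below does the same on real data):
```{r}
# Minimal sketch: 4 blocks, both treatments measured within each block
set.seed(123)
block <- factor(rep(1:4, each = 2))
treatment <- factor(rep(c("A", "B"), times = 4))
y <- 10 + (treatment == "B") * 1 + # treatment effect
  rnorm(4, sd = 2)[block] +        # block-to-block variability
  rnorm(8, sd = 0.5)               # within-block error
summary(lm(y ~ treatment + block)) # treatment effect is assessed within block
```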
# Mouse example
## Background
```{r echo=FALSE, out.width="50%"}
knitr::include_graphics("https://raw.githubusercontent.com/statOmics/PSLS21/master/figures/mouseTcell_RCB_design.png")
```
Duguet et al. (2017) MCP 16(8):1416-1432. doi: 10.1074/mcp.m116.062745
- All treatments of interest are present within each block!
- We can estimate the effect of the treatment within each block!
We focus on one protein:
- The measured intensities are already on the log2-scale. Differences between the intensities can thus be interpreted as log2 FC.
- P16045 or Galectin-1.
- Function: "Lectin that binds beta-galactoside and a wide array of complex carbohydrates. Plays a role in regulating apoptosis, cell proliferation and cell differentiation. Inhibits CD45 protein phosphatase activity and therefore the dephosphorylation of Lyn kinase. Strong inducer of T-cell apoptosis." (source: [uniprot](https://www.uniprot.org/uniprot/P16045))
## Data Exploration
```{r}
mouse <- read_tsv("https://raw.githubusercontent.com/statOmics/PSLSData/main/mouseP16045.txt")
mouse
```
```{r}
mouse %>%
ggplot(aes(x = mouse, y = intensity, col = celltype)) +
geom_point()
```
```{r}
mouse %>%
ggplot(aes(x = celltype, y = intensity)) +
geom_point() +
geom_line(aes(group = mouse)) +
geom_point(aes(col = celltype))
```
The plots show evidence for
- an upregulation of the protein expression in regulatory T-cells and
- a considerable variability in expression from animal to animal!
## Paired Analysis
This is a paired design, the simplest form of a randomized complete block design.
In the introduction to statistical inference we would have analyzed the data by differencing.
```{r}
mouseWide <- mouse %>%
spread(celltype, intensity) %>%
mutate(delta = Treg - Tcon)
mouseWide
```
### Data exploration
- Boxplot of difference
```{r}
mouseWide %>%
ggplot(aes(x = "", y = delta)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter()
```
- Summary statistics
```{r}
deltaSum <- mouseWide %>%
summarize(
mean = mean(delta, na.rm = TRUE),
sd = sd(delta, na.rm = TRUE),
n = n()
) %>%
mutate(se = sd / sqrt(n))
deltaSum
```
Note that the intensity data are not independent, because we measured the expression in both regulatory and conventional T-cells for every animal.
- Covariance and correlation between expression in both celltypes
```{r}
cor(mouseWide[, c("Tcon", "Treg")])
var(mouseWide[, c("Tcon", "Treg")])
var(mouseWide[, c("Tcon", "Treg")]) %>%
diag() %>%
sqrt()
```
- There is indeed a large correlation between the expression of the protein in conventional and regulatory T-cells.
- Standard deviation of difference?
$$
\begin{array}{lcl}
\text{sd}_{x_r - x_c} &=& \sqrt{1^2\hat \sigma_r^2 + (-1)^2 \hat \sigma_c^2 + 2 \times 1 \times (-1) \times \hat\sigma_{r,c}}\\
&=&\sqrt{\hat \sigma_r^2 + \hat \sigma_c^2 - 2 \times \hat\sigma_{r,c}}
\end{array}
$$
```{r}
sdDelta2 <- (c(-1, 1) %*% var(mouseWide[, c("Tcon", "Treg")]) %*% c(-1, 1)) %>%
sqrt()
sdDelta2
seDeltaBar <- sdDelta2 / sqrt(deltaSum$n)
seDeltaBar
deltaSum
```
- The standard deviation of the difference is much lower because of the strong correlation!
- Note that the paired design enabled us to estimate the difference in log2 expression between the two cell types in every animal (the log2 FC).
```{r}
t.test(delta~1, mouseWide)
```
## Randomized complete block analysis
We can also analyse the design using a linear model, i.e. with
- a main effect for cell type and
- a main effect for the block factor mouse
Because we have measured the two cell types in every mouse, we can estimate the average log2-intensity of the protein in the T-cells of each mouse.
```{r}
lmRCB <- lm(intensity ~ celltype + mouse, mouse)
plot(lmRCB, which = c(1, 2, 3))
```
If you have doubts on whether your data violate the assumptions, you can always simulate data from a model with effects similar to yours, but for which all distributional assumptions hold, and compare the residual plots.
```{r}
design <- model.matrix(intensity ~ celltype + mouse, mouse)
sigmaMouse <- sqrt(car::Anova(lmRCB, type = "III")["mouse", "Sum Sq"] / car::Anova(lmRCB, type = "III")["mouse", "Df"])
betas <- lmRCB$coefficients
nMouse <- mouse$mouse %>%
unique() %>%
length()
par(mfrow = c(3, 3))
for (i in 1:9)
{
mouseEffect <- rnorm(nMouse, sd = sigmaMouse)
betasMouse <- mouseEffect[-1] - mouseEffect[1]
betas[-c(1:2)] <- betasMouse
ysim <- design %*% betas + rnorm(nrow(design), sd = sigma(lmRCB))
plot(lm(ysim ~ -1 + design), which = 1)
}
```
The deviations seen in our plot are in line with those observed in simulations under the model assumptions.
Hence, we can use the model for statistical inference.
```{r}
anovaRCB <- car::Anova(lmRCB, type = "III")
summary(lmRCB)
t.test(delta ~1 , mouseWide)
anovaRCB
```
Notice that
1. the estimate, standard error, t-test statistic and p-value for the cell type effect of interest are the same as in the paired t-test!
2. the ANOVA analysis shows that we decompose the total variability in the protein expression in T-cells into variability due to the cell type (CT), variability from mouse to mouse (M) and residual variability (R) within mouse that we cannot explain. Indeed,
$$
\begin{array}{lcl}
\text{SSTot} &=& \text{SSCT} + \text{SSM} + \text{SSE}\\
`r round(anovaRCB[-1,"Sum Sq"] %>% sum(),1)` &=& `r round(anovaRCB["celltype","Sum Sq"],1)` + `r round(anovaRCB["mouse","Sum Sq"],1)` + `r round(anovaRCB["Residuals","Sum Sq"],1)`
\end{array}
$$
So the celltype and the mouse effect explain respectively
$$
\begin{array}{ll}
\frac{\text{SSCT}}{\text{SSTot}}\times 100&\frac{\text{SSM}}{\text{SSTot}}\times 100\\\\
`r round(anovaRCB["celltype","Sum Sq"]/sum(anovaRCB[-1,"Sum Sq"])*100,1)`&
`r round(anovaRCB["mouse","Sum Sq"]/sum(anovaRCB[-1,"Sum Sq"])*100,1)`\\
\end{array}
$$
percent of the variability in the protein expression values and
$$
\frac{\text{SSE}}{\text{SSTot}} \times 100 = `r round(anovaRCB["Residuals","Sum Sq"]/sum(anovaRCB[-1,"Sum Sq"])*100,1)`
$$
percent cannot be explained.
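The decomposition can also be checked numerically from the ANOVA table (a small sketch that recomputes the numbers reported inline above):
```{r}
# Numerical check of SSTot = SSCT + SSM + SSE and the fractions explained
ssCT <- anovaRCB["celltype", "Sum Sq"]
ssM <- anovaRCB["mouse", "Sum Sq"]
ssE <- anovaRCB["Residuals", "Sum Sq"]
c(SSTot = ssCT + ssM + ssE, SSCT = ssCT, SSM = ssM, SSE = ssE)
round(c(celltype = ssCT, mouse = ssM, residuals = ssE) / (ssCT + ssM + ssE) * 100, 1)
```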
Note that
- the variability from mouse to mouse is the largest source of variability in the model,
- this variability can be estimated with the RCB design and
- can thus be isolated from the residual variability,
- leading to a much higher precision of the estimate of the average log2 FC between regulatory and conventional T-cells than what would be obtained with a CRD!
Note that the RCB also has to sacrifice a number of degrees of freedom to estimate the mouse effect, here 6 DF.
Hence, the power gain of the RCB is a trade-off between the variability that can be explained by the block effect and the loss in DF.
If you remember the equation of variance covariance matrix of the parameter estimators
$$
\hat{\boldsymbol{\Sigma}}^2_{\hat{\boldsymbol{\beta}}}
=\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\hat\sigma^2
$$
you can see that the randomized complete block will have an impact on
- $\left(\mathbf{X}^T\mathbf{X}\right)^{-1}$ as well as
- on $\sigma^2$ of the residuals!
$\rightarrow$ We were able to isolate the variance in expression between animals/blocks from our analysis!
$\rightarrow$ This reduces the variance of the residuals and leads to a power gain if the variability between mice/blocks is large.
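As a quick sanity check (a sketch using base R functions on the fitted model), this relation can be verified directly for `lmRCB`:
```{r}
# Verify Sigma_betahat = (X^T X)^{-1} * sigma^2 for the fitted RCB model
X <- model.matrix(lmRCB)
solve(t(X) %*% X) * sigma(lmRCB)^2 # manual calculation
vcov(lmRCB)                        # built-in equivalent, should match
```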
Also note that,
$$
\hat\sigma^2 = \frac{\text{SSE}}{n-p} = \frac{\text{SSTot} - \text{SSM} - \text{SSCT}}{n-p}
$$
- Hence, blocking is beneficial if the reduction in sum of squares of the residuals is large compared to the degrees of freedom that are sacrificed.
- Thus if SSM can explain a large part of the total variability.
Further, the degrees of freedom affect the t-distribution that is used for inference, which has broader tails when the residual degrees of freedom $n-p$ get smaller.
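For illustration (a small numerical example, not part of the original analysis): the two-sided 5% critical value of the t-distribution grows quickly when only a few residual degrees of freedom remain.
```{r}
# Two-sided 5% critical t-values for decreasing residual degrees of freedom
qt(0.975, df = c(2, 5, 10, 30, 100))
```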
# Power gain of an RCB vs CRD
In this section we will subset the original data into two experiments:
- A randomized complete block design with three mice
- A completely randomized design with six mice but where we only measure one cell type in each mouse.
```{r}
set.seed(859)
mRcb <- mouse %>%
pull(mouse) %>%
unique() %>%
sample(size = 3)
rcbSmall <- mouse %>% filter(mouse %in% mRcb)
rcbSmall
```
```{r}
mCrd <- mouse %>%
pull(mouse) %>%
unique() %>%
sample(size = 6)
crdSmall <-
bind_rows(
mouse %>%
filter(mouse %in% mCrd[1:3]) %>%
filter(celltype == "Tcon"),
mouse %>%
filter(mouse %in% mCrd[-(1:3)]) %>%
filter(celltype == "Treg")
)
crdSmall
```
So in both experiments we need to do six mass spectrometry runs.
```{r}
lmCRDSmall <- lm(intensity ~ celltype, crdSmall)
summary(lmCRDSmall)
anova(lmCRDSmall)
```
```{r}
lmRCBSmall <- lm(intensity ~ celltype + mouse, rcbSmall)
anova(lmRCBSmall)
summary(lmRCBSmall)
```
Note that
- we are able to pick up the upregulation of the protein in regulatory T-cells relative to conventional T-cells with the RCB, but not with the CRD.
- the standard error of the $\log_2\text{FC}_\text{Treg-Tcon}$ estimate is a factor `r round(summary(lmCRDSmall)$coef["celltypeTreg","Std. Error"]/summary(lmRCBSmall)$coef["celltypeTreg","Std. Error"],1)` smaller for the RCB design!
A data analyst who forgets about the blocking will be back at square one:
```{r}
wrongRbc <- lm(intensity ~ celltype, rcbSmall)
anova(wrongRbc)
summary(wrongRbc)
```
So the block-to-block variability is then absorbed into the residual variance estimator.
Of course, we are never allowed to analyse an RCB with a model for a CRD (without the block factor), because blocking imposes a randomization restriction: we randomize the treatment within the blocks.
## Power gain of blocking?
### Power for completely randomized design
```{r}
varBetweenPlusWithin <- sum(car::Anova(lmRCB, type = "III")[c("mouse", "Residuals"), "Sum Sq"]) / sum(car::Anova(lmRCB, type = "III")[c("mouse", "Residuals"), "Df"])
alpha <- 0.05
nSim <- 20000
b0 <- 0
sd <- sqrt(varBetweenPlusWithin)
ns <- c(3, 7)
deltas <- lmRCB$coefficients["celltypeTreg"]
L <- limma::makeContrasts("celltypeTreg", levels = c("(Intercept)", "celltypeTreg"))
powerFast <- matrix(NA, nrow = length(ns) * length(deltas), ncol = 3) %>% as.data.frame()
names(powerFast) <- c("b1", "n", "power")
i <- 0
for (n in ns)
{
n1 <- n2 <- n
### Simulation
predictorData <- data.frame(celltype = rep(c("Tcon", "Treg"), c(n1, n2)) %>% as.factor())
design <- model.matrix(~celltype, predictorData)
for (b1 in deltas)
{
ySim <- rnorm(nrow(predictorData) * nSim, sd = sd)
dim(ySim) <- c(nrow(predictorData), nSim)
ySim <- ySim + c(design %*% c(b0, b1))
ySim <- t(ySim)
### Fitting
fitAll <- limma::lmFit(ySim, design)
### Inference
varUnscaled <- c(t(L) %*% fitAll$cov.coefficients %*% L)
contrasts <- fitAll$coefficients %*% L
seContrasts <- varUnscaled^.5 * fitAll$sigma
tstats <- contrasts / seContrasts
pvals <- pt(abs(tstats), fitAll$df.residual, lower.tail = FALSE) * 2
i <- i + 1
powerFast[i, ] <- c(b1, n, mean(pvals < alpha))
}
}
powerFast
```
Because we have a simple 2 group comparison we can also calculate the power using the `power.t.test` function.
```{r}
power.t.test(n = 3, delta = lmRCB$coefficients["celltypeTreg"], sd = sqrt(varBetweenPlusWithin))
power.t.test(n = 7, delta = lmRCB$coefficients["celltypeTreg"], sd = sqrt(varBetweenPlusWithin))
```
## Power for randomized complete block design
```{r}
alpha <- 0.05
nSim <- 20000
b0 <- 0
sd <- sigma(lmRCB)
sdMouse <- sqrt(car::Anova(lmRCB)["mouse", "Sum Sq"] / car::Anova(lmRCB)["mouse", "Df"])
ns <- c(3, 7)
deltas <- lmRCB$coefficients["celltypeTreg"]
powerFastBlocking <- matrix(NA, nrow = length(ns) * length(deltas), ncol = 3) %>% as.data.frame()
names(powerFastBlocking) <- c("b1", "n", "power")
i <- 0
for (n in ns)
{
### Simulation
predictorData <- data.frame(celltype = rep(c("Tcon", "Treg"), each = n) %>% as.factor(), mouse = paste0("m", rep(1:n, 2)))
design <- model.matrix(~ celltype + mouse, predictorData)
L <- limma::makeContrasts("celltypeTreg", levels = colnames(design))
for (b1 in deltas)
{
ySim <- rnorm(nrow(predictorData) * nSim, sd = sd)
dim(ySim) <- c(nrow(predictorData), nSim)
mouseEffect <- rnorm(n, sd = sdMouse)
betasMouse <- mouseEffect[-1] - mouseEffect[1]
ySim <- ySim + c(design %*% c(b0, b1, betasMouse))
ySim <- t(ySim)
### Fitting
fitAll <- limma::lmFit(ySim, design)
### Inference
varUnscaled <- c(t(L) %*% fitAll$cov.coefficients %*% L)
contrasts <- fitAll$coefficients %*% L
seContrasts <- varUnscaled^.5 * fitAll$sigma
tstats <- contrasts / seContrasts
pvals <- pt(abs(tstats), fitAll$df.residual, lower.tail = FALSE) * 2
i <- i + 1
powerFastBlocking[i, ] <- c(b1, n, mean(pvals < alpha))
}
}
powerFastBlocking
```
Note that the power is indeed much larger for the randomized complete block design, both for the designs with 6 and with 14 mass spectrometer runs.
Because we have an RCB with a block size of 2 (a paired design), we can also calculate the power using the `power.t.test` function with `type = "one.sample"` and `sd` equal to the standard deviation of the difference.
```{r}
power.t.test(n = 3, delta = mean(mouseWide$delta), sd = sd(mouseWide$delta), type = "one.sample")
power.t.test(n = 7, delta = mean(mouseWide$delta), sd = sd(mouseWide$delta), type = "one.sample")
```
Note that the power is slightly different, because with the `power.t.test` function we conditioned on the mice from our study, while in the simulation study we generated data for new mice by simulating the mouse effect from a normal distribution.
## Impact of the amount of variability that the blocking factor explains on the power
We will vary the fraction of the variability that the block effect explains,
$$
\frac{\sigma^2_\text{between}}{\sigma^2_\text{between}+\sigma^2_\text{within}}=1-\frac{\sigma^2_\text{within}}{\sigma^2_\text{between}+\sigma^2_\text{within}}
$$
So in our example this is the ratio of the variability between the mice to the sum of the variability between and within the mice. Note that the within-mouse variability is the error variance of the RCB. The ratio for our experiment equals
```{r}
varBetweenPlusWithin <- sum(car::Anova(lmRCB, type = "III")[c("mouse", "Residuals"), "Sum Sq"]) / sum(car::Anova(lmRCB, type = "III")[c("mouse", "Residuals"), "Df"])
varWithin <- car::Anova(lmRCB)["Residuals", "Sum Sq"] / car::Anova(lmRCB)["Residuals", "Df"]
varBetweenPlusWithin
varWithin
1 - varWithin / varBetweenPlusWithin
```
```{r}
alpha <- 0.05
nSim <- 20000
b0 <- 0
varBetweenPlusWithin <- sum(car::Anova(lmRCB, type = "III")[c("mouse", "Residuals"), "Sum Sq"]) / sum(car::Anova(lmRCB, type = "III")[c("mouse", "Residuals"), "Df"])
ns <- c(3, 7)
deltas <- lmRCB$coefficients["celltypeTreg"]
fracVars <- seq(0, .95, .05)
powerFastBlockingLow <- matrix(NA, nrow = length(ns) * length(fracVars), ncol = 3) %>% as.data.frame()
names(powerFastBlockingLow) <- c("fracVars", "n", "power")
i <- 0
for (n in ns)
{
### Simulation
predictorData <- data.frame(celltype = rep(c("Tcon", "Treg"), each = n) %>% as.factor(), mouse = paste0("m", rep(1:n, 2)))
design <- model.matrix(~ celltype + mouse, predictorData)
L <- limma::makeContrasts("celltypeTreg", levels = colnames(design))
for (fracVar in fracVars)
{
sd <- sqrt(varBetweenPlusWithin * (1 - fracVar))
sdMouse <- sqrt(varBetweenPlusWithin * fracVar)
for (b1 in deltas)
{
ySim <- rnorm(nrow(predictorData) * nSim, sd = sd)
dim(ySim) <- c(nrow(predictorData), nSim)
mouseEffect <- rnorm(n, sd = sdMouse)
betasMouse <- mouseEffect[-1] - mouseEffect[1]
ySim <- ySim + c(design %*% c(b0, b1, betasMouse))
ySim <- t(ySim)
### Fitting
fitAll <- limma::lmFit(ySim, design)
### Inference
varUnscaled <- c(t(L) %*% fitAll$cov.coefficients %*% L)
contrasts <- fitAll$coefficients %*% L
seContrasts <- varUnscaled^.5 * fitAll$sigma
tstats <- contrasts / seContrasts
pvals <- pt(abs(tstats), fitAll$df.residual, lower.tail = FALSE) * 2
i <- i + 1
powerFastBlockingLow[i, ] <- c(fracVar, n, mean(pvals < alpha))
}
}
}
powerFastBlockingLow
```
```{r}
gg_color_hue <- function(n) {
hues <- seq(15, 375, length = n + 1)
hcl(h = hues, l = 65, c = 100)[1:n]
}
cols <- gg_color_hue(2)
powerFastBlockingLow %>%
as.data.frame() %>%
mutate(n = as.factor(n)) %>%
ggplot(aes(fracVars, power, group = n, color = n)) +
geom_line() +
geom_hline(yintercept = powerFast %>% filter(n == 3) %>% pull(power), color = cols[1]) +
annotate("text",
label = "CRD (n=3)",
x = 0.05, y = powerFast %>% filter(n == 3) %>% pull(power) + .02, size = 3, colour = cols[1]
) +
geom_hline(yintercept = powerFast %>% filter(n == 7) %>% pull(power), color = cols[2]) +
annotate("text",
label = "CRD (n=7)",
x = 0.05, y = powerFast %>% filter(n == 7) %>% pull(power) + .02, size = 3, colour = cols[2]
) +
xlab(expression(~ sigma[between]^2 / (sigma[between]^2 + sigma[within]^2))) +
geom_vline(xintercept = 1 - varWithin / varBetweenPlusWithin) +
xlim(0, 1)
```
- So if the variance that is explained by the block effect is small, you will lose power compared to the analysis with a CRD. Indeed, then
  - SSE does not reduce much and
  - $n_\text{blocks}-1$ degrees of freedom have been sacrificed.
- As soon as the block effect explains a large part of the variability, it is very beneficial to use a randomized complete block design!
- Note that the same number of mass spectrometry runs has to be done for both the RCB and the CRD design. However, for the RCB we only need half of the mice.
# Penicillin example
In the production of penicillin, corn steep liquor is used. Corn steep liquor is produced in blends and there is considerable variability between the blends. Suppose that
- four competing methods to produce penicillin (A-D) have to be evaluated,
- one blend is sufficient for four runs of a penicillin batch reactor and
- 20 runs can be scheduled for the experiment.
How would you assign the methods to the blends? One possible randomisation is sketched below; the design used in the `faraway` data follows.
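One possible way to randomise in R (a sketch, not necessarily the randomisation that was actually used): randomise the order of the four methods within each blend.
```{r}
# Sketch: randomise the four methods (A-D) within each of the 5 blends
set.seed(42)
randomisation <- replicate(5, sample(LETTERS[1:4]))
colnames(randomisation) <- paste0("Blend", 1:5)
randomisation
```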
```{r}
data(penicillin, package = "faraway")
table(penicillin$blend, penicillin$treat)
```
## Data
```{r}
head(penicillin)
matrix(penicillin$yield, nrow = 5, ncol = 4, byrow = TRUE, dimnames = list(levels(penicillin$blend), levels(penicillin$treat)))
```
```{r}
penicillin %>%
ggplot(aes(x = blend, y = yield, group = treat, color = treat)) +
geom_line() +
geom_point()
```
```{r}
penicillin %>%
ggplot(aes(x = treat, y = yield, group = blend, color = blend)) +
geom_line() +
geom_point()
```
## Analysis
We analyse the yield using
- a factor for blend and
- a factor for treatment.
```{r}
lmPen <- lm(yield ~ treat + blend, data = penicillin)
plot(lmPen)
car::Anova(lmPen, type = "III")
```
We conclude that the effect of the treatment on the penicillin yield is not significant at the 5% level of significance (p = `r car::Anova(lmPen,type="III")["treat","Pr(>F)"] %>% round(.,2)`).
We also observe that there is a large effect of the blend on the yield.
Blend explains about `r round(car::Anova(lmPen,type="III")["blend","Sum Sq"]/sum(car::Anova(lmPen,type="III")[-1,"Sum Sq"])*100,1)`% of the variability in the penicillin yield.
# Pseudo-replication
A study on the facultative pathogen Francisella tularensis was conducted by Ramond et al. (2015) [12].
- F. tularensis enters the cells of its host by phagocytosis.
- The authors showed that F. tularensis is arginine deficient and imports arginine from the host cell via an arginine transporter, ArgP, in order to efficiently escape from the phagosome and reach the cytosolic compartment, where it can actively multiply.
- In their study, they compared the proteome of wild type F. tularensis (WT) to ArgP-gene deleted F. tularensis (knock-out, D8).
- The sample for each bio-rep was run in technical triplicate on the mass-spectrometer.
- We will use data for the protein 50S ribosomal protein L5 [A0Q4J5](https://www.uniprot.org/uniprot/A0Q4J5)
## Data exploration
```{r}
franc <- read_tsv("https://raw.githubusercontent.com/statOmics/PSLSData/main/francisellaA0Q4J5.txt")
franc
```
```{r}
franc %>%
ggplot(aes(biorep, intensityLog2, color = genotype)) +
geom_point()
```
- Response?
- Experimental unit?
- Observational unit?
- Factors?
$\rightarrow$ Pseudo-replication: the randomisation was done at the level of the bio-repeat, and each bio-repeat was measured in technical triplicate.
$\rightarrow$ If we were to analyse the data using a linear model based on each measured intensity, we would act as if we had sampled 18 bio-repeats.
$\rightarrow$ The effect of interest has to be assessed between bio-repeats, so a block analysis is not possible!
If the same number of pseudo-replicates/technical replicates is available for each bio-repeat:
- first average the technical replicates within each bio-repeat to obtain independent measurements,
- these averages will then have the same precision,
- assess the effect of the treatment using the averages.
- **Caution: never summarize over different bio-repeats/experimental units, only over technical replicates within an experimental unit.**
```{r}
lmBiorep <- lm(intensityLog2 ~ -1 + biorep, franc)
lmBiorep
```
```{r}
francSum <- data.frame(genotype = rep(c("D8", "WT"), each = 3) %>% as.factor() %>% relevel("WT"), intensityLog2 = lmBiorep$coef)
francSum
```
```{r}
lmSum <- lm(intensityLog2 ~ genotype, francSum)
summary(lmSum)
```
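As a side note, the per-bio-repeat averages can equivalently be computed with dplyr (a sketch using the column names in `franc`; the `relevel` call mirrors the reference level chosen above):
```{r}
# Equivalent summarisation: average the technical replicates within each bio-repeat
francSumAlt <- franc %>%
  group_by(genotype, biorep) %>%
  summarize(intensityLog2 = mean(intensityLog2), .groups = "drop") %>%
  mutate(genotype = relevel(as.factor(genotype), ref = "WT"))
lm(intensityLog2 ~ genotype, francSumAlt) %>% summary()
```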
## Wrong analysis
```{r}
lmWrong <- lm(intensityLog2 ~ genotype, franc)
summary(lmWrong)
```
Note that the analysis that ignores the multiple technical repeats per bio-repeat returns results that are much more significant, because we act as if we had many more independent observations.
### No Type I error control!
```{r}
sigmaWithin <- sigma(lmBiorep)
sigmaBetween <- sigma(lmSum)
xBiorep <- model.matrix(~ -1 + biorep, franc)
xWrong <- model.matrix(~genotype, franc)
set.seed(2523)
nSim <- 1000
resWrong <- matrix(NA, nSim, 4) %>% as.data.frame()
names(resWrong) <- c("Estimate", "Std. Error", "t value", "pvalue")
resCorrect <- resWrong
genotype <- franc$genotype
genotypeSum <- francSum$genotype
biorep <- franc$biorep
for (i in 1:nSim)
{
biorepSim <- rnorm(ncol(xBiorep), sd = sigmaBetween)
ySim <- xBiorep %*% biorepSim + rnorm(nrow(xBiorep), sd = sigmaWithin)
ySum <- lm(ySim ~ -1 + biorep)$coefficient # per-bio-repeat means, as in lmBiorep
resWrong[i, ] <- summary(lm(ySim ~ genotype))$coefficient[2, ]
resCorrect[i, ] <- summary(lm(ySum ~ genotypeSum))$coefficient[2, ]
}
mean(resCorrect$pvalue < 0.05)
mean(resWrong$pvalue < 0.05)
```
```{r}
qplot(resCorrect$pvalue, geom = "histogram", boundary = c(0, 1)) +
stat_bin(breaks = seq(0, 1, .1)) +
xlab("pvalue") +
ggtitle("Correct analysis")
qplot(resWrong$pvalue, geom = "histogram", boundary = c(0, 1)) +
stat_bin(breaks = seq(0, 1, .1)) +
xlab("pvalue") +
ggtitle("Wrong analysis")
```
- So we observe that the analysis that does not account for pseudo-replication is too liberal!
- The analysis that first summarizes over the technical repeats leads to correct p-values and correct type I error control!
- When you have an unequal number of technical repeats, more advanced methods are required,
  - e.g. mixed models.
  - The mixed model framework can model the correlation structure in the data.
  - Mixed models can also be used when inference both between and within blocks is needed, e.g. in split-plot designs.
  - But mixed models are beyond the scope of this lecture series.
# Nature methods: Split-plot designs
[Nature Methods - Point of Significance - Split-plot Designs](https://www.nature.com/articles/nmeth.3293.pdf)