---
title: 'Good scientific practice: <a href="https://www.huber.embl.de/msmb/Chap-Design.html">Design of High Throughput Experiments and their Analysis</a>'
author: Susan Holmes, Wolfgang Huber
date: 2023-01-17
date-format: iso
format:
  revealjs:
    theme: [default, GHGA.scss]
    logo: fig/GHGA_short_Logo_orange.png
    transition: slide
    scrollable: true
    slide-number: c/t
    show-slide-number: all
    auto-stretch: false
    # auto-stretch is a huge source of grief for slide layout, see https://quarto.org/docs/presentations/revealjs/advanced.html --> stretch
  html:
    code-line-numbers: false
execute:
  echo: false
  warning: true
  error: false
  message: false
slide-level: 1
bibliography: refs.bib
editor_options:
  chunk_output_type: console
---
# Why are we here?
```{r initialize}
## Set default options for code chunks
knitr::opts_chunk$set(cache = TRUE, autodep = TRUE)
set.seed(0xbedada)
```
::: {layout-ncol=2}
![](fig/Youngronaldfisher2.JPG){width="40%" fig-align="center"}
<font size=+2>
*To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.*<br>
Presidential Address to the First Indian Statistical Congress, 1938. Sankhya 4, 14-17.
</font>
R.A. Fisher, Pioneer of Statistics and Experimental Design
:::
# Design of High Throughput Experiments
::: {layout="[[70,30]]"}
This lecture is based on Chapter 13 of the book<br>
*Modern Statistics for Modern Biology*<br>
by Susan Holmes, Wolfgang Huber<br>
[https://www.huber.embl.de/msmb/](https://www.huber.embl.de/msmb/)
![](fig/book-cover-photo.jpg){width="10%" fig-align="center"}
:::
# Topics
- Resource allocation & experimental design: a matter of trade-offs & good judgement
- Dealing with the different types of variability; partitioning variability
- Experiments, studies, ...
- Power, sample size and efficiency.
- Transformations
- Things to worry about: dependencies, batch effects, unwanted variation.
- Compression, redundancy and sufficiency
- Computational best practices
![](fig/Garten-der-Lueste-mitte-oben.png){width="100%"}
# The Art of "Good Enough"
- Experimental design rationalizes the tradeoffs imposed by having finite resources.
- Our measurement instruments have limited resolution and precision; often we don't know these at the outset and have to collect preliminary data providing estimates.
- Sample sizes are limited for practical, economic, and sometimes ethical reasons.
- We may only be able to observe the phenomenon of interest indirectly rather than directly.
- Our measurements may be overlaid with nuisance factors over which we have limited control.
- There is little point in prescribing unrealistic ideals: we need to make pragmatic
choices that are feasible.
# Accuracy vs precision
![](fig/accuracy-vs-bias.png)
# Types of studies / experiments
## Experiment
- everything is exquisitely controlled
## Retrospective observational studies
- we take what we get, opportunistic; no control over study participants, assignment of important factors, confounding
## Prospective, controlled studies
- e.g., clinical trials
- randomization, blinding
- ethical as well as practical constraints (money, time).
## Meta-analysis
- We did not design the experiments or studies ourselves, nor collect the data.
- Retrospective analysis of data that already happen to exist.
# Illustration: experiment
Well-characterized cell line growing in laboratory conditions on defined media, temperature
and atmosphere.
We administer a precise amount of a drug, and after
72h we measure the activity of a specific pathway reporter.
![](fig/blog-banner_2017_feb_lab-comfort_3840x1200-2560x800.jpg)
# Illustration: challenges with studies
::: {.columns}
::: {.column width="60%"}
We recruit patients with a chronic disease and who fulfill
inclusion criteria, and we ask them to take a drug each day exactly at 6 am.
Before, and after 3 months, we perform a diagnostic test.
- People may forget to take the pill or take it at the wrong time.
- Some may feel that the disease got worse and stop taking the drug.
:::
::: {.column width="40%"}
![](fig/Cutting-stone-of-madness.png)
:::
:::
- Some may feel that the disease got better and stop taking the drug.
- Some may lead a healthy life-style, others drink, smoke and eat junk food.
- Maybe it's not one disease but several.
- And all of these factors may be correlated with each other in complex ways.
What to do about this?
::: {.fragment}
- need much larger sample sizes than with an experiment
- randomization, blinding, documentation
:::
# Biological versus technical replicates
Imagine we want to test whether a weight loss drug works. Which of the
following designs is better?
- A person is weighed on milligram precision scales, with 20 replicates. He/she follows the diet, and four weeks later, (s)he is weighed again, with 20 replicates. The data are recorded by a specially trained technician.
- 20 people weigh themselves on their bathroom scales and self-report the number. Four weeks later, they weigh themselves and report again.
Some design decisions in HT biology are similar, if more subtle:
- sequencing libraries
- cell lines
- CRISPR guides
- drugs
# Quiz
For reliable variant calling with current sequencing technology, you need ca. $30\times$ coverage for a human genome.
In the *1000 Genomes Project*, the average depth of the data produced was 5.1 for 1092 individuals. Why was that study design chosen?
$\quad$
$\quad$
![](fig/Garten-der-Lueste-mitte-unten.png){width="80%" fig-align="center"}
# What exactly is a technical versus biological replicate?
In fact, the terminology is too simplistic. Variation can occur at many different levels:
- different CRISPR guide RNAs for the same target gene
- different drugs for the same target protein
- different cell line (organoid, animal) models for the same biological system
- different measurement technologies (e.g., multiplexed immunohistochemistry vs imaging mass cytometry)
- different machines for the same technology
- different variants of the protocol
- different technician, lab
We can use [blocking](https://en.wikipedia.org/wiki/Blocking_(statistics)) and the design-matrix machinery of linear models to record and account for such variation.
To the extent that we know about such nuisance factors and have kept track of them, we can (should) include them explicitly as bias terms in our analysis models.
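A minimal sketch of what "include them explicitly as bias terms" looks like in R; the factor names (`condition`, `batch`) and their levels are made-up toy values:

```{r}
#| label: designmatrixsketch
#| echo: TRUE
## Hypothetical toy factors: a known nuisance factor ("batch") recorded as a
## blocking term in the design matrix of a linear model
condition = factor(rep(c("ctrl", "treated"), each = 4))
batch     = factor(rep(c("day1", "day2"), times = 4))
model.matrix(~ condition + batch)  # batch appears as an explicit bias column
```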
If we did not keep track, we can try to use latent factor models (SVA, PEER, RUV) to identify them from the data.
# Error models: Noise is in the eye of the beholder
The efficiency of most biochemical or physical processes involving nucleic acid polymers depends on their sequence content, for instance, occurrences of long homopolymer stretches, palindromes, GC content.
These effects are not universal, but can also depend on factors like concentration, temperature, which enzyme is used, etc.
When looking at RNA-Seq data, should we treat GC content as noise or as bias?
![](fig/portrait-of-shi-xie-and-his-studio.png){width="80%" fig-align="center"}
# One person's noise can be another's bias
::: {.columns}
::: {.column width="10%"}
![](fig/HeadsorTails.jpg){fig-align="right" width="70%"}
:::
::: {.column width="90%"}
We may think that the outcome of tossing a coin is completely random.
:::
:::
If we meticulously registered the initial conditions of the
coin flip and solved the mechanical equations, we could predict which
side has a higher probability of coming up: noise becomes bias.
<div style="text-align: center;">
![Tosser](fig/cointosser3.png){width="50%"}
[P. Diaconis, S. Holmes, et al.](https://news.stanford.edu/pr/2004/diaconis-69.html)
</div>
We use mathematical and probabilistic modelling as a method to bring some order into our ignorance---but keep in mind that such models are an expression of our subjective
ignorance, not an objective property of the world.
# A lack of units
::: {.columns}
::: {.column width="60%"}
Pre-modern measurement systems measured lengths in feet, ells, inches (first joint of an index finger), weights in stones, volumes in multiples of the size of a wine jar, etc.
In the [International System of Units](https://en.wikipedia.org/wiki/International_System_of_Units), meters, seconds, kilograms, amperes, ... are defined based on universal physical constants.
:::
::: {.column width="40%"}
![](fig/2420_titelblatt_mathematikbuch.jpg)
:::
:::
There is no reliance on artefacts, and a meter measured by a lab in Canada using one instrument has the same meaning as a meter measured a year later by a lab in Heidelberg using a different instrument, by a space probe in the Kuiper belt, or by an extraterrestrial intelligence on Proxima Cen b.
Measurements in biology are, unfortunately, rarely that comparable.
Often, absolute values are not reported (these would require units), but only fold changes with regard to some local reference.
Even when absolute values exist (e.g., read counts in an RNA-Seq experiment) they usually do not translate into universal units such as molecules per cell or mole per milliliter.
# Normalization
Dealing with the lack of units in many biological assays
- try to make measurements comparable to each other using some 'arbitrary' but at least consistent units
- only consider endpoints where the units cancel out (e.g., fold change)
::: {.fragment}
Examples
- Sequence counts
- Mass spec intensities
- Fluorescence intensities (e.g. CODEX)
:::
::: {.fragment}
Btw, terrible name. It has nothing to do with normal distribution, vector norm, or right angles.
Perhaps better would be "calibration".
:::
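As a deliberately simplistic sketch of the first strategy: scaling a count matrix to "counts per million" puts samples on arbitrary but consistent units. The toy matrix below is made up; in practice, more robust size-factor estimators (e.g., median-of-ratios) are preferable.

```{r}
#| label: cpmsketch
#| echo: TRUE
## Toy sketch: "counts per million" scaling of a hypothetical count matrix
counts = matrix(rpois(20, lambda = 100), nrow = 5,
  dimnames = list(paste0("gene", 1:5), paste0("sample", 1:4)))
cpm = t(t(counts) / colSums(counts)) * 1e6
```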
# What is a good normalization method?
:::{.columns}
:::{.column width="80%"}
![](fig/normalization.png){fig-align="center"}
:::
:::
- If the normalization is 'off', we can have increased variability between replicates
- ... and/or apparent systematic differences between different conditions that are not real
- $\to$ false positives, false negatives
What do we want from a good normalization method:
- remove technical variation
- but *keep* biological variation
Possible figure of merit?
- signal-to-noise ratio, computed from pos. and neg. controls, replicates
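One concrete candidate, sketched below with simulated controls (the helper `zprime` is ours): the Z'-factor widely used in high-throughput screening, computed from positive and negative controls.

```{r}
#| label: zprimesketch
#| echo: TRUE
## Sketch: Z'-factor from positive and negative controls; values close to 1
## indicate good separation of signal from noise
zprime = function(pos, neg)
  1 - 3 * (sd(pos) + sd(neg)) / abs(mean(pos) - mean(neg))
zprime(pos = rnorm(32, mean = 10), neg = rnorm(32, mean = 2))
```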
# Occam's razor
::: {.columns}
::: {.column width="40%"}
![William of Ockham](fig/William_of_Ockham.png)
:::
:::{.column width="60%"}
If one can explain a phenomenon without assuming this or that hypothetical entity, there is no ground for assuming it.
One should always opt for an explanation in terms of the fewest possible causes, factors, or variables.
:::
:::
# Regular and catastrophic noise
::: {.columns}
::: {.column width="60%"}
Regular noise: can be modeled by simple probability models (independent normal, Poisson distributions; or mixtures such as
Gamma-Poisson or Laplace)
Can use relatively straightforward methods to take regular noise into account
Real world is more complicated: measurements can be completely off the scale (sample swap, contamination, software bug, ...)
Multiple errors often come together: e.g., a whole microtiter plate went bad, affecting all data measured from it.
:::
::: {.column width="40%"}
![](fig/Train_wreck_at_Montparnasse_1895.jpg)
:::
:::
Such events are hard to model or even correct for---our best chance is data quality assessment, outlier detection and documented removal.
# RNAi screen example
<div style="text-align: center;">
![](fig/RNAiScreen.png){width="55%"}
</div>
# Keeping track: Dailies
::: {.columns}
::: {.column width="30%"}
![](fig/dailies_icon.png){fig-align="right" width="50%"}
:::
::: {.column width="70%"}
A film director will view daily takes, to correct potential lighting, shooting issues before they affect too much footage. It is a good idea not to wait until all the runs of an experiment have been finished before looking at the data.
:::
:::
Intermediate data analyses and visualizations catch unexpected sources of variation in the data and enable you to adjust the protocol.
It is important to be aware of sources of variation as they occur and adjust for them.
# Vary one factor at a time
::: {.columns}
::: {.column width="33%"}
![Ibn Sina (Avicenna)](fig/Avicenna.png){width="100%"}
:::
::: {.column width="33%"}
![](fig/medicus.jpeg){height="75%"}
:::
::: {.column width="30%"}
[Ibn Sina's](https://en.wikipedia.org/wiki/Avicenna) *Canon of Medicine (1020)* lists seven rules of experimental design, including the need for controls and replication, the danger of confounding and the necessity of changing only one factor at a time.
:::
:::
::: {.fragment}
This dogma was overthrown in the 20th century by RA Fisher.
:::
# A toy example: two-group comparison
```{r}
#| label: illustrateboxplots
#| echo: false
library("gridExtra")
library("ggplot2")
library("dplyr")
n1 = 6L
n2 = 24L
df0 = tibble(
condition = factor(paste("condition", rep(c(1, 2), n1))),
batch = factor(paste( rep(c(1, 2), n1))),
y = c(1.5,2)[as.integer(condition)] + rnorm(2*n1, 0, 0.1)
)
df1 = tibble(
condition = factor(paste("condition", rep(c(1, 2), n1))),
batch = factor(seq_len(2*n1) %% 3),
y = c(1.5,2)[as.integer(condition)] + c(-0.5, 0, 0.5)[as.integer(batch)] + rnorm(2*n1, 0, 0.3)
)
df2 = tibble(
condition = factor(paste("condition", rep(c(1, 2), n2))),
batch = factor(seq_len(2*n2) %% 3),
y = c(1.5,2)[as.integer(condition)] + c(-0.5, 0, 0.5)[as.integer(batch)] + rnorm(2*n2, 0, 0.3)
)
myplot = function(x)
ggplot(x, aes(x = condition, y = y)) +
geom_boxplot(alpha = 0.5, col = "blue") +
geom_point(size = 2, alpha = 0.75) + xlab("")
```
```{r p0p0eff, fig.width = 4, fig.height = 2, out.width="80%"}
p0 = myplot(df0)
ms0 = summarize(group_by(df0, condition), m = median(y))
p0effect = p0 + geom_segment(
data = ms0, aes(y = m[1], yend = m[2]),
x = 1.5, xend = 1.5, col="red",
arrow = arrow(length = unit(0.25, "cm"), ends = "both", type = "closed"))
grid.arrange(p0, p0effect, ncol = 2)
```
However, suppose the experiment was done in two "batches", and we color the data accordingly:
```{r p0p0batch, fig.width = 5, fig.height = 2, out.width="80%"}
p0batch = p0 + aes(x = condition, y = y, color = batch)
grid.arrange(p0, p0batch, ncol = 2, widths = c(0.44, 0.56))
```
We cannot draw a conclusion, because of **confounding**: condition and batch are inseparable.
Now suppose there is no such fatal confounding, but an unknown batch effect that is balanced between the conditions,
and effectively causes higher noise levels.
Then, the previous sample size (2 x `r n1`) is not enough. With 2 x `r n2`, it looks better:
```{r, fig.width = 4, fig.height = 2, out.width="80%"}
p1 = myplot(df1)+ ylim(c(0, 3.1))
p1batch = p1 + aes(x = condition, y = y, color = batch)
p2 = myplot(df2)
grid.arrange(p1 + xlab(paste("p =", format.pval(t.test(y ~ condition, data = df1)$p.value, digits = 2))),
p2 + xlab(paste("p =", format.pval(t.test(y ~ condition, data = df2)$p.value, digits = 2))),
ncol = 2)
```
# Lessons from the toy example
- Success of an experiment: detect a difference iff it is truly there, and for the right reasons
- Spread (sd, var) matters as much as the effect size
Whether we succeed depends on:
1) Effect size (**unchangeable**)
2) Control and documentation of all factors (block effects, date / time / operator etc.).
3) Noise (irreducible variability in measurements): $\sigma$
4) Sample sizes: remember standard error of mean is $\sim\frac{\sigma}{\sqrt{n}}$
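The interplay of points 1), 3) and 4) can be explored with base R's `power.t.test`; the numbers below loosely echo the toy example (effect size 0.5, $\sigma = 0.3$) and are for illustration only.

```{r}
#| label: powersketch
#| echo: TRUE
## How many samples per group for 90% power at effect size 0.5 and sd 0.3?
power.t.test(delta = 0.5, sd = 0.3, power = 0.9)
## Conversely: what power do we get with n = 6 per group?
power.t.test(n = 6, delta = 0.5, sd = 0.3)
```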
# Decomposition of variability: analysis of variance (ANOVA)
::: {layout-ncol=1}
```{r}
#| label: block1
#| fig-width: 5
#| fig-height: 2
#| out.width: "100%"
grid.arrange(p1, p1batch, ncol=2, widths = c(0.44, 0.56))
```
:::
```{r}
#| label: block2
#| echo: TRUE
#| code-line-numbers: FALSE
lm(y ~ condition, data = df1) |> anova()
```
$\quad$
```{r}
#| label: block3
#| echo: TRUE
#| code-line-numbers: FALSE
lm(y ~ condition + batch, data = df1) |> anova()
```
# Blocking: the case of paired experiments.
::: {.columns}
::: {.column width="40%"}
![ZeaMays](fig/maizeDarwin.png){width="100%"}
:::
::: {.column width="55%"}
Darwin's *Zea Mays* experiment:
```{r}
#| echo: !expr 2
data(darwin.maize, package= "agridat")
head(darwin.maize)
with(darwin.maize, table(type, pair))
```
- 15 pairs of plants, each with two different pollination methods, across 4 pots.
- a **balanced design**: all the different factor combinations have the same number of observation replicates.
- particularly easy to analyse
:::
:::
## Paired t-test is equivalent to a 2-way ANOVA
```{r}
#| label: zeamays
#| echo: TRUE
#| code-line-numbers: FALSE
with(darwin.maize,
t.test(height[type == "self"], height[type == "cross"], paired = TRUE))
fit = lm(height ~ type + pair, data = darwin.maize)
anova(fit)
coef(fit)["typeself"]
```
# t-test comments
The proof that the $t$-statistic follows a certain $t$-distribution assumes that observations are independent and follow a normal distribution: this is a sufficient, but not necessary, condition.
<span style="color:red">Deviation from normality</span> (heavier tails): test typically maintains type-I error control (it may lose some power).
Options: use simulations or permutations; use a different test (e.g., Wilcoxon)
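A minimal sketch of the permutation option (the helper `perm_test` is hypothetical, not from the book): re-randomize the group labels to obtain the null distribution of the test statistic.

```{r}
#| label: permsketch
#| echo: TRUE
## Hypothetical helper: permutation test for a two-group difference in means
perm_test = function(y, group, B = 10000) {
  stat  = function(g) abs(diff(tapply(y, g, mean)))
  tobs  = stat(group)
  tperm = replicate(B, stat(sample(group)))
  (sum(tperm >= tobs) + 1) / (B + 1)  # +1 avoids p-values of exactly 0
}
## e.g., perm_test(df1$y, df1$condition)
```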
<span style="color:red">Deviation from independence</span>: type-I error control is lost, p-values will likely be totally wrong (e.g., for positive correlation, too optimistic).
No easy options:
- try to model the dependence (multivariate analysis)
- remove it
- empirical null (Efron et al.)
[Demo shiny app](http://127.0.0.1:5794)
# "Block what you can, randomize what you cannot"
(George Box, 1978)
Often we don't know which nuisance factors will be important, or we
cannot plan for them ahead of time.
In such cases, randomization is a practical strategy: at least in the limit of large enough sample size, the effect of any nuisance factor should average out.
## Randomized Complete Block Design
![](fig/CompleteRB3.png){width="50%"}
"The design space is divided into uniform units to account for any variation so that observed differences within units are largely due to true differences between treatments. Treatments are then assigned at random to the subjects in the blocks - once in each block. The defining feature of the Randomized Complete Block Design is that each block sees each treatment exactly once." ([Trudi Grant](https://pbgworks.org/sites/pbgworks.org/files/RandomizedCompleteBlockDesignTutorial.pdf))
## Randomization decreases bias
- Humans are bad at assigning treatments truly at random
- Random assignment reduces unconscious bias
(special samples treated differently, "balancing things out", ...)
- Randomization also helps with unknown nuisance factors.
## Randomization helps inference
- if the sample is randomly generated from a population, we can infer
something about the population we drew from
## Random does not mean haphazardly
- Need to use a random number generator and a seed.
There is also rich literature on deterministic, balanced block designs.
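A minimal sketch of randomized, not haphazard, assignment for a complete block design, using a seeded random number generator; the block and treatment labels are made up.

```{r}
#| label: rcbdsketch
#| echo: TRUE
## Sketch: randomized complete block design, 4 blocks x 3 treatments;
## each block sees each treatment exactly once, the seed makes it reproducible
set.seed(0xbedada)
design = data.frame(
  block     = rep(paste0("block", 1:4), each = 3),
  treatment = unlist(lapply(1:4, function(i) sample(c("A", "B", "C"))))
)
design
```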
# Thinking of computational methods like we do of biological assays
Many biological assays are so fiddly and fickle that we always use them together with positive and negative controls
(Western blots, CRISPR, PCR...)
With sufficiently complicated computational analyses, it can be worthwhile to think of them in the same way.
# How many replicates do I need?
In a well-controlled experiment: rule of thumb, $n \le 3$ can suffice.
In a study: dozens, hundreds.
If a factor of interest is continuous (e.g. time, temperature, concentration) and sampled at many points, there is usually no need for replication within its levels.
# Effective sample size for dependent data.
A sample of independent observations is more informative than the same number of dependent observations. $\Rightarrow$
Dependent data require **larger** samples. Intuition: each observation contains **less** information.
Consider an opinion poll. We knock at people's doors and ask them a question.
- Scenario 1: pick $n$ people at $n$ random places throughout the country.
- Scenario 2: to save travel time, you pick $n/3$ random places and then at each of these interview three people who live next door to each other.
In both cases, the number of people polled is $n$.
But if we assume that people living in the same neighborhood are more likely to have similar opinions, the data from Scenario 2 are (positively) correlated.
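A standard back-of-the-envelope correction for this (a sketch, assuming equal cluster size $m$ and constant intra-cluster correlation $\rho$) is the Kish design effect:
\begin{equation}
n_{\text{eff}} = \frac{n}{1 + (m-1)\,\rho}
\end{equation}
With $m = 3$ neighbours per place and, say, $\rho = 0.5$, the $n$ interviews of Scenario 2 carry only as much information as $n/2$ independent ones.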
```{r effective_sample_size_sim, fig.width = 4, fig.height = 2.5, out.width = "80%", message = FALSE, fig.cap = "Density estimates for the polling result using the two sampling methods. The correlated method has higher spread. The truth is indicated by the vertical line."}
library("tidyr")
doPoll = function(n = 100, numPeoplePolled = 12) {
opinion = sort(rnorm(n))
i1 = sample(n, numPeoplePolled)
i2 = sample(seq(3, n, by = 3), numPeoplePolled / 3)
i2 = c(i2, i2 - 1, i2 - 2)
c(independent = mean(opinion[i1]), correlated = mean(opinion[i2]))
}
responses = replicate(5000, doPoll())
ggplot(pivot_longer(as_tibble(t(responses)), cols = rownames(responses)),
aes(x = value, col = name)) + geom_density() +
geom_vline(xintercept = 0) + xlab("Opinion poll result")
```
Simulation: "opinions" in a population of `r formals(doPoll)$n` people are $N(0,1)$. We want to estimate the mean by sampling. In scenario 1, we ask `r formals(doPoll)$numPeoplePolled` randomly selected people. In scenario 2, we ask `r formals(doPoll)$numPeoplePolled / 3` randomly selected people as well as two of their direct neighbours.
<!--We model the spatio-sociological structure of our country by sorting the houses from most negative to most positive opinion in the first line of the `doPoll` function.-->
## Time course: equalize the dependence structure by taking more frequent samples when there is more "going on".
![Response to a transient perturbation](fig/SamplingDesignSmall.png){fig-align="center" width="60%"}
# Data transformations
\begin{equation}
X_{\text{obs}} = f(\theta) + \varepsilon
\end{equation}
- Homoskedastic: $\quad\text{sd}(\varepsilon)=\text{const.}$
- Heteroskedastic: e.g.,
+ multiplicative error: $\quad\text{sd}(\varepsilon)\sim f(\theta)$
+ Poisson noise: $\quad\text{var}(\varepsilon)\sim f(\theta)$
![](fig/delta_method_modified.png){fig-align="center" width="60%"}
- multiplicative error: $\log(X)$
- Poisson noise: $\sqrt X$
Examples:
- Fluorescence based assays: additive + multiplicative [@RockeDurbin:2001]
- RNA-Seq: Poisson + multiplicative.
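A quick simulation (ours, for illustration) of why $\sqrt{X}$ stabilizes Poisson noise: the raw variance grows with the mean, while after the square root it is approximately constant for larger means.

```{r}
#| label: varstabsketch
#| echo: TRUE
## Variance of Poisson counts grows with the mean ...
lambda = c(1, 10, 100, 1000)
x = sapply(lambda, function(l) rpois(1e4, l))
apply(x, 2, var)
## ... but is approximately constant (~1/4) after sqrt, for larger lambda
apply(sqrt(x), 2, var)
```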
# Data quality assessment and quality control
- QA: measure and monitor data quality
- QC: remove bad data
Pervade all phases of analysis: raw data, transformation, summarization, model fitting, hypothesis testing, screening for ''hits''. Diagnostics and metrics:
- Feature value range (Boxplot, ECDF) across samples
- Joint distributions (scatter plots, heatmaps, sample-sample or feature-feature correlation matrices)
- Correlation of replicates vs between biological conditions
- Evidence of batch effects (categorical, continuous), explicitly known or latent.
![](fig/mv-v19-2274-Fig2.png){fig-align="center" width="80%"}
- [iSEE](https://doi.org/doi:10.18129/B9.bioc.iSEE) package:
[PBMC4k example](https://marionilab.cruk.cam.ac.uk/iSEE_pbmc4k),
[CYTOF example](https://marionilab.cruk.cam.ac.uk/iSEE_cytof)
- [MatrixQCvis](https://doi.org/doi:10.18129/B9.bioc.MatrixQCvis) package
::: {.columns}
::: {.column width="50%"}
![](fig/1896_Ford_Quadricycle.jpeg){fig-align="left"}
:::
::: {.column width="50%"}
Henry Ford (possibly apocryphal): ["If I had asked people what they wanted, they would have said faster horses."](https://corporate.ford.com/history.html): quality as **fitness for purpose**, versus **adherence to specifications**.
Quality is not easy to nail down (or mathematically define); the word is used with many meanings. See also http://en.wikipedia.org/wiki/Quality_(business)
:::
:::
# Longitudinal Data
Longitudinal data have time as a covariate.
- a handful of time points (e.g. response of a cell line measured 48h, 72h and 84h after exposure to a drug) vs a long and densely sampled time series (e.g. patch clamp data in electrophysiology or a movie from live-cell microscopy).
- in the 1st case, think of time as just another experimental factor, like concentration or choice of drug. One analysis strategy could be to first identify the "best", or biologically most indicative, time point, and then focus on that. Or ask whether there is any effect at all, regardless of the time.
- in the 2nd case, a time series, we'll often want to fit dynamical models to the data. We have many choices (see the sketch after this list):
- (Hidden) Markov models
- Change point detection
- Ordinary differential equations or reaction-diffusion models
- Piece-wise deterministic stochastic processes
- Autoregressive models
- Non-parametric smoothing followed by clustering or classification into prototypic shapes
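A minimal sketch of the last option, on simulated data (ours): non-parametric smoothing with `loess`, a typical first step before clustering trajectories into prototypic shapes.

```{r}
#| label: loesssketch
#| echo: TRUE
## Sketch: loess smoothing of a simulated dense time series
tp  = seq(0, 10, by = 0.05)
y   = sin(tp) + rnorm(length(tp), sd = 0.3)
fit = loess(y ~ tp, span = 0.2)
plot(tp, y, col = "grey")
lines(tp, predict(fit), col = "red", lwd = 2)
```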
# Don't Pretend You Are Dumb
- Pros and cons of "unbiased" approaches
Examples include:
- Clustering vs (semi-)supervised classification
- Multiple testing / hypothesis weighting ($\to$ independent hypothesis weighting, et al.)
- Feature scaling
# Dichotomization is bad
Grouping / discretization of continuous variables
::: {.columns}
::: {.column width="50%"}
![](fig/Garten-der-Lueste-rechts-unten.png){width="100%"}
:::
::: {.column width="50%"}
Dichotomizing continuous predictors in multiple regression: a bad idea<br>
[Royston, Altman, Sauerbrei<br>
Statistics in Medicine 2006](https://doi.org/10.1002/sim.2331)
:::
:::
# Leaky pipelines and statistical sufficiency
::: {.columns}
::: {.column width="40%"}
![](fig/leakypipeline.png){width="70%"}
:::
::: {.column width="55%"}
Data analysis pipelines in high-throughput biology often work as **funnels** that successively summarise and compress the data. In high-throughput sequencing, we may start with individual sequencing reads, then align them to a reference, then only count the aligned reads for each position, summarise positions to genes (or other kinds of regions), then "normalize" these numbers by library size, etc.
:::
:::
At each step, we lose some information, and it is important to make sure we still have enough information for the task at hand. The problem is particularly acute if we use a data pipeline built from a series of separate components without enough care being taken to ensure that all the information necessary for later steps is preserved.
Statisticians have a concept for whether certain summaries enable the
reconstruction of all the relevant information in the data:
**sufficiency**. E.g., in a Bernoulli random experiment with a known
number of trials, $n$, the number of successes is a sufficient statistic
for estimating the probability of success $p$.
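A tiny numerical illustration (ours): two Bernoulli datasets with the same number of successes yield identical likelihood functions for $p$, so nothing beyond the count is needed.

```{r}
#| label: sufficiencysketch
#| echo: TRUE
## Two orderings, same number of successes (k = 3 out of n = 10)
x1 = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
x2 = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1)
loglik = function(p, x) sum(dbinom(x, size = 1, prob = p, log = TRUE))
all.equal(loglik(0.3, x1), loglik(0.3, x2))  # TRUE: likelihoods agree
```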
In a 4-state Markov chain (A,C,G,T) such as the one we saw in Lecture 2, what are the sufficient statistics for the estimation of the transition probabilities?
Iterative approaches akin to what we saw when we used the EM algorithm
can sometimes help avoid information loss. For instance, when analyzing
mass spectrometry data, a first run guesses at peaks individually for
every sample. After this preliminary spectra-spotting, another iteration
allows us to borrow strength from the other samples to spot spectra that
may have been labeled as noise.
# Sharpen Your Tools: Reproducible Research
Analysis projects often begin with a simple script, exploration, prototyping.
- Many ideas are tried, dismissed, adapted.
- More data come in, external datasets are integrated.
- More people become involved.
Eventually, the paper needs to be written, figures need to be done properly, and the analysis needs to be saved for the scientific record and to document its integrity. Here are a few tools that can help.
- **Use an integrated development environment (IDE):** RStudio/Posit; VS Code; IDLE...
- **Use literate programming** tools: R Markdown, Quarto, Jupyter.
- **Anticipate re-engineering of the data formats and the software.**
* **Use functions** rather than copy-pasting stretches of code.
* **Reorganize data** upfront rather than repeatedly on the fly.
* **Centralize** the location of the raw data files, streamline & automate the derivation of intermediate data.
- **Reuse existing tools.**
- **Use version control**, such as git.
- **Use the R package system.**
- **Keep a hyperlinked webpage with an index of all analyses.**
# Data representation
Combining all the data so it is ready for analysis or visualisation often involves a lot of shuffling around of the data, until they are in the right shape and format for an analytical algorithm or a graphics routine.
Errors can occur: labels get lost, information gets lost. Be safe: redundancy is good.
# Wide vs long table format
A single cell expression data matrix (Hiiragi2013 package) data (only the first 5 rows and columns):
```{r}
#| label: wide
#| message: FALSE
#| echo: !expr c(2:5)
library("SummarizedExperiment")
data("x", package = "Hiiragi2013")
dfw = as.data.frame(exprs(x))
dfw[1:5, 1:5]
dim(dfw)
```
This is an example for a data table in *wide format*.
```{r}
#| label: pivot_longer
#| echo: !expr c(3)
library("magrittr")
dfw$probeid = rownames(dfw)
dfl = pivot_longer(dfw,
cols = colnames(x),
names_to = "Embryonic Day") %T>% print
```
In `dfl`, each row corresponds to exactly one measured value, stored in the column named `value`. Then there are additional columns that store the associated covariates.
Now suppose we want to store somewhere not only the probe identifiers
but also the associated gene symbols.
In `dfw`, we could stick them as an additional column, and perhaps also throw in the genes' ENSEMBL identifier for good measure. But now we
have a problem: the dataframe now has some columns that
represent different samples, and others that refer to information for
all samples (the gene symbol and identifier) and we somehow have to
know this when interpreting the dataframe. This is what Hadley Wickham
calls **untidy data**.
In contrast, in `dfl`, we can add these columns, yet still know that each row forms exactly one observation, and all information associated with that observation is in the same row.
# Tidy data
In tidy data
1. each variable forms a column,
2. each observation forms a row,
3. each type of observational unit forms a table.
# Using tidy data wisely
See also https://www.huber.embl.de/msmb/Chap-Design.html#tidy-data-using-it-wisely
## Efficiency (lack of)
Even though there are only 45101 probe-gene symbol relationships, we would store one in each of the 4555201 rows of `dfl`, i.e., repeat each of them 101 times.
## Standardization (lack of)
We may choose to call these columns `probeid` and `geneid`, but the next person might call them `probe_id` and `gene_id`, or even something completely different. When we find a dataframe that was made by someone else and that contains such-named columns, we may hope, but have no guarantees, that these are valid gene symbols.
Addressing such issues is behind the object-oriented design of the specialized data structures in Bioconductor, such as the class `SummarizedExperiment`.
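A minimal sketch (toy data, ours) of the idea: a `SummarizedExperiment` bundles the assay matrix with row (feature) and column (sample) annotations, so they cannot drift apart or be misinterpreted.

```{r}
#| label: sesketch
#| echo: TRUE
## Toy SummarizedExperiment: assay matrix plus feature and sample annotations
library("SummarizedExperiment")
se = SummarizedExperiment(
  assays  = list(exprs = matrix(rnorm(12), nrow = 4,
    dimnames = list(paste0("probe", 1:4), paste0("sample", 1:3)))),
  rowData = DataFrame(symbol = paste0("GENE", 1:4)),
  colData = DataFrame(condition = c("ctrl", "ctrl", "treated"))
)
se
```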
## Matrices versus dataframes
Representation of the matrix may also simply be more natural and mathematically stringent - e.g., if we want to apply PCA / dimension reduction, or classification.
# Summary
There are two types of variation in an experiment or study: those of interest and those that are unwanted.
We usually cannot rid ourselves of all the unwanted variation, but we saw how balanced randomized designs and data transformations can mitigate its impact.
Reproducible workflows
Open science: make data and code available
# Acknowledgments
::: {.columns}
::: {.column width="50%"}
![Susan Holmes](fig/microstats_holmes.jpg)
:::
:::{.column width="50%"}
![$\quad\quad\quad\quad\quad\quad\quad\quad$Julia Philipp](fig/EulsLXGS_400x400.jpg){width="65%"}
:::
:::
Quarto sources for this talk:
[https://github.com/wolfganghuber/Best-Analysis-Practices-Talk](https://github.com/wolfganghuber/Best-Analysis-Practices-Talk)
## References