---
title: 'Good scientific practice: <a href="https://www.huber.embl.de/msmb/Chap-Design.html">Design of High Throughput Experiments and their Analysis</a>'
author: Susan Holmes, Wolfgang Huber
date: 2023-01-17
date-format: iso
format:
  revealjs:
    theme: [default, GHGA.scss]
    logo: fig/GHGA_short_Logo_orange.png
    transition: slide
    scrollable: true
    slide-number: c/t
    show-slide-number: all
    auto-stretch: false
    # auto-stretch is a huge source of grief for slide layout, see https://quarto.org/docs/presentations/revealjs/advanced.html --> stretch
  html:
    code-line-numbers: false
execute:
  echo: false
  warning: true
  error: false
  message: false
slide-level: 1
bibliography: refs.bib
editor_options:
  chunk_output_type: console
---
# Why are we here?
```{r initialize}
## Set default options for code chunks
knitr::opts_chunk$set(cache = TRUE, autodep = TRUE)
set.seed(0xbedada)
```
::: {layout-ncol=2}
![](fig/Youngronaldfisher2.JPG){width="40%" fig-align="center"}
<font size=+2>
*To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.*<br>
Presidential Address to the First Indian Statistical Congress, 1938. Sankhya 4, 14-17.
</font>
R.A. Fisher, Pioneer of Statistics and Experimental Design
:::
# Design of High Throughput Experiments
::: {layout="[[70,30]]"}
This lecture is based on Chapter 13 of the book<br>
*Modern Statistics for Modern Biology*<br>
by Susan Holmes, Wolfgang Huber<br>
[https://www.huber.embl.de/msmb/](https://www.huber.embl.de/msmb/)
![](fig/book-cover-photo.jpg){width="10%" fig-align="center"}
:::
# Topics
- Resource allocation & experimental design: a matter of trade-offs & good judgement
- Dealing with the different types of variability; partitioning variability
- Experiments, studies, ...
- Power, sample size and efficiency.
- Transformations
- Things to worry about: dependencies, batch effects, unwanted variation.
- Compression, redundancy and sufficiency
- Computational best practices
![](fig/Garten-der-Lueste-mitte-oben.png){width="100%"}
# The Art of "Good Enough"
- Experimental design rationalizes the tradeoffs imposed by having finite resources.
- Our measurement instruments have limited resolution and precision; often we don't know these at the outset and have to collect preliminary data providing estimates.
- Sample sizes are limited for practical, economic, and sometimes ethical reasons.
- We may only be able to observe the phenomenon of interest indirectly rather than directly.
- Our measurements may be overlaid with nuisance factors over which we have limited control.
- There is little point in prescribing unrealistic ideals: we need to make pragmatic
choices that are feasible.
# Accuracy vs precision
![](fig/accuracy-vs-bias.png)
# Types of studies / experiments
## Experiment
- everything is exquisitely controlled
## Retrospective observational studies
- we take what we get, opportunistic; no control over study participants, assignment of important factors, confounding
## Prospective, controlled studies
- e.g., clinical trials
- randomization, blinding
- ethical as well as practical constraints (money, time).
## Meta-analysis
- We did not design the experiments or studies ourselves, nor collect the data.
- Retrospective analysis of data that already happen to exist.
# Illustration: experiment
Well-characterized cell line growing in laboratory conditions on defined media, temperature
and atmosphere.
We administer a precise amount of a drug, and after
72h we measure the activity of a specific pathway reporter.
![](fig/blog-banner_2017_feb_lab-comfort_3840x1200-2560x800.jpg)
# Illustration: challenges with studies
::: {.columns}
::: {.column width="60%"}
We recruit patients with a chronic disease and who fulfill
inclusion criteria, and we ask them to take a drug each day exactly at 6 am.
Before, and after 3 months, we perform a diagnostic test.
- People may forget to take the pill or take it at the wrong time.
- Some may feel that the disease got worse and stop taking the drug.
:::
::: {.column width="40%"}
![](fig/Cutting-stone-of-madness.png)
:::
:::
- Some may feel that the disease got better and stop taking the drug.
- Some may lead a healthy life-style, others drink, smoke and eat junk food.
- Maybe it's not one disease but several.
- And all of these factors may be correlated with each other in complex ways.
What to do about this?
::: {.fragment}
- need much larger sample sizes than with an experiment
- randomization, blinding, documentation
:::
# Biological versus technical replicates
Imagine we want to test whether a weight loss drug works. Which of the
following designs is better?
- A person is weighed on milligram precision scales, with 20 replicates. He/she follows the diet, and four weeks later, (s)he is weighed again, with 20 replicates. The data are recorded by a specially trained technician.
- 20 people weigh themselves on their bathroom scales and self-report the number. Four weeks later, they weigh themselves and report again.
Some design decisions in HT biology are similar, if more subtle:
- sequencing libraries
- cell lines
- CRISPR guides
- drugs
# Quiz
For reliable variant calling with current sequencing technology, you need ca. $30\times$ coverage for a human genome.
In the *1000 Genomes Project*, the average depth of the data produced was 5.1 for 1092 individuals. Why was that study design chosen?
$\quad$
$\quad$
![](fig/Garten-der-Lueste-mitte-unten.png){width="80%" fig-align="center"}
# What exactly is a technical versus biological replicate?
In fact, the terminology is too simplistic. Variation can occur at many different levels:
- different CRISPR guide RNAs for the same target gene
- different drugs for the same target protein
- different cell line (organoid, animal) models for the same biological system
- different measurement technologies (e.g., multiplexed immunohistochemistry vs imaging mass cytometry)
- different machines for the same technology
- different variants of the protocol
- different technician, lab
We can use [blocking](https://en.wikipedia.org/wiki/Blocking_(statistics)) and the design-matrix machinery of linear models to record and account for such variation.
To the extent that we know about such nuisance factors and have kept track of them, we can (should) include them explicitly as bias terms in our analysis models.
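A minimal sketch of what "include them explicitly as bias terms" looks like in R; the factor names (`condition`, `batch`) and their levels are made-up toy values:

```{r}
#| label: designmatrixsketch
#| echo: TRUE
## Hypothetical toy factors: a known nuisance factor ("batch") recorded as a
## blocking term in the design matrix of a linear model
condition = factor(rep(c("ctrl", "treated"), each = 4))
batch     = factor(rep(c("day1", "day2"), times = 4))
model.matrix(~ condition + batch)  # batch appears as an explicit bias column
```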
If we did not keep track, we can try to use latent factor models (SVA, PEER, RUV) to identify them from the data.
# Error models: Noise is in the eye of the beholder
The efficiency of most biochemical or physical processes involving nucleic acid polymers depends on their sequence content, for instance, occurrences of long homopolymer stretches, palindromes, GC content.
These effects are not universal, but can also depend on factors like concentration, temperature, which enzyme is used, etc.
When looking at RNA-Seq data, should we treat GC content as noise or as bias?
![](fig/portrait-of-shi-xie-and-his-studio.png){width="80%" fig-align="center"}
# One person's noise can be another's bias
::: {.columns}
::: {.column width="10%"}
![](fig/HeadsorTails.jpg){fig-align="right" width="70%"}
:::
::: {.column width="90%"}
We may think that the outcome of tossing a coin is completely random.
:::
:::
If we meticulously registered the initial conditions of the
coin flip and solved the mechanical equations, we could predict which
side has a higher probability of coming up: noise becomes bias.
<div style="text-align: center;">
![Tosser](fig/cointosser3.png){width="50%"}
[P. Diaconis, S. Holmes, et al.](https://news.stanford.edu/pr/2004/diaconis-69.html)
</div>
We use mathematical and probabilistic modelling as a method to bring some order into our ignorance---but keep in mind that such models are an expression of our subjective
ignorance, not an objective property of the world.
# A lack of units
::: {.columns}
::: {.column width="60%"}
Pre-modern measurement systems measured lengths in feet, ells, inches (first joint of an index finger), weights in stones, volumes in multiples of the size of a wine jar, etc.
In the [International System of Units](https://en.wikipedia.org/wiki/International_System_of_Units), meters, seconds, kilograms, amperes, ... are defined based on universal physical constants.
:::
::: {.column width="40%"}
![](fig/2420_titelblatt_mathematikbuch.jpg)
:::
:::
There is no reliance on artefacts, and a meter measured by a lab in Canada using one instrument has the same meaning as a meter measured a year later by a lab in Heidelberg using a different instrument, by a space probe in the Kuiper belt, or by an extraterrestrial intelligence on Proxima Cen b.
Measurements in biology are, unfortunately, rarely that comparable.
Often, absolute values are not reported (these would require units), but only fold changes with regard to some local reference.
Even when absolute values exist (e.g., read counts in an RNA-Seq experiment) they usually do not translate into universal units such as molecules per cell or mole per milliliter.
# Normalization
Dealing with the lack of units in many biological assays
- try to make measurements comparable to each other using some 'arbitrary' but at least consistent units
- only consider endpoints where the units cancel out (e.g., fold change)
::: {.fragment}
Examples
- Sequence counts
- Mass spec intensities
- Fluorescence intensities (e.g. CODEX)
:::
::: {.fragment}
Btw, terrible name. It has nothing to do with normal distribution, vector norm, or right angles.
Perhaps better would be "calibration".
:::
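As a deliberately simplistic sketch of the first strategy: scaling a count matrix to "counts per million" puts samples on arbitrary but consistent units. The toy matrix below is made up; in practice, more robust size-factor estimators (e.g., median-of-ratios) are preferable.

```{r}
#| label: cpmsketch
#| echo: TRUE
## Toy sketch: "counts per million" scaling of a hypothetical count matrix
counts = matrix(rpois(20, lambda = 100), nrow = 5,
  dimnames = list(paste0("gene", 1:5), paste0("sample", 1:4)))
cpm = t(t(counts) / colSums(counts)) * 1e6
```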
# What is a good normalization method?
:::{.columns}
:::{.column width="80%"}
![](fig/normalization.png){fig-align="center"}
:::
:::
- If the normalization is 'off', we can have increased variability between replicates
- ... and/or apparent systematic differences between different conditions that are not real
- $\to$ false positives, false negatives
What do we want from a good normalization method:
- remove technical variation
- but *keep* biological variation
Possible figure of merit?
- signal-to-noise ratio, computed from pos. and neg. controls, replicates
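One concrete candidate, sketched below with simulated controls (the helper `zprime` is ours): the Z'-factor widely used in high-throughput screening, computed from positive and negative controls.

```{r}
#| label: zprimesketch
#| echo: TRUE
## Sketch: Z'-factor from positive and negative controls; values close to 1
## indicate good separation of signal from noise
zprime = function(pos, neg)
  1 - 3 * (sd(pos) + sd(neg)) / abs(mean(pos) - mean(neg))
zprime(pos = rnorm(32, mean = 10), neg = rnorm(32, mean = 2))
```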
# Occam's razor
::: {.columns}
::: {.column width="40%"}
![William of Ockham](fig/William_of_Ockham.png)
:::
:::{.column width="60%"}
If one can explain a phenomenon without assuming this or that hypothetical entity, there is no ground for assuming it.
One should always opt for an explanation in terms of the fewest possible causes, factors, or variables.
:::
:::
# Regular and catastrophic noise
::: {.columns}
::: {.column width="60%"}
Regular noise: can be modeled by simple probability models (independent normal, Poisson distributions; or mixtures such as
Gamma-Poisson or Laplace)
Can use relatively straightforward methods to take regular noise into account
Real world is more complicated: measurements can be completely off the scale (sample swap, contamination, software bug, ...)
Multiple errors often come together: e.g., a whole microtiter plate went bad, affecting all data measured from it.
:::
::: {.column width="40%"}
![](fig/Train_wreck_at_Montparnasse_1895.jpg)
:::
:::
Such events are hard to model or even correct for---our best chance is data quality assessment, outlier detection and documented removal.
# RNAi screen example
<div style="text-align: center;">
![](fig/RNAiScreen.png){width="55%"}
</div>
# Keeping track: Dailies
::: {.columns}
::: {.column width="30%"}
![](fig/dailies_icon.png){fig-align="right" width="50%"}
:::
::: {.column width="70%"}
A film director will view daily takes, to correct potential lighting, shooting issues before they affect too much footage. It is a good idea not to wait until all the runs of an experiment have been finished before looking at the data.
:::
:::
Intermediate data analyses and visualizations catch unexpected sources of variation in the data and enable you to adjust the protocol.
It is important to be aware of sources of variation as they occur and adjust for them.
# Vary one factor at a time
::: {.columns}
::: {.column width="33%"}
![Ibn Sina (Avicenna)](fig/Avicenna.png){width="100%"}
:::
::: {.column width="33%"}
![](fig/medicus.jpeg){height="75%"}
:::
::: {.column width="30%"}
[Ibn Sina's](https://en.wikipedia.org/wiki/Avicenna) *Canon of Medicine (1020)* lists seven rules of experimental design, including the need for controls and replication, the danger of confounding and the necessity of changing only one factor at a time.
:::
:::
::: {.fragment}
This dogma was overthrown in the 20th century by RA Fisher.
:::
# A toy example: two-group comparison
```{r}
#| label: illustrateboxplots
#| echo: false
library("gridExtra")
library("ggplot2")
library("dplyr")
n1 = 6L
n2 = 24L
df0 = tibble(
condition = factor(paste("condition", rep(c(1, 2), n1))),
batch = factor(paste( rep(c(1, 2), n1))),
y = c(1.5,2)[as.integer(condition)] + rnorm(2*n1, 0, 0.1)
)
df1 = tibble(
condition = factor(paste("condition", rep(c(1, 2), n1))),
batch = factor(seq_len(2*n1) %% 3),
y = c(1.5,2)[as.integer(condition)] + c(-0.5, 0, 0.5)[as.integer(batch)] + rnorm(2*n1, 0, 0.3)
)
df2 = tibble(
condition = factor(paste("condition", rep(c(1, 2), n2))),
batch = factor(seq_len(2*n2) %% 3),
y = c(1.5,2)[as.integer(condition)] + c(-0.5, 0, 0.5)[as.integer(batch)] + rnorm(2*n2, 0, 0.3)
)
myplot = function(x)
ggplot(x, aes(x = condition, y = y)) +
geom_boxplot(alpha = 0.5, col = "blue") +
geom_point(size = 2, alpha = 0.75) + xlab("")
```
```{r p0p0eff, fig.width = 4, fig.height = 2, out.width="80%"}
p0 = myplot(df0)
ms0 = summarize(group_by(df0, condition), m = median(y))
p0effect = p0 + geom_segment(
data = ms0, aes(y = m[1], yend = m[2]),
x = 1.5, xend = 1.5, col="red",
arrow = arrow(length = unit(0.25, "cm"), ends = "both", type = "closed"))
grid.arrange(p0, p0effect, ncol = 2)
```
However, suppose the experiment was done in two "batches", and we color the data accordingly:
```{r p0p0batch, fig.width = 5, fig.height = 2, out.width="80%"}
p0batch = p0 + aes(x = condition, y = y, color = batch)
grid.arrange(p0, p0batch, ncol = 2, widths = c(0.44, 0.56))
```
We cannot draw a conclusion, because of **confounding**: condition and batch are inseparable.
Now suppose there is no such fatal confounding, but an unknown batch effect that is balanced between the conditions,
and effectively causes higher noise levels.
Then, the previous sample size (2 x `r n1`) is not enough. With 2 x `r n2`, it looks better:
```{r, fig.width = 4, fig.height = 2, out.width="80%"}
p1 = myplot(df1)+ ylim(c(0, 3.1))
p1batch = p1 + aes(x = condition, y = y, color = batch)
p2 = myplot(df2)
grid.arrange(p1 + xlab(paste("p =", format.pval(t.test(y ~ condition, data = df1)$p.value, digits = 2))),
p2 + xlab(paste("p =", format.pval(t.test(y ~ condition, data = df2)$p.value, digits = 2))),
ncol = 2)
```
# Lessons from the toy example
- Success of an experiment: detect a difference iff it is truly there, and for the right reasons
- Spread (sd, var) matters as much as the effect size
Whether we succeed depends on:
1) Effect size (**unchangeable**)
2) Control and documentation of all factors (block effects, date / time / operator etc.).
3) Noise (irreducible variability in measurements): $\sigma$
4) Sample sizes: remember standard error of mean is $\sim\frac{\sigma}{\sqrt{n}}$
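The interplay of points 1), 3) and 4) can be explored with base R's `power.t.test`; the numbers below loosely echo the toy example (effect size 0.5, $\sigma = 0.3$) and are for illustration only.

```{r}
#| label: powersketch
#| echo: TRUE
## How many samples per group for 90% power at effect size 0.5 and sd 0.3?
power.t.test(delta = 0.5, sd = 0.3, power = 0.9)
## Conversely: what power do we get with n = 6 per group?
power.t.test(n = 6, delta = 0.5, sd = 0.3)
```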
# Decomposition of variability: analysis of variance (ANOVA)
::: {layout-ncol=1}
```{r}
#| label: block1
#| fig-width: 5
#| fig-height: 2
#| out.width: "100%"
grid.arrange(p1, p1batch, ncol=2, widths = c(0.44, 0.56))
```
:::
```{r}
#| label: block2
#| echo: TRUE
#| code-line-numbers: FALSE
lm(y ~ condition, data = df1) |> anova()
```
$\quad$
```{r}
#| label: block3
#| echo: TRUE
#| code-line-numbers: FALSE
lm(y ~ condition + batch, data = df1) |> anova()
```
# Blocking: the case of paired experiments.
::: {.columns}
::: {.column width="40%"}
![ZeaMays](fig/maizeDarwin.png){width="100%"}
:::
::: {.column width="55%"}
Darwin's *Zea Mays* experiment:
```{r}
#| echo: !expr 2
data(darwin.maize, package= "agridat")
head(darwin.maize)
with(darwin.maize, table(type, pair))
```
- 15 pairs of plants, each with two different pollination methods, across 4 pots.
- a **balanced design**: all the different factor combinations have the same number of observation replicates.
- particularly easy to analyse
:::
:::
## Paired t-test is equivalent to a 2-way ANOVA
```{r}
#| label: zeamays
#| echo: TRUE
#| code-line-numbers: FALSE
with(darwin.maize,
t.test(height[type == "self"], height[type == "cross"], paired = TRUE))
fit = lm(height ~ type + pair, data = darwin.maize)
anova(fit)
coef(fit)["typeself"]
```
# t-test comments
The proof that the $t$-statistic follows a certain $t$-distribution assumes that observations are independent and follow a normal distribution: this is a sufficient, but not necessary, condition.
<span style="color:red">Deviation from normality</span> (heavier tails): test typically maintains type-I error control (it may lose some power).
Options: use simulations or permutations; use a different test (e.g., Wilcoxon)
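A minimal sketch of the permutation option (the helper `perm_test` is hypothetical, not from the book): re-randomize the group labels to obtain the null distribution of the test statistic.

```{r}
#| label: permsketch
#| echo: TRUE
## Hypothetical helper: permutation test for a two-group difference in means
perm_test = function(y, group, B = 10000) {
  stat  = function(g) abs(diff(tapply(y, g, mean)))
  tobs  = stat(group)
  tperm = replicate(B, stat(sample(group)))
  (sum(tperm >= tobs) + 1) / (B + 1)  # +1 avoids p-values of exactly 0
}
## e.g., perm_test(df1$y, df1$condition)
```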
<span style="color:red">Deviation from independence</span>: type-I error control is lost, p-values will likely be totally wrong (e.g., for positive correlation, too optimistic).
No easy options:
- try to model the dependence (multivariate analysis)
- remove it
- empirical null (Efron et al.)
[Demo shiny app](http://127.0.0.1:5794)
# "Block what you can, randomize what you cannot"
(George Box, 1978)
Often we don't know which nuisance factors will be important, or we
cannot plan for them ahead of time.
In such cases, randomization is a practical strategy: at least in the limit of large enough sample size, the effect of any nuisance factor should average out.
## Randomized Complete Block Design
![](fig/CompleteRB3.png){width="50%"}
"The design space is divided into uniform units to account for any variation so that observed differences within units are largely due to true differences between treatments. Treatments are then assigned at random to the subjects in the blocks - once in each block. The defining feature of the Randomized Complete Block Design is that each block sees each treatment exactly once." ([Trudi Grant](https://pbgworks.org/sites/pbgworks.org/files/RandomizedCompleteBlockDesignTutorial.pdf))
## Randomization decreases bias
- Humans are bad at assigning treatments truly at random
- Random assignment reduces unconscious bias
(special samples treated differently, "balancing things out", ...)
- Randomization also helps with unknown nuisance factors.
## Randomization helps inference
- if the sample is randomly generated from a population, we can infer
something about the population we drew from
## Random does not mean haphazardly
- Need to use a random number generator and a seed.
There is also rich literature on deterministic, balanced block designs.
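A minimal sketch of randomized, not haphazard, assignment for a complete block design, using a seeded random number generator; the block and treatment labels are made up.

```{r}
#| label: rcbdsketch
#| echo: TRUE
## Sketch: randomized complete block design, 4 blocks x 3 treatments;
## each block sees each treatment exactly once, the seed makes it reproducible
set.seed(0xbedada)
design = data.frame(
  block     = rep(paste0("block", 1:4), each = 3),
  treatment = unlist(lapply(1:4, function(i) sample(c("A", "B", "C"))))
)
design
```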
# Thinking of computational methods like we do of biological assays
Many biological assays are so fiddly and fickle that we always use them together with positive and negative controls
(Western blots, CRISPR, PCR...)
With sufficiently complicated computational analyses, it can be worthwhile to think of them in the same way.
# How many replicates do I need?
In a well-controlled experiment: rule of thumb, $n \le 3$ can suffice.
In a study: dozens, hundreds.
If a factor of interest is continuous (e.g. time, temperature, concentration) and sampled at many points, there is usually no need for replication within its levels.
# Effective sample size for dependent data.
A sample of independent observations is more informative than the same number of dependent observations. $\Rightarrow$
Dependent data require **larger** samples. Intuition: each observation contains **less** information.
Consider an opinion poll. We knock at people's doors and ask them a question.
- Scenario 1: pick $n$ people at $n$ random places throughout the country.
- Scenario 2: to save travel time, you pick $n/3$ random places and then at each of these interview three people who live next door to each other.
In both cases, the number of people polled is $n$.
But if we assume that people living in the same neighborhood are more likely to have similar opinions, the data from Scenario 2 are (positively) correlated.
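A standard back-of-the-envelope correction for this (a sketch, assuming equal cluster size $m$ and constant intra-cluster correlation $\rho$) is the Kish design effect:
\begin{equation}
n_{\text{eff}} = \frac{n}{1 + (m-1)\,\rho}
\end{equation}
With $m = 3$ neighbours per place and, say, $\rho = 0.5$, the $n$ interviews of Scenario 2 carry only as much information as $n/2$ independent ones.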
```{r effective_sample_size_sim, fig.width = 4, fig.height = 2.5, out.width = "80%", message = FALSE, fig.cap = "Density estimates for the polling result using the two sampling methods. The correlated method has higher spread. The truth is indicated by the vertical line."}
library("tidyr")
doPoll = function(n = 100, numPeoplePolled = 12) {
opinion = sort(rnorm(n))
i1 = sample(n, numPeoplePolled)
i2 = sample(seq(3, n, by = 3), numPeoplePolled / 3)
i2 = c(i2, i2 - 1, i2 - 2)
c(independent = mean(opinion[i1]), correlated = mean(opinion[i2]))
}
responses = replicate(5000, doPoll())
ggplot(pivot_longer(as_tibble(t(responses)), cols = rownames(responses)),
aes(x = value, col = name)) + geom_density() +
geom_vline(xintercept = 0) + xlab("Opinion poll result")
```
Simulation: "opinions" in a population of `r formals(doPoll)$n` people are $N(0,1)$. We want to estimate the mean by sampling. In scenario 1, we ask `r formals(doPoll)$numPeoplePolled` randomly selected people. In scenario 2, we ask `r formals(doPoll)$numPeoplePolled / 3` randomly selected people as well as two of their direct neighbours.
<!--We model the spatio-sociological structure of our country by sorting the houses from most negative to most positive opinion in the first line of the `doPoll` function.-->
## Time course: equalize the dependence structure by taking more frequent samples when there is more "going on".
![Response to a transient perturbation](fig/SamplingDesignSmall.png){fig-align="center" width="60%"}
# Data transformations
\begin{equation}
X_{\text{obs}} = f(\theta) + \varepsilon
\end{equation}
- Homoskedastic: $\quad\text{sd}(\varepsilon)=\text{const.}$
- Heteroskedastic: e.g.,
+ multiplicative error: $\quad\text{sd}(\varepsilon)\sim f(\theta)$
+ Poisson noise: $\quad\text{var}(\varepsilon)\sim f(\theta)$
![](fig/delta_method_modified.png){fig-align="center" width="60%"}
- multiplicative error: $\log(X)$
- Poisson noise: $\sqrt X$
Examples:
- Fluorescence based assays: additive + multiplicative [@RockeDurbin:2001]
- RNA-Seq: Poisson + multiplicative.
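A quick simulation (ours, for illustration) of why $\sqrt{X}$ stabilizes Poisson noise: the raw variance grows with the mean, while after the square root it is approximately constant for larger means.

```{r}
#| label: varstabsketch
#| echo: TRUE
## Variance of Poisson counts grows with the mean ...
lambda = c(1, 10, 100, 1000)
x = sapply(lambda, function(l) rpois(1e4, l))
apply(x, 2, var)
## ... but is approximately constant (~1/4) after sqrt, for larger lambda
apply(sqrt(x), 2, var)
```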
# Data quality assessment and quality control
- QA: measure and monitor data quality
- QC: remove bad data
Pervade all phases of analysis: raw data, transformation, summarization, model fitting, hypothesis testing, screening for ''hits''. Diagnostics and metrics:
- Feature value range (Boxplot, ECDF) across samples
- Joint distributions (scatter plots, heatmaps, sample-sample or feature-feature correlation matrices)
- Correlation of replicates vs between biological conditions
- Evidence of batch effects (categorical, continuous), explicitly known or latent.
![](fig/mv-v19-2274-Fig2.png){fig-align="center" width="80%"}
- [iSEE](https://doi.org/doi:10.18129/B9.bioc.iSEE) package:
[PBMC4k example](https://marionilab.cruk.cam.ac.uk/iSEE_pbmc4k),
[CYTOF example](https://marionilab.cruk.cam.ac.uk/iSEE_cytof)
- [MatrixQCvis](https://doi.org/doi:10.18129/B9.bioc.MatrixQCvis) package
::: {.columns}
::: {.column width="50%"}
![](fig/1896_Ford_Quadricycle.jpeg){fig-align="left"}
:::
::: {.column width="50%"}
Henry Ford (possibly apocryphal): ["If I had asked people what they wanted, they would have said faster horses."](https://corporate.ford.com/history.html): quality as **fitness for purpose**, versus **adherence to specifications**.
Quality is not easy to nail down (or mathematically define); the word is used with many meanings. See also http://en.wikipedia.org/wiki/Quality_(business)
:::
:::
# Longitudinal Data
Longitudinal data have time as a covariate.
- a handful of time points (e.g. response of a cell line measured 48h, 72h and 84h after exposure to a drug) vs a long and densely sampled time series (e.g. patch clamp data in electrophysiology or a movie from live-cell microscopy).
- in the 1st case, think of time as just another experimental factor, like concentration or choice of drug. One analysis strategy could be to first identify the "best", or biologically most indicative, time point, and then focus on that. Or ask whether there is any effect at all, regardless of the time.
- in the 2nd case, a time series, we'll often want to fit dynamical models to the data. We have many choices (see the sketch after this list):
- (Hidden) Markov models
- Change point detection
- Ordinary differential equations or reaction-diffusion models
- Piece-wise deterministic stochastic processes
- Autoregressive models
- Non-parametric smoothing followed by clustering or classification into prototypic shapes
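A minimal sketch of the last option, on simulated data (ours): non-parametric smoothing with `loess`, a typical first step before clustering trajectories into prototypic shapes.

```{r}
#| label: loesssketch
#| echo: TRUE
## Sketch: loess smoothing of a simulated dense time series
tp  = seq(0, 10, by = 0.05)
y   = sin(tp) + rnorm(length(tp), sd = 0.3)
fit = loess(y ~ tp, span = 0.2)
plot(tp, y, col = "grey")
lines(tp, predict(fit), col = "red", lwd = 2)
```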
# Don't Pretend You Are Dumb
- Pros and cons of "unbiased" approaches
Examples include:
- Clustering vs (semi-)supervised classification
- Multiple testing / hypothesis weighting ($\to$ independent hypothesis weighting, et al.)
- Feature scaling
# Dichotomization is bad
Grouping / discretization of continuous variables
::: {.columns}
::: {.column width="50%"}
![](fig/Garten-der-Lueste-rechts-unten.png){width="100%"}
:::
::: {.column width="50%"}
Dichotomizing continuous predictors in multiple regression: a bad idea<br>
[Royston, Altman, Sauerbrei<br>
Statistics in Medicine 2006](https://doi.org/10.1002/sim.2331)
:::
:::
# Leaky pipelines and statistical sufficiency
::: {.columns}
::: {.column width="40%"}
![](fig/leakypipeline.png){width="70%"}
:::
::: {.column width="55%"}
Data analysis pipelines in high-throughput biology often work as **funnels** that successively summarise and compress the data. In high-throughput sequencing, we may start with individual sequencing reads, then align them to a reference, then only count the aligned reads for each position, summarise positions to genes (or other kinds of regions), then "normalize" these numbers by library size, etc.
:::
:::
At each step, we lose some information, and it is important to make sure we still have enough information for the task at hand. The problem is particularly acute if we use a data pipeline built from a series of separate components without enough care being taken to ensure that all the information necessary for later steps is preserved.
Statisticians have a concept for whether certain summaries enable the
reconstruction of all the relevant information in the data:
**sufficiency**. E.g., in a Bernoulli random experiment with a known
number of trials, $n$, the number of successes is a sufficient statistic
for estimating the probability of success $p$.
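A tiny numerical illustration (ours): two Bernoulli datasets with the same number of successes yield identical likelihood functions for $p$, so nothing beyond the count is needed.

```{r}
#| label: sufficiencysketch
#| echo: TRUE
## Two orderings, same number of successes (k = 3 out of n = 10)
x1 = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
x2 = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1)
loglik = function(p, x) sum(dbinom(x, size = 1, prob = p, log = TRUE))
all.equal(loglik(0.3, x1), loglik(0.3, x2))  # TRUE: likelihoods agree
```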
In a 4-state Markov chain (A,C,G,T) such as the one we saw in Lecture 2, what are the sufficient statistics for the estimation of the transition probabilities?
Iterative approaches akin to what we saw when we used the EM algorithm
can sometimes help avoid information loss. For instance, when analyzing
mass spectrometry data, a first run guesses at peaks individually for
every sample. After this preliminary spectra-spotting, another iteration
allows us to borrow strength from the other samples to spot spectra that
may have been labeled as noise.
# Sharpen Your Tools: Reproducible Research
Analysis projects often begin with a simple script, exploration, prototyping.
- Many ideas are tried, dismissed, adapted.
- More data come in, external datasets are integrated.
- More people become involved.
Eventually, the paper needs to be written, figures need to be done properly, and the analysis needs to be saved for the scientific record and to document its integrity. Here are a few tools that can help.
- **Use an integrated development environment (IDE):** RStudio/Posit; VS Code; IDLE...
- **Use literate programming** tools: R Markdown, Quarto, Jupyter.
- **Anticipate re-engineering of the data formats and the software.**
* **Use functions** rather than copy-pasting stretches of code.
* **Reorganize data** upfront rather than repeatedly on the fly.
* **Centralize** the location of the raw data files, streamline & automate the derivation of intermediate data.
- **Reuse existing tools.**
- **Use version control**, such as git.
- **Use the R package system.**
- **Keep a hyperlinked webpage with an index of all analyses.**
# Data representation
Combining all the data so it is ready for analysis or visualisation often involves a lot of shuffling around of the data, until they are in the right shape and format for an analytical algorithm or a graphics routine.
Errors can occur: labels get lost, information gets lost. Be safe: redundancy is good.
# Wide vs long table format
A single cell expression data matrix (Hiiragi2013 package) data (only the first 5 rows and columns):
```{r}
#| label: wide
#| message: FALSE
#| echo: !expr c(2:5)
library("SummarizedExperiment")
data("x", package = "Hiiragi2013")
dfw = as.data.frame(exprs(x))
dfw[1:5, 1:5]
dim(dfw)
```
This is an example for a data table in *wide format*.
```{r}
#| label: pivot_longer
#| echo: !expr c(3)
library("magrittr")
dfw$probeid = rownames(dfw)
dfl = pivot_longer(dfw,
cols = colnames(x),
names_to = "Embryonic Day") %T>% print
```
In `dfl`, each row corresponds to exactly one measured value, stored in the column named `value`. Then there are additional columns that store the associated covariates.
Now suppose we want to store somewhere not only the probe identifiers
but also the associated gene symbols.
In `dfw`, we could stick them as an additional column, and perhaps also throw in the genes' ENSEMBL identifier for good measure. But now we
have a problem: the dataframe now has some columns that
represent different samples, and others that refer to information for
all samples (the gene symbol and identifier) and we somehow have to
know this when interpreting the dataframe. This is what Hadley Wickham
calls **untidy data**.
In contrast, in `dfl`, we can add these columns, yet still know that each row forms exactly one observation, and all information associated with that observation is in the same row.
# Tidy data
In tidy data
1. each variable forms a column,
2. each observation forms a row,
3. each type of observational unit forms a table.
# Using tidy data wisely
See also https://www.huber.embl.de/msmb/Chap-Design.html#tidy-data-using-it-wisely
## Efficiency (lack of)
Even though there are only 45101 probe-gene symbol relationships, we would store one in each of the 4555201 rows of `dfl`, i.e., repeat each of them 101 times.
## Standardization (lack of)
We may choose to call these columns `probeid` and `geneid`, but the next person might call them `probe_id` and `gene_id`, or even something completely different. When we find a dataframe that was made by someone else and that contains such-named columns, we may hope, but have no guarantees, that these are valid gene symbols.
Addressing such issues is behind the object-oriented design of the specialized data structures in Bioconductor, such as the class `SummarizedExperiment`.
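A minimal sketch (toy data, ours) of the idea: a `SummarizedExperiment` bundles the assay matrix with row (feature) and column (sample) annotations, so they cannot drift apart or be misinterpreted.

```{r}
#| label: sesketch
#| echo: TRUE
## Toy SummarizedExperiment: assay matrix plus feature and sample annotations
library("SummarizedExperiment")
se = SummarizedExperiment(
  assays  = list(exprs = matrix(rnorm(12), nrow = 4,
    dimnames = list(paste0("probe", 1:4), paste0("sample", 1:3)))),
  rowData = DataFrame(symbol = paste0("GENE", 1:4)),
  colData = DataFrame(condition = c("ctrl", "ctrl", "treated"))
)
se
```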
## Matrices versus dataframes
Representation of the matrix may also simply be more natural and mathematically stringent - e.g., if we want to apply PCA / dimension reduction, or classification.
# Summary
There are two types of variation in an experiment or study: those of interest and those that are unwanted.
We usually cannot rid ourselves of all the unwanted variation, but we saw how balanced randomized designs and data transformations can mitigate its impact.
Reproducible workflows
Open science: make data and code available
# Acknowledgments
::: {.columns}
::: {.column width="50%"}
![Susan Holmes](fig/microstats_holmes.jpg)
:::
:::{.column width="50%"}
![$\quad\quad\quad\quad\quad\quad\quad\quad$Julia Philipp](fig/EulsLXGS_400x400.jpg){width="65%"}
:::
:::
Quarto sources for this talk:
[https://github.com/wolfganghuber/Best-Analysis-Practices-Talk](https://github.com/wolfganghuber/Best-Analysis-Practices-Talk)
## References