-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathoverview_hts.qmd
93 lines (69 loc) · 9.33 KB
/
overview_hts.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
---
title: "Overview High-throughput Sequencing"
---
### DNA Sequencing
#### Sanger sequencing
- With Sanger's invention of what is now called *Sanger sequencing* (1977) it became possible to read off the sequence of DNA molecules. Typically, one can read several hundreds of bases from one end towards the other end.
- For Sanger sequence, one needs a huge amount of copies of the same DNA fragments.
- Fortunately, it is easy to exponentially amplify DNA with the polymerase-chain reaction (PCR) which is able to duplicate DNA molecules and so doubles the number of molecules in each reaction cycle.
- Availability of commercial Sanger sequencers (from 1984 on) with read-out via fluorescent light (in 4 colours) made it possible to sequence, e.g., the entire human genome ("completed" 2003).
- Sanger sequencing works in solution. The liquid solution must contain copies of only one sequence; as one otherwise gets an unreadable mixture of sequencing signals from the different sequences.
#### High-throughput sequencing
- In the early 2000s, ideas came up to fix the DNA fragments to a surface and perfrom the DNA such that the copies attach to the surface in clsoe proximity.
- Thus, one can spread many different DNA fragments over the surface and amplify each fragment into a "cluster" of copies. As the surface offers space for millions of spatially separated clusters, one can obtain the sequences of millions of different DNA fragments in parallel.
- Such high-throughput sequencing (HTS) machines became available in the 2000s and have sense continuously improved in throughput and cost effectiveness.
- The method described above has been commercialized by Solexa, which was then bought by Illumina. It was not the first, but one of the first HTS methods. Illumina is still market leader for HTS machines.
- Now, it is possible to obtain a human subject's (nearly) complete genome within two days for a few hundred euro.
- The approach is to obtain DNA from many cells and fragment it into many pieces of a few hundred base-pairs (bp) length.
- The number of sequencing reads that overlap a give base pair in the genome is Poisson distributed.
- The expected number of fragments (the "coverage" rate) depends on the total number of sequencing reads obtained.
- Therefore, a trade-off has to be made between cost-effectiveness and completeness.
#### Transcriptomics / RNA sequencing
- All cells of an individual have the same DNA, but they differ vastly in shape and function. (Compare a blood cell with a skin cell.)
- This is because different cells express (i.e., transcribe) different genes, because they need different proteins.
- Which genes a cell expresses and to what extent is regulated by complicated biochemical reactions.
- Hence, a cell's type and state is reflected to a good extent by its RNA content, also called its "transcriptome".
- Therefore, a comparably simple way to get a large amount of information about what is goin on in a biological sample is to extract the RNA from all the cells and sequence it.
- RNA can be readily transformed into complimentary DNA (using a process called reverse transcription, that can easily performed in vitro), and this DNA is the sequenced using HTS.
#### Analysis of RNA-Seq data
- When performin RNA sequencing, one obtains a very long list with sequences of the RNA fragments (the "reads"), typically as text file in FASTQ format.
- One then *aligns* the reads to the genome: A special program, called a *short-read aligner* is used to search for the sequence of each read in the genome (i.e., in a reference assembly). Of course, the aligner has to search through the chromosome sequences and their reverse complements.
- The aligner should be splice-aware, i.e., expect that a read might strecth over an intron and hence align to two places with a gap in between.
- The aligner produces a file (typically in SAM format) that give for each read the position in the genome where the sequence matched. Typically, the chromosome name is given, the position in the chromosome (as an integer) where the read's match starts, the strand ("+" for match to the sequence as given, "-" for match to the reverse complement), followed by information on skipped parts (introns), descripancies between reference and read (mismatched bases) etc.
- A portion of the reads will stay "unmapped" (i.e., the aligner could not find their sequence in the genome), and some reads might match several loci (positions) in the genome.
- Using a genome annotation file (which contains the genetic coordinates of all genes with their transcripts and exons), another software then determines for each read which gene it is from.
- In the end, we get for each gene a *count*: a number telling us how many sequencing reads were seen that mapped to this gene.
#### Bulk RNA-Seq
- Usually one performs RNA-Seq in order to *compare* the transcriptome of differend types of samples.
- For example, one might compare tumor samples with samples from adjacent normal tissues. In order to make sure that results are general and not specific to on subject, one needs sample pairs from several, or, ideally, many, patients. Genes whose expression is constistently stronger in the tumor samples than in the corresponding normal samples might be the genes that cause the tumour's malignancy.
- Other useful comparisons might be
- Compare tissue samples from mice that have been treated with some drug with "control mice" which have received a placebo, to see how the drug influences the cells' function.
- Compare samples from different tissues, to understand the difference between them.
- and many more
- In most cases, the goal is to find *differentiaklly expressed genes* (DGEs), i.e., genes whose expression strength differs in a consistent (i.e., statistically significant) manner between the sample groups
#### Single-cell RNA-Seq
- Tissues comprise many cell types.
- For example, skin contain keratinocyte (which make up the skin's sturdy outer layer), other epithelial cells (comprising e.g. the soft lower layer), cells making up the hair follicles, even muscle cells that can raise the hairs and, of course, the endothelial cells that line the blood capillaries and the various blood cells inside these, as well as receptor and nerve cells so that we can sense touch, heat, cold and pain etc.
- If we find genes that are differentially expressed between, say, healthy skin and skin affected by a disease -- which of these cell types is affected? Is the gene upregulated (i.e., expressed more strongly) in all cell types or only in some.
- We would need to know which of the sequencing reads come from which cell.
In *single-cell RNA-Seq* we can tell whether two reads come from the same cell or from two different cells.
There are at least three possible ways to achieve this.
- Plate-based methods: A microtiter plate is used (a grid-like arrangement of many miniature "test tubes" called "wells") and a flow cytometer is used to place exactly one cell into each well.
- Microfluidic methods: An emulsion of water droplets in oil is produced using microfluidics such that each water-in-oil droplet contains (ideally) at most one cell.
- Combinatorial indexing: We skip the explanation for these.
Each compartment (well or droplet) gets a "DNA barcode oligo": a short piece of DNA with a random sequence that gets incorporated into the cDNA during reverse transcription. Thus, each sequenced fragment contains teh sequence of the RNA fragment, together with this "cell barcode" that identifies the compartment (and hence the cell) that the read came from.
We will get back to details on how this works.
#### Preprocessing of single-cell RNA-Seq data
- As before, each reads gets aligned to the genome. Afterwards, the gene that the read maps to is identified.
- Additionally, the cell barcode is analysed. A list of all cell barcodes that appear in the sample is generated, and each read is assigned to one element in this list.
- The final result is a *count matrix*:
- The rows of the matrix correspond to genes.
- The columns correspond to cell barcodes and hence cells.
- The matrix entries tell how many reads have been found that mapped to this gene and seemed to originated from the cell with this barcode.
There are a number of complications that we will need to discuss:
- A barcode may be corrupted by a read error, causing a read to be assigned to the wrong cell.
- A compartment may not have contained a cell but still contained a few RNA molecules that were present in the solution, stemming from destroyed cells with broken cell membrane. Such barcode will have very few reads assigned to them, and we have to recognize them as not being cells.
- A compartment might have contained more than one cell, and thus a mixture of RNA from several cells of possibly different type. This is called a "doublet".
The main issue, however, is: The process is very lossy. A typical mammalian cell contains a several hundred thousands of mRNA molecules, but we usually only get a few thousand (microfluidics methods) or at best up to around 20,000 reads (plate-based methods).
Which molecules get sequenced is a random process, and we need to take care of the "counting noise" thus produced.
Many genes are produced by the cells at very low copy numbers, and with the selection during sequencing, we will very frequently end up with a zero count. The count matrix is therefore very *sparse* (i.e., a large fraction of the matrix elements have the value 0).