This repository has been archived by the owner on Sep 8, 2018. It is now read-only.

Merge pull request #1 from mgonzalezporta/2013.12-Norwich
2013.12 norwich
mgonzalezporta committed Dec 16, 2013
2 parents 3178b2d + 7f0b2f8 commit 1b74d43
Showing 25 changed files with 339 additions and 109 deletions.
28 changes: 17 additions & 11 deletions README.md
@@ -3,7 +3,7 @@ This tutorial will illustrate how to use standalone tools, together with R and B

We will be working with a subset of a publicly available dataset from *Drosophila melanogaster*, which is available both in the Short Read archive ([SRP001537](http://www.ebi.ac.uk/ena/data/view/SRP001537) - raw data) and in Bioconductor ([pasilla package](http://www.bioconductor.org/packages/release/data/experiment/html/pasilla.html) - processed data). For more information about this dataset please refer to the original publication ([Brooks et al. 2010](http://genome.cshlp.org/content/early/2010/10/04/gr.108662.110)).

The tools and R packages that we will be using during the practical are listed below (see [Software requirements](https://github.com/mgonzalezporta/TeachingMaterial#software-requirements)) and the necessary data files can be found [here](http://www.ebi.ac.uk/~mar/courses/RNAseq.tar.gz). After downloading and uncompressing the `tar.gz` file, you should have the following directory structure on your computer:
The tools and R packages that we will be using during the practical are listed below (see [Software requirements](https://github.com/mgonzalezporta/TeachingMaterial#software-requirements)) and the necessary data files can be found [here](http://www.ebi.ac.uk/~mar/courses/RNAseq_all.tar.gz). After downloading and uncompressing the `tar.gz` file, you should have the following directory structure on your computer:

```
RNAseq
@@ -29,14 +29,19 @@ This work is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unpor
1. [The SAM/BAM format](doc/21.bam.md)
1. [Visualising aligned reads](doc/22.visualising.md)
1. [Filtering BAM files](doc/23.filtering_bam.md)
1. [Counting reads overlapping annotated genes](doc/24.counting.md)
* With htseq-count
* With R
* Alternative approaches
1. [Normalising counts](doc/25.normalising.md)
* With RPKMs
* With DESeq
1. [Differential gene expression](doc/26.de.md)
2. Gene-centric analyses:
1. [Counting reads overlapping annotated genes](doc/24.counting.md)
* With htseq-count
* With R
* Alternative approaches
1. [Normalising counts](doc/25.normalising.md)
* With RPKMs
* With DESeq2
1. [Differential gene expression](doc/26.de.md)
2. Exon-centric analyses:
1. [Differential exon usage](doc/27.deu.md)
2. Transcript-centric analyses:
1. [Identification, annotation and visualisation of splicing switch events](doc/28.se.md)

## Software requirements
*Note: depending on the topics covered in the course some of these tools might not be used.*
@@ -54,7 +59,8 @@ This work is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unpor
* [Rsamtools](http://www.bioconductor.org/packages/release/bioc/html/Rsamtools.html)
* [biomaRt](http://www.bioconductor.org/packages/release/bioc/html/biomaRt.html)
* [pasilla](http://www.bioconductor.org/packages/release/data/experiment/html/pasilla.html)
* [DESeq](http://www.bioconductor.org/packages/release/bioc/html/DESeq.html)
* [DESeq2](http://www.bioconductor.org/packages/2.13/bioc/html/DESeq2.html)
* [DEXSeq](http://www.bioconductor.org/packages/2.13/bioc/html/DEXSeq.html)

## Other resources

@@ -73,4 +79,4 @@ This work is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unpor


## Acknowledgments
This tutorial has been inspired by material developed by Ângela Gonçalves, Nicolas Delhomme, Simon Anders and Martin Morgan, whom I would like to thank and acknowledge. Special thanks must go to Ângela Gonçalves, with whom I started teaching, and Gabriella Rustici, for always finding a way to organise a new course.
This tutorial has been inspired by material developed by Ângela Gonçalves, Nicolas Delhomme, Simon Anders and Martin Morgan, whom I would like to thank and acknowledge. Special thanks must go to Ângela Gonçalves and Mitra Barzine, with whom I have been teaching, and to Gabriella Rustici, for always finding a way to organise a new course.
6 changes: 3 additions & 3 deletions doc/12.qa.md
@@ -13,7 +13,7 @@ fastqc SRR031714_1.fastq.gz SRR031714_2.fastq.gz
As a result we will obtain the file `filename_fastqc.zip`, which will be automatically unzipped into the `filename_fastqc` directory. There we will find the QA report (`fastqc_report.html`), which provides summary statistics about the numbers of reads, base calls and qualities, as well as other information (you will find a detailed explanation of all the plots in the report on the [project website](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3\%20Analysis\%20Modules/)).

**Exercise:** The information provided by the QA report will be very useful when deciding on the options we want to use in the filtering step. After checking it, can you come up with some criteria for the filtering of our file (i.e. keeping/discarding reads based on a specific quality threshold)?
[Solution](https://github.com/mgonzalezporta/TeachingMaterial/blob/master/solutions/_qa_ex1.md)
[Solution](../solutions/_qa_ex1.md)

**Exercise:** As we have seen in the previous section, fastq files contain information on the quality of the read sequence. The reliability of each nucleotide in the read is measured using the Phred quality score, which represents the probability of an incorrect base call:

Expand All @@ -22,8 +22,8 @@ As a result we will obtain the file `filename_fastqc.zip`, which will be automat
where `Q` is the Phred quality value and `P` the probability of error. For example, a Phred quality score of 20 would indicate a probability of error in the base call of 1 in 100 (i.e. 99% accuracy). If you inspect the fastq file again though, you will see that this information is not displayed in number format, but is encoded in a set of characters.
During the filtering step, we will be using tools that read these characters and transform them into quality values, so we need to be sure first about the encoding format used in our data (either phred 33 or phred 64).
Using the information provided in the QA report (under the *per base sequence quality* section) and in the Wikipedia entry for the [FASTQ format](http://en.wikipedia.org/wiki/FASTQ_format), can you guess which encoding format was used?
[Solution](https://github.com/mgonzalezporta/TeachingMaterial/blob/master/solutions/_qa_ex2.md)
[Solution](../solutions/_qa_ex2.md)
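The decoding described above can be sketched directly in the shell. This is a generic illustration with an arbitrary quality character (`h`), not a hint about which offset our data actually uses:

```shell
# Q = ASCII(char) - offset, where the offset is 33 (phred 33) or 64 (phred 64);
# the error probability is then P = 10^(-Q/10).
char='h'   # arbitrary example character, ASCII 104
for offset in 33 64; do
    q=$(( $(printf '%d' "'$char") - offset ))
    awk -v q="$q" -v o="$offset" \
        'BEGIN { printf "phred %d: Q=%d, P=%.6f\n", o, q, 10^(-q/10) }'
done
```

Under phred 33 this character would decode to an implausibly high Q=71, while under phred 64 it gives Q=40 (P=0.0001) — exactly the kind of sanity check that helps tell the two encodings apart.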

**Exercise:** As we have seen, a visual interpretation of the QA report is a very useful practice when dealing with HTS data. However, it becomes a very tedious task if we are working with huge volumes of data (imagine we have 1000 fastq files to inspect!). Thankfully, the developers of FastQC have thought of that. Can you spot any alternative output of this software that we could use in this situation?
[Solution](https://github.com/mgonzalezporta/TeachingMaterial/blob/master/solutions/_qa_ex3.md)
[Solution](../solutions/_qa_ex3.md)
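One hedged sketch of that idea: alongside the HTML report, each FastQC output includes a plain-text `summary.txt` (one tab-separated PASS/WARN/FAIL line per module) that can be scanned programmatically. Assuming the report directories have been extracted into the working directory:

```shell
# Print every failed module across all extracted FastQC report directories;
# each summary.txt line holds a STATUS, a module name and a file name.
scan_failures() {
    for s in */summary.txt; do
        [ -e "$s" ] || continue          # skip if the glob matched nothing
        grep '^FAIL' "$s" | awk -v f="$s" '{ print f ": " $0 }'
    done
}
# e.g.: scan_failures
```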

22 changes: 11 additions & 11 deletions doc/13.filtering_fastq.md
@@ -1,5 +1,5 @@
## Filtering FASTQ files
After analysing the QA report, one might want to discard some of the reads based on several criteria, such as quality and nucleotide composition. We will use two different tools to perform these filtering steps: [PRINSEQ](http://prinseq.sourceforge.net/) and [fastqc-mcf](https://code.google.com/p/ea-utils/).
After analysing the QA report, one might want to discard some of the reads based on several criteria, such as quality and nucleotide composition. We will use two different tools to perform these filtering steps: [PRINSEQ](http://prinseq.sourceforge.net/) and [fastq-mcf](https://code.google.com/p/ea-utils/).

PRINSEQ offers a wide range of options for filtering and we can learn about them in the manual:

@@ -28,24 +28,24 @@ zcat SRR031714_2.fastq.gz | prinseq-lite \
```

**Exercise:** Which are the criteria that we are using to discard reads?
[Solution](https://github.com/mgonzalezporta/TeachingMaterial/blob/master/solutions/_filtering_fastq_ex1.md)
[Solution](../solutions/_filtering_fastq_ex1.md)

**Exercise:** Which extra option would we have had to use if our files had been in phred 64 format?
[Solution](https://github.com/mgonzalezporta/TeachingMaterial/blob/master/solutions/_filtering_fastq_ex2.md)
[Solution](../solutions/_filtering_fastq_ex2.md)

**Exercise:** After filtering the fastq file, it is not a bad idea to obtain a QA report again to decide whether we are happy with the results. Do you think we got rid of the main issues spotted initially?
[Solution](https://github.com/mgonzalezporta/TeachingMaterial/blob/master/solutions/_filtering_fastq_ex3.md)
[Solution](../solutions/_filtering_fastq_ex3.md)

**Exercise:** It is usually also useful to keep track of the number of reads in the fastq file both before and after the filtering step. Can you gather this information from the FastQC reports? Given that we are working with paired-end data, do you see any limitation?
Now imagine we were working with a larger number of files; can you come up with a more automated way to check that?
[Solution](https://github.com/mgonzalezporta/TeachingMaterial/blob/master/solutions/_filtering_fastq_ex4.md)
[Solution](../solutions/_filtering_fastq_ex4.md)
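A hedged shell sketch of that automation: every fastq record spans exactly four lines, so dividing the line count by four gives the number of reads.

```shell
# Count the reads in a (possibly gzipped) fastq file: lines / 4.
# zcat -f passes uncompressed files through unchanged.
count_reads() {
    zcat -f "$1" | wc -l | awk '{ print $1 / 4 }'
}
# e.g.: for fq in SRR031714_*.fastq.gz; do echo "$fq: $(count_reads "$fq") reads"; done
```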

The filtering step might become complicated if you are working with paired-end data, since you have to be sure that both fastq files (one for each read pair) contain the same number of reads in the same order. There are some tools available that simplify this task, for example, fastqc-mcf. We can print a list of the available options just by typing in the name of the tool:
The filtering step might become complicated if you are working with paired-end data, since you have to be sure that both fastq files (one for each read pair) contain the same number of reads in the same order. There are some tools available that simplify this task, for example fastq-mcf. We can print a list of the available options just by typing in the name of the tool:
```bash
fastqc-mcf
fastq-mcf
```

We observe that one of the main functionalities of fastqc-mcf is to remove adapters from the fastq file. For this reason we need to provide a fasta file with the adapter sequences. In our case, we have decided to check only for the standard Illumina paired-end adapter:
We observe that one of the main functionalities of fastq-mcf is to remove adapters from the fastq file. For this reason we need to provide a fasta file with the adapter sequences. In our case, we have decided to check only for the standard Illumina paired-end adapter:
```bash
cat adapters.fa
```
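For illustration, a minimal adapters file can be written by hand — FASTA is just a `>`-prefixed name line followed by the sequence. The sequence below is the well-known Illumina adapter prefix; the file name is arbitrary, and you should use whatever adapter matches your library preparation:

```shell
# Write and inspect a one-entry FASTA adapter file:
printf '>illumina_PE_adapter\nAGATCGGAAGAGC\n' > adapters_example.fa
cat adapters_example.fa
```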
@@ -58,10 +58,10 @@ fastq-mcf adapters.fa SRR031714_1.fastq.gz SRR031714_2.fastq.gz \
```

**Exercise:** What does the QA report look like this time? Do we have the same number of reads in each file?
[Solution](https://github.com/mgonzalezporta/TeachingMaterial/blob/master/solutions/_filtering_fastq_ex5.md)
[Solution](../solutions/_filtering_fastq_ex5.md)

**Exercise:** Something else that one might want to check is the read length. How long were our reads before we started with the filtering step? How long are they now?
[Solution](https://github.com/mgonzalezporta/TeachingMaterial/blob/master/solutions/_filtering_fastq_ex6.md)
[Solution](../solutions/_filtering_fastq_ex6.md)
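A hedged sketch for answering this from the command line: sequence lines sit at positions 2, 6, 10, ... of a fastq file, so tallying their lengths gives the read-length distribution.

```shell
# Tabulate read lengths in a (possibly gzipped) fastq file;
# NR % 4 == 2 selects the sequence line of every record.
read_lengths() {
    zcat -f "$1" \
        | awk 'NR % 4 == 2 { len[length($0)]++ }
               END { for (l in len) print "length " l ": " len[l] " reads" }'
}
# e.g.: read_lengths SRR031714_1.fastq.gz
```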

Depending on the filtering options used, we might end up with a set of reads of different lengths. *A priori*, this should not be a limitation, but we might encounter some difficulties in the downstream analyses, depending on the tools we want to use. For this reason, if we have a clear idea of what we want to do with the data, it is always good practice to check the requirements of the tools that we are planning to use before taking any steps. If that is not the case, in order to be on the safe side, we can use filtering options that affect all reads equally, e.g.:

@@ -80,7 +80,7 @@ zcat SRR031714_1.fastq.gz | prinseq-lite \
```

**Exercise:** Let us generate the QA reports one last time. How do they compare to the previous ones?
[Solution](https://github.com/mgonzalezporta/TeachingMaterial/blob/master/solutions/_filtering_fastq_ex7.md)
[Solution](../solutions/_filtering_fastq_ex7.md)

In conclusion, there are many filtering combinations that you can apply, and the specific options will largely depend on the type of data and the downstream analyses. We recommend checking the [PRINSEQ manual](http://prinseq.sourceforge.net/manual.html) for a nice overview of the topic.

2 changes: 1 addition & 1 deletion doc/14.demultiplexing.md
@@ -16,7 +16,7 @@ In order to separate the reads in 4 different fastq files (one for each barcode/
fastq-multx
```

Although we already kow wehere our barcodes are located within the read, from the documentation we observe that fastq-multx will attempt to guess the position for us. Let us try the automatic guessing with the following command:
Although we already know where our barcodes are located within the read, from the documentation we observe that fastq-multx will attempt to guess the position for us. Let us try the automatic guessing with the following command:

```bash
fastq-multx barcodes.txt barcoded.fastq -o %.barcoded.fastq
52 changes: 32 additions & 20 deletions doc/25.normalising.md
@@ -31,39 +31,51 @@ A common way to normalise reads is to convert them to RPKMs (or FPKMs in the cas

[Solution](https://github.com/mgonzalezporta/TeachingMaterial/blob/master/solutions/_normalising_ex1.md)
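The RPKM arithmetic itself is simple enough to check by hand; here is a toy example with hypothetical numbers (500 reads falling on a 2 kb gene, in a library of 10 million mapped reads):

```shell
# RPKM = reads on the gene / (gene length in kb * mapped reads in millions)
awk 'BEGIN {
    reads  = 500
    len_kb = 2000 / 1000        # 2 kb gene
    lib_m  = 10000000 / 1e6     # 10 million mapped reads
    printf "RPKM = %.1f\n", reads / (len_kb * lib_m)
}'
```

This prints `RPKM = 25.0`: 500 reads divided by (2 kb × 10 million reads).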

Such a count normalisation is suited for visualisation, but sub-optimal for further analyses. A better way of normalising the data is to use either the *edgeR* or *DESeq* packages.
Such a count normalisation is suited for visualisation, but sub-optimal for further analyses. A better way of normalising the data is to use either the *edgeR* or *DESeq2* packages.

### With DESeq
RPKM normalisation is not the most adequate for certain types of downstream analysis (e.g. differential gene expression), given that it is susceptible to library composition biases. There are many other normalisation methods that should be considered with that goal in mind (see [Dillies et al. 2012](http://bib.oxfordjournals.org/content/early/2012/09/15/bib.bbs046.long) for a comparison). In this section we are going to explore the one offered within the *DESeq* package:
### With DESeq2
RPKM normalisation is not the most adequate for certain types of downstream analysis (e.g. differential gene expression), given that it is susceptible to library composition biases. There are many other normalisation methods that should be considered with that goal in mind (see [Dillies et al. 2012](http://bib.oxfordjournals.org/content/early/2012/09/15/bib.bbs046.long) for a comparison). In this section we are going to explore the one offered within the *DESeq2* package:

```rconsole
library(DESeq)
# create an object of class CountDataSet, which is the data container used by DESeq
# remember we have already loaded the pasilla dataset in the previous section
cds = newCountDataSet(counts(pasillaGenes), phenoData(pasillaGenes)$condition)
cds
# the CountDataSet class in a container for high-throughput assays and experimental metadata
# the counts can be accessed with the counts function
head(counts(cds))
# and the phenotypic data with the pData function
pData(cds)
library("DESeq2")
library("pasilla")
# load the count data
# you can skip this if you have already loaded the data in the previous section
data("pasillaGenes")
countData=counts(pasillaGenes)
# load the experimental design
colData=data.frame(condition=pData(pasillaGenes)[,c("condition")])
# create an object of class DESeqDataSet, which is the data container used by DESeq2
dds=DESeqDataSetFromMatrix(
    countData = countData,
    colData = colData,
    design = ~ condition)
colData(dds)$condition=factor(colData(dds)$condition,
    levels=c("untreated","treated"))
dds
# the DESeqDataSet class is a container for the information we just provided
head(counts(dds))
colData(dds)
design(dds)
```

In order to normalise the raw counts we will start by determining the relative library sizes, or *size factors* of each library. For example, if the counts of the expressed genes in one sample are, on average, twice as high as in another, the size factor for the first sample should be twice as large as the one for the other sample. These size factors can be obtained with the function *estimateSizeFactors*:
In order to normalise the raw counts we will start by determining the relative library sizes, or *size factors* for each library. For example, if the counts of the expressed genes in one sample are, on average, twice as high as in another, the size factor for the first sample should be twice as large as the one for the other sample. These size factors can be obtained with the function *estimateSizeFactors*:

```rconsole
cds=estimateSizeFactors(cds)
sizeFactors(cds)
dds=estimateSizeFactors(dds)
sizeFactors(dds)
```
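To make the intuition concrete, here is a toy median-of-ratios computation with hypothetical counts (three genes, two samples); DESeq2 applies the same idea over the full count matrix:

```shell
# Per gene: divide each sample's count by the geometric mean across samples;
# a sample's size factor is the median of its ratios. Sample B carries twice
# sample A's counts throughout, so the size factors differ by a factor of 2.
awk 'BEGIN {
    split("10 20 30", A); split("20 40 60", B)   # toy counts, genes 1..3
    for (i = 1; i <= 3; i++) {
        g = sqrt(A[i] * B[i])                    # per-gene geometric mean
        rA[i] = A[i] / g; rB[i] = B[i] / g
    }
    # every gene yields the same ratio here, so the median equals rA[1], rB[1]
    printf "sizeFactor(A) = %.4f\n", rA[1]
    printf "sizeFactor(B) = %.4f\n", rB[1]
    printf "B/A = %.1f\n", rB[1] / rA[1]
}'
```

This prints size factors of 0.7071 and 1.4142, whose ratio is 2.0, matching the two-fold difference in sequencing depth between the toy samples.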

Once we have this information, the normalised data is obtained by dividing each column of the count table by the corresponding size factor. We can perform this calculation by calling the function *counts* with a specific argument as follows:

```rconsole
deseq_ncounts=counts(cds, normalized=TRUE)
deseq_ncounts=counts(dds, normalized=TRUE)
```

**Exercise:** We have now accumulated three different versions of the same dataset: the raw counts (`counts`), the RPKM quantifications (`rpkm`) and the DESeq normalised counts (`deseq_ncounts`). How would you visualise the performance of each normalisation method in getting rid of the variation that does not associate to the experimental conditions that are being evaluated?
[Solution](https://github.com/mgonzalezporta/TeachingMaterial/blob/master/solutions/_normalising_ex2.md)
