Merge pull request #1913 from lldelisle/atac_update_pgt

ATAC-Seq update new version of pgt
galaxyproject · Jun 11, 2020 · 0ef8eb3 · 0ef8eb3
2 parents 9f66473 + 151349b
commit 0ef8eb3
Show file tree

Hide file tree

Showing 7 changed files with 1,497 additions and 1,256 deletions.
diff --git a/topics/epigenetics/images/atac-seq/pyGenomeTracksOutput.png b/topics/epigenetics/images/atac-seq/pyGenomeTracksOutput.png
diff --git a/topics/epigenetics/images/atac-seq/pyGenomeTracksOutput_100M_zo.png b/topics/epigenetics/images/atac-seq/pyGenomeTracksOutput_100M_zo.png
diff --git a/topics/epigenetics/images/atac-seq/pyGenomeTracksOutput_200M_zo.png b/topics/epigenetics/images/atac-seq/pyGenomeTracksOutput_200M_zo.png
diff --git a/topics/epigenetics/images/atac-seq/pyGenomeTracksOutput_20M.png b/topics/epigenetics/images/atac-seq/pyGenomeTracksOutput_20M.png
diff --git a/topics/epigenetics/images/atac-seq/pyGenomeTracksOutput_20M_zo.png b/topics/epigenetics/images/atac-seq/pyGenomeTracksOutput_20M_zo.png
diff --git a/topics/epigenetics/tutorials/atac-seq/tutorial.md b/topics/epigenetics/tutorials/atac-seq/tutorial.md
@@ -35,7 +35,7 @@ In this tutorial we will use data from the study of {% cite Buenrostro2013 %}, t
 
 ### When working with real data
 {:.no_toc}
-When you use your own data we suggest you to use [this workflow](https://usegalaxy.eu/u/heylf/w/atac-seq-gtm-with-control) which includes the same steps but is compatible with replicates. If you do not have any control data you can import and edit this workflow, removing all steps with the controls. Controls for the ATAC-Seq procedure are not commonly performed, as discussed [here](https://informatics.fas.harvard.edu/atac-seq-guidelines.html), but could be ATAC-Seq of purified DNA.
+When you use your own data we suggest you to use [this workflow](https://usegalaxy.eu/u/ldelisle/w/atac-seq-gtm-with-control-and-macs2) which includes the same steps but is compatible with replicates. If you do not have any control data you can import and edit this workflow, removing all steps with the controls. Controls for the ATAC-Seq procedure are not commonly performed, as discussed [here](https://informatics.fas.harvard.edu/atac-seq-guidelines.html), but could be ATAC-Seq of purified DNA.
 
 > ### Agenda
 >
@@ -77,7 +77,7 @@ We first need to download the sequenced reads (FASTQs) as well as other annotati
 >
 >    {% include snippets/add_tag.md %}
 >
-> 4. Check that the datatype of the 2 FASTQ files is `fastqsanger.gz` and the BED file is `bed`. If they are not then change the datatype as described below.
+> 4. Check that the datatype of the 2 FASTQ files is `fastqsanger.gz` and the peak file (ENCFF933NTR.bed.gz) is `encodepeak`. If they are not then change the datatype as described below.
 >
 >    {% include snippets/change_datatype.md datatype="datatypes" %}
 >
@@ -87,8 +87,8 @@ We first need to download the sequenced reads (FASTQs) as well as other annotati
 > If you are not familiar with FASTQ format, see the [Quality Control tutorial]({% link topics/sequence-analysis/tutorials/quality-control/tutorial.md %})
 {: .comment}
 >
-> ### {% icon comment %} BED format
-> If you are not familiar with BED format, see the [BED Format](https://genome.ucsc.edu/FAQ/FAQformat.html)
+> ### {% icon comment %} BED / encode narrowPeak format
+> If you are not familiar with BED format or encode narrowPeak format, see the [BED Format](https://genome.ucsc.edu/FAQ/FAQformat.html)
 {: .comment}
 
 We will visualise regions later in the analysis and obtain the gene information now. We will get information for chromosome 22 genes (names of transcripts and genomic positions) using the UCSC tool.
@@ -194,7 +194,10 @@ The first step is to check the quality of the reads and the presence of the Next
 > > >
 > > >    3. **Overrepresented sequences**
 > > >
-> > >       Nextera adapter sequences are observable in the **Adapter Content** section.
+> > >       One sequence is over represented: 
+> > >       you have 306 reads which are exactly the sequence of the Nextera adapter.
+> > >       They correspond to adapters amplified head-to-head.
+> > >       306 is really low (only 0.1% of reads).
 > > >
 > > {: .solution}
 > >
@@ -274,7 +277,7 @@ The forward and reverse adapters are slightly different. We will also trim low q
 {: .hands_on}
 
 > ### {% icon comment %} FastQC Results
-> If we run FastQC again we should see under **Adapter Content** that the Nextera adapters are no longer present.
+> If we run FastQC again we should see under **Overrepresented sequences** that there is no more overrepresented sequences and under **Adapter Content** that the Nextera adapters are no longer present.
 > ![FastQC screenshot on the adapter content section after cutadapt](../../images/atac-seq/Screenshot_fastqcAftercutadapt.png "FastQC screenshot on the adapter content section after cutadapt")
 {: .comment}
 
@@ -394,7 +397,7 @@ High numbers of mitochondrial reads can be a problem in ATAC-Seq. Some ATAC-Seq
 >
 > ![Samtools idxstats result](../../images/atac-seq/Screenshot_samtoolsIdxStatsChrM.png "Samtools idxstats result")
 >
-> There are 221 000 reads which map to chrM and 170 000 which map to chr22.
+> There are 220 000 reads which map to chrM and 165 000 which map to chr22.
 {: .tip}
 
 ## Filter Duplicate Reads
@@ -645,16 +648,6 @@ In order to visualise a specific region (e.g. the gene *RAC2*), we can either us
 >
 {: .hands_on}
 
-### Convert the Genrich peaks to BED
-At the moment, **pyGenomeTracks** does not deal with the datatype encodepeak which is a special bed.
-
-> ### {% icon hands_on %} Hands-on: Change the datatype
-> 1. Change the datatype of the output of **Genrich** from encodepeak to bed.
->
->    {% include snippets/change_datatype.md datatype="bed" %}
->
-{: .hands_on }
-
 ## Create heatmap of coverage at TSS with deepTools
 
 You might also be interested in specific regions. For this, you can compute a heatmap. We will use the **deepTools plotHeatmap**. As an example, we will here make a heatmap centered on the transcription start sites (TSS).
@@ -711,9 +704,9 @@ We will now generate a heatmap. Each line will be a transcript. The coverage wil
 > > >
 > > > MACS2 coverage is very simple, each 5' is extended 100bp (+/-50bp).
 > > > Genrich coverage is evaluated in a more subtle way: if the fragment length is above 100 (the expension size), the coverage will be each 5' extended 100bp (+/-50bp), but if it is less, the coverage will be between each 5' extended 50bp (-50bp - fragment size - + 50bp):
-> > > ![macs2 vs genrich](../../images/atac-seq/Screenshot_macs2vsGenrich.png "macs2 vs genrich coverage")
+> > > ![MACS2 vs Genrich](../../images/atac-seq/Screenshot_macs2vsGenrich.png "MACS2 vs Genrich coverage")
 > > > In this example, we see on the left a pair with a long fragment size: both algorithm behave the same.
-> > > On the left a pair with a short fragment size: Genrich reports only one interval joining both extremities wheareas macs2 will still report 2 intervals even if they overlap.
+> > > On the left a pair with a short fragment size: Genrich reports only one interval joining both extremities wheareas MACS2 will still report 2 intervals even if they overlap.
 > > {: .tip}
 > {: .solution}
 >
@@ -731,49 +724,48 @@ We will now generate a heatmap. Each line will be a transcript. The coverage wil
 >                - *"Plot title"*: `Coverage from Genrich (extended +/-50bp)`
 >                - {% icon param-file %} *"Track file bigwig format"*: Select the output of **Wig/BedGraph-to-bigWig** {% icon tool %} called `Genrich bigwig`.
 >                - *"Color of track"*: Select the color of your choice
+>                - *"Minimum value"*: 0
 >                - *"height"*: `5`
 >                - *"Show visualization of data range"*: `Yes`
->                - *"Include spacer at the end of the track"*: `0.5`
 >        - {% icon param-repeat %} *"Insert Include tracks in your plot"*
->            - *"Choose style of the track"*: `Gene track / Bed track`
+>            - *"Choose style of the track"*: `NarrowPeak track`
 >                - *"Plot title"*: `Peaks from Genrich (extended +/-50bp)`
->                - {% icon param-file %} *"Track file bed format"*: Select the output of **Genrich** {% icon tool %} (the one you converted from encodepeak to bed).
+>                - {% icon param-file %} *"Track file bed format"*: Select the output of **Genrich** {% icon tool %}.
 >                - *"Color of track"*: Select the color of your choice
->                - *"height"*: `3`
->                - *"Plot labels"*: `No`
->                - *"Include spacer at the end of the track"*: `0.5`
+>                - *"display to use"*: `box: Draw a box`
+>                - *"Plot labels (name, p-val, q-val)"*: `No`
 >        - {% icon param-repeat %} *"Insert Include tracks in your plot"*
 >            - *"Choose style of the track"*: `Bigwig track `
->                - *"Plot title"*: `Coverage from macs2 (extended +/-50bp)`
+>                - *"Plot title"*: `Coverage from MACS2 (extended +/-50bp)`
 >                - {% icon param-file %} *"Track file bigwig format"*: Select the output of **Wig/BedGraph-to-bigWig** {% icon tool %} called `MACS2 bigwig`.
 >                - *"Color of track"*: Select the color of your choice
+>                - *"Minimum value"*: 0
 >                - *"height"*: `5`
 >                - *"Show visualization of data range"*: `Yes`
->                - *"Include spacer at the end of the track"*: `0.5`
 >        - {% icon param-repeat %} *"Insert Include tracks in your plot"*
->            - *"Choose style of the track"*: `Gene track / Bed track`
->                - *"Plot title"*: `Peaks from macs2 (extended +/-50bp)`
+>            - *"Choose style of the track"*: `NarrowPeak track`
+>                - *"Plot title"*: `Peaks from MACS2 (extended +/-50bp)`
 >                - {% icon param-file %} *"Track file bed format"*: Select the output of **MACS2** {% icon tool %} (narrow Peaks).
 >                - *"Color of track"*: Select the color of your choice
->                - *"height"*: `3`
->                - *"Plot labels"*: `No`
->                - *"Include spacer at the end of the track"*: `0.5`
+>                - *"display to use"*: `box: Draw a box`
+>                - *"Plot labels (name, p-val, q-val)"*: `No`
 >        - {% icon param-repeat %} *"Insert Include tracks in your plot"*
 >            - *"Choose style of the track"*: `Gene track / Bed track`
 >                - *"Plot title"*: `Genes`
 >                - {% icon param-file %} *"Track file bed format"*: `chr22 genes`
 >                - *"Color of track"*: Select the color of your choice
 >                - *"height"*: `5`
->                - *"Include spacer at the end of the track"*: `0.5`
+>                - *"Put all labels inside the plotted region"*: `Yes`
+>                - *"Allow to put labels in the right margin"*: `Yes`
 >        - {% icon param-repeat %} *"Insert Include tracks in your plot"*
->            - *"Choose style of the track"*: `Gene track / Bed track`
+>            - *"Choose style of the track"*: `NarrowPeak track`
 >                - *"Plot title"*: `CTCF peaks`
 >                - {% icon param-file %} *"Track file bed format"*: Select the dataset `bedtools SortBED of ENCFF933NTR.bed.gz`
 >                - *"Color of track"*: Select the color of your choice
->                - *"Plot labels"*: `No`
->                - *"Include spacer at the end of the track"*: `0.5`
->    - *"Configure x-axis"*: `Yes`
->        - *"Where to place the x-axis"*: `Bottom`
+>                - *"display to use"*: `box: Draw a box`
+>                - *"Plot labels (name, p-val, q-val)"*: `No`
+>        - {% icon param-repeat %} *"Insert Include tracks in your plot"*
+>            - *"Choose style of the track"*: `X-axis`
 >
 > 2. Click on the {% icon galaxy-eye %} (eye) icon of the output.
 >
@@ -790,7 +782,7 @@ Unfortunately, Genrich does not work very well with our small training dataset (
 
 
 > ### {% icon question %} Questions
-> In the ATAC-Seq sample in this selected region we see four peaks detected by Genrich.
+> In the ATAC-Seq sample in this selected region we see four peaks detected by Genrich and MACS2.
 >
 > 1. How many TSS are accessible in the sample in the displayed region?
 > 2. How many CTCF binding loci are accessible?
@@ -808,9 +800,14 @@ Unfortunately, Genrich does not work very well with our small training dataset (
 >
 {: .question}
 
-We can see that in this region both peak calling perform the same. However, when zooming out, we see that macs2 is more sensitive:
+We can see that in this region both peak calling perform the same. However, when zooming out, we see that MACS2 is more sensitive:
 ![pyGenomeTracks output for 20 million of pairs on the whole genome zoom out](../../images/atac-seq/pyGenomeTracksOutput_20M_zo.png "pyGenomeTracks output for 20 million of pairs on the whole genome zoom out").
 
+When the number of reads increases, the number of peaks with MACS2 increases but the number of peaks with Genrich decreases:
+![pyGenomeTracks output for 100 million of pairs on the whole genome zoom out](../../images/atac-seq/pyGenomeTracksOutput_100M_zo.png "pyGenomeTracks output for 100 million of pairs on the whole genome zoom out").
+![pyGenomeTracks output for 200 million of pairs on the whole genome zoom out](../../images/atac-seq/pyGenomeTracksOutput_200M_zo.png "pyGenomeTracks output for 200 million of pairs on the whole genome zoom out").
+
+
 As CTCF binds so ubiquitously and by itself can displace the nucleosome creating accessible regions, a region containing a peak with no corresponding CTCF peak or TSS could be a putative enhancer. In the pyGenomeTracks plot we see a region like this located in the intron of a gene and another one between genes. However, it is impossible to guess from the position which would be the gene controlled by this region. And of course, more analyses are needed to assess if it is a real enhancer, for example, histone ChIP-seq, 3D structure, transgenic assay, etc.