Skip to content

Commit

Permalink
Merge branch 'develop' into 'master'
Browse files Browse the repository at this point in the history
Release 1.0.0

See merge request tron/tron-bam-preprocessing!2
  • Loading branch information
Pablo Riesgo Ferreiro committed Apr 15, 2021
2 parents 786b16f + 013bf4e commit 6f82d87
Show file tree
Hide file tree
Showing 19 changed files with 76,491 additions and 137 deletions.
8 changes: 8 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1 +1,9 @@
.idea
work
output
.nextflow*
report.html*
timeline.html*
trace.txt*
dag.dot*
*.swp
21 changes: 21 additions & 0 deletions .gitlab-ci.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
image: openjdk:11.0.10-jre-buster


before_script:
- java -version
- apt-get update && apt-get --assume-yes install wget make procps
- wget -qO- https://get.nextflow.io | bash && cp nextflow /usr/local/bin/nextflow
- nextflow help
- wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
- mkdir /root/.conda
- bash Miniconda3-latest-Linux-x86_64.sh -b && cp /root/miniconda3/bin/* /usr/local/bin/
- rm -f Miniconda3-latest-Linux-x86_64.sh
- conda --version

stages:
- test

test:
stage: test
script:
- make clean test
15 changes: 15 additions & 0 deletions Makefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
clean:
rm -rf output
#rm -rf work
rm -f report.html*
rm -f timeline.html*
rm -f trace.txt*
rm -f dag.dot*
rm -f .nextflow.log*
rm -rf .nextflow*

test:
nextflow main.nf -profile test,conda --output output/test1
nextflow main.nf -profile test,conda --skip_bqsr --output output/test2
nextflow main.nf -profile test,conda --skip_realignment --output output/test3
nextflow main.nf -profile test,conda --skip_deduplication --output output/test4
30 changes: 17 additions & 13 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# tron-bam-preprocessing
# TRONflow BAM preprocessing pipeline

Nextflow pipeline for the preprocessing of BAM files based on Picard and GATK.

Expand All @@ -9,24 +9,21 @@ In order to have a variant calling ready BAM file there are a number of operatio

GATK has been providing a well known best practices document on BAM preprocessing, the latest best practices for GATK4 (https://software.broadinstitute.org/gatk/best-practices/workflow?id=11165) does not perform anymore realignment around indels as opposed to best practices for GATK3 (https://software.broadinstitute.org/gatk/documentation/article?id=3238). This pipeline is based on both Picard and GATK. These best practices have been implemented a number of times, see for instance this implementation in Workflow Definition Language https://github.com/gatk-workflows/gatk4-data-processing/blob/master/processing-for-variant-discovery-gatk4.wdl.

At TRON we have a number of implementations of the BAM preprocessing pipeline, each one of those varies depending on the context. For instance, the script to run mutect has this pipeline embedded, see /code/iCaM/scripts/mutect.sh. This is repeated in some other places.

## Objectives

We aim at providing a single implementation of the BAM preprocessing pipeline that can be used across different situations. For this purpose there are some required steps and some optional steps. This is implemented as a Nextflow pipeline to simplify parallelization of execution in the cluster. The default configuration uses reference genome hg19, if another reference is needed the adequate resources must be provided. The reference genome resources for hg19 are installed in /projects/data/gatk_bundle/hg19 and they were downloaded from https://software.broadinstitute.org/gatk/download/bundle
We aim at providing a single implementation of the BAM preprocessing pipeline that can be used across different situations. For this purpose there are some required steps and some optional steps. This is implemented as a Nextflow pipeline to simplify parallelization of execution in the cluster. The default configuration uses reference genome hg19, if another reference is needed the adequate resources must be provided. The reference genome resources for hg19 were downloaded from https://software.broadinstitute.org/gatk/download/bundle

The input is a configuration file so multiple BAMs can run easily. The output is another tab-separated values file with the absolute paths of the preprocessed and indexed BAMs.
The input is a tab-separated values file where each line corresponds to one input BAM. The output is another tab-separated values file with the absolute paths of the preprocessed and indexed BAMs.

## Implementation

Steps:

* **Clean BAM**. Sets the mapping quality to 0 for all unmapped reads and avoids soft clipping going beyond the reference genome boundaries. Implemented in Picard
* **Reorder chromosomes**. Makes the chromosomes in the BAM follow the same order as the reference genome. Implemented in Picard
* **Sort by query name**. Ensuring the order by query name allows to find duplicates also in the unpaired and secondary alignment reads. Implemented in Picard
* **Add read groups**. GATK requires that some headers are adde to the BAM, also we want to flag somehow the normal and tumor BAMs in the header as some callers, such as Mutect2 require it. Implemented in Picard.
* **Mark duplicates** (optional). Identify the PCR and the optical duplications and marks those reads. Implemented in Picard
* **Sort by coordinates**. This order is required by all GATK tools. Implemented in Picard
* **Mark duplicates** (optional). Identify the PCR and the optical duplications and marks those reads. This uses the parallelized version on Spark, it is reported to scale linearly up to 16 CPUs.
* **Realignment around indels** (optional). This procedure is important for locus based variant callers, but for any variant caller doing haplotype assembly it is not needed. This is computing intensive as it first finds regions for realignment where there are indication of indels and then it performs a local realignment over those regions. Implemented in GATK3, deprecated in GATK4
* **Base Quality Score Recalibration (BQSR)** (optional). It aims at correcting systematic errors in the sequencer when assigning the base call quality errors, as these scores are used by variant callers it improves variant calling in some situations. Implemented in GATK4

Expand All @@ -35,13 +32,11 @@ Steps:
## How to run it

```
-bash-4.2$ nextflow main.nf --help
$ nextflow run tron-bioinformatics/tronflow-bam-preprocessing -r v1.0.0 --help
N E X T F L O W ~ version 19.07.0
Launching `bam_preprocessing.nf` [intergalactic_shannon] - revision: e707c77d7b
Launching `main.nf` [intergalactic_shannon] - revision: e707c77d7b
Usage:
bam_preprocessing.nf --input_files input_files
This workflow is based on the implementation at /code/iCaM/scripts/mutect.sh
main.nf --input_files input_files
Input:
* input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
Expand All @@ -61,7 +56,15 @@ Optional input:
* skip_bqsr: optionally skip BQSR
* skip_realignment: optionally skip realignment
* skip_deduplication: optionally skip deduplication
* output: the folder where to publish output, if not provided they will be moved to "output" folder inside the workflow folder
* output: the folder where to publish output, if not provided they will be moved to "output" folder inside the workflow folder* prepare_bam_cpus: default 3
* platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
* prepare_bam_memory: default 8g
* mark_duplicates_cpus: default 16
* mark_duplicates_memory: default 64g
* realignment_around_indels_cpus: default 2
* realignment_around_indels_memory: default 32g
* bqsr_cpus: default 3
* bqsr_memory: default 4g
Output:
* Preprocessed and indexed BAMs
Expand All @@ -70,4 +73,5 @@ Optional input:
Optional output:
* Recalibration report
* Realignment intervals
* Duplication metrics
```
10 changes: 10 additions & 0 deletions environment.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
# You can use this file to create a conda environment for this pipeline:
# conda env create -f environment.yml
name: tronflow-bam-preprocessing-1.0.0
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- bioconda::gatk4=4.2.0.0
- bioconda::gatk=3.8
Loading

0 comments on commit 6f82d87

Please sign in to comment.