Merge branch 'develop' into 'master'

Release 1.0.0 See merge request tron/tron-bam-preprocessing!2
TRON-Bioinformatics · Apr 15, 2021 · 6f82d87 · 6f82d87
2 parents 786b16f + 013bf4e
commit 6f82d87
Show file tree

Hide file tree

Showing 19 changed files with 76,491 additions and 137 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1 +1,9 @@
 .idea
+work
+output
+.nextflow*
+report.html*
+timeline.html*
+trace.txt*
+dag.dot*
+*.swp
diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
@@ -0,0 +1,21 @@
+image: openjdk:11.0.10-jre-buster
+
+
+before_script:
+  - java -version
+  - apt-get update && apt-get --assume-yes install wget make procps
+  - wget -qO- https://get.nextflow.io | bash && cp nextflow /usr/local/bin/nextflow
+  - nextflow help
+  - wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
+  - mkdir /root/.conda
+  - bash Miniconda3-latest-Linux-x86_64.sh -b && cp /root/miniconda3/bin/* /usr/local/bin/
+  - rm -f Miniconda3-latest-Linux-x86_64.sh
+  - conda --version
+
+stages:
+  - test
+
+test:
+  stage: test
+  script:
+    - make clean test
diff --git a/Makefile b/Makefile
@@ -0,0 +1,15 @@
+clean:
+	rm -rf output
+	#rm -rf work
+	rm -f report.html*
+	rm -f timeline.html*
+	rm -f trace.txt*
+	rm -f dag.dot*
+	rm -f .nextflow.log*
+	rm -rf .nextflow*
+
+test:
+	nextflow main.nf -profile test,conda --output output/test1
+	nextflow main.nf -profile test,conda --skip_bqsr --output output/test2
+	nextflow main.nf -profile test,conda --skip_realignment --output output/test3
+	nextflow main.nf -profile test,conda --skip_deduplication --output output/test4
diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-# tron-bam-preprocessing
+# TRONflow BAM preprocessing pipeline
 
 Nextflow pipeline for the preprocessing of BAM files based on Picard and GATK.
 
@@ -9,24 +9,21 @@ In order to have a variant calling ready BAM file there are a number of operatio
 
 GATK has been providing a well known best practices document on BAM preprocessing, the latest best practices for GATK4 (https://software.broadinstitute.org/gatk/best-practices/workflow?id=11165) does not perform anymore realignment around indels as opposed to best practices for GATK3 (https://software.broadinstitute.org/gatk/documentation/article?id=3238). This pipeline is based on both Picard and GATK. These best practices have been implemented a number of times, see for instance this implementation in Workflow Definition Language https://github.com/gatk-workflows/gatk4-data-processing/blob/master/processing-for-variant-discovery-gatk4.wdl.
 
-At TRON we have a number of implementations of the BAM preprocessing pipeline, each one of those varies depending on the context. For instance, the script to run mutect has this pipeline embedded, see /code/iCaM/scripts/mutect.sh. This is repeated in some other places.
 
 ## Objectives
 
-We aim at providing a single implementation of the BAM preprocessing pipeline that can be used across different situations. For this purpose there are some required steps and some optional steps. This is implemented as a Nextflow pipeline to simplify parallelization of execution in the cluster. The default configuration uses reference genome hg19, if another reference is needed the adequate resources must be provided. The reference genome resources  for hg19 are installed in /projects/data/gatk_bundle/hg19 and they were downloaded from https://software.broadinstitute.org/gatk/download/bundle
+We aim at providing a single implementation of the BAM preprocessing pipeline that can be used across different situations. For this purpose there are some required steps and some optional steps. This is implemented as a Nextflow pipeline to simplify parallelization of execution in the cluster. The default configuration uses reference genome hg19, if another reference is needed the adequate resources must be provided. The reference genome resources  for hg19 were downloaded from https://software.broadinstitute.org/gatk/download/bundle
 
-The input is a configuration file so multiple BAMs can run easily. The output is another tab-separated values file with the absolute paths of the preprocessed and indexed BAMs.
+The input is a tab-separated values file where each line corresponds to one input BAM. The output is another tab-separated values file with the absolute paths of the preprocessed and indexed BAMs.
 
 ## Implementation
 
 Steps:
 
 * **Clean BAM**. Sets the mapping quality to 0 for all unmapped reads and avoids soft clipping going beyond the reference genome boundaries. Implemented in Picard
 * **Reorder chromosomes**. Makes the chromosomes in the BAM follow the same order as the reference genome. Implemented in Picard
-* **Sort by query name**. Ensuring the order by query name allows to find duplicates also in the unpaired and secondary alignment reads. Implemented in Picard
 * **Add read groups**. GATK requires that some headers are adde to the BAM, also we want to flag somehow the normal and tumor BAMs in the header as some callers, such as Mutect2 require it. Implemented in Picard.
- * **Mark duplicates** (optional). Identify the PCR and the optical duplications and marks those reads. Implemented in Picard
- * **Sort by coordinates**. This order is required by all GATK tools. Implemented in Picard
+ * **Mark duplicates** (optional). Identify the PCR and the optical duplications and marks those reads. This uses the parallelized version on Spark, it is reported to scale linearly up to 16 CPUs.
  * **Realignment around indels** (optional). This procedure is important for locus based variant callers, but for any variant caller doing haplotype assembly it is not needed. This is computing intensive as it first finds regions for realignment where there are indication of indels  and then it performs a local realignment over those regions. Implemented in GATK3, deprecated in GATK4
  * **Base Quality Score Recalibration (BQSR)** (optional). It aims at correcting systematic errors in the sequencer when assigning the base call quality errors, as these scores are used by variant callers it improves variant calling in some situations. Implemented in GATK4
 
@@ -35,13 +32,11 @@ Steps:
 ## How to run it
 
 ```
--bash-4.2$ nextflow main.nf --help
+$ nextflow run tron-bioinformatics/tronflow-bam-preprocessing -r v1.0.0 --help
 N E X T F L O W  ~  version 19.07.0
-Launching `bam_preprocessing.nf` [intergalactic_shannon] - revision: e707c77d7b
+Launching `main.nf` [intergalactic_shannon] - revision: e707c77d7b
 Usage:
-    bam_preprocessing.nf --input_files input_files
- 
-This workflow is based on the implementation at /code/iCaM/scripts/mutect.sh
+    main.nf --input_files input_files
  
 Input:
     * input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
@@ -61,7 +56,15 @@ Optional input:
     * skip_bqsr: optionally skip BQSR
     * skip_realignment: optionally skip realignment
     * skip_deduplication: optionally skip deduplication
-    * output: the folder where to publish output, if not provided they will be moved to "output" folder inside the workflow folder
+    * output: the folder where to publish output, if not provided they will be moved to "output" folder inside the workflow folder* prepare_bam_cpus: default 3
+    * platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
+    * prepare_bam_memory: default 8g
+    * mark_duplicates_cpus: default 16
+    * mark_duplicates_memory: default 64g
+    * realignment_around_indels_cpus: default 2
+    * realignment_around_indels_memory: default 32g
+    * bqsr_cpus: default 3
+    * bqsr_memory: default 4g
  
  Output:
     * Preprocessed and indexed BAMs
@@ -70,4 +73,5 @@ Optional input:
 Optional output:
     * Recalibration report
     * Realignment intervals
+    * Duplication metrics
 ```
diff --git a/environment.yml b/environment.yml
@@ -0,0 +1,10 @@
+# You can use this file to create a conda environment for this pipeline:
+#   conda env create -f environment.yml
+name: tronflow-bam-preprocessing-1.0.0
+channels:
+  - conda-forge
+  - bioconda
+  - defaults
+dependencies:
+  - bioconda::gatk4=4.2.0.0
+  - bioconda::gatk=3.8