Merge branch 'develop' into 'master'
Release v1.2.0

See merge request tron/tron-bam-preprocessing!8
Pablo Riesgo Ferreiro committed May 6, 2021
2 parents 930e294 + 53e9d78 commit 2e6bdae
Showing 8 changed files with 117 additions and 86 deletions.
15 changes: 8 additions & 7 deletions Makefile
@@ -8,10 +8,11 @@ clean:
rm -rf .nextflow*

test:
#nextflow main.nf -profile test,conda --output output/test1
#nextflow main.nf -profile test,conda --skip_bqsr --output output/test2
#nextflow main.nf -profile test,conda --skip_realignment --output output/test3
#nextflow main.nf -profile test,conda --skip_deduplication --output output/test4
#nextflow main.nf -profile test,conda --output output/test5 --skip_metrics
#nextflow main.nf -profile test,conda --output output/test6 --intervals false
nextflow main.nf -profile test,conda --output output/test6 --hs_metrics_target_coverage target_coverage.txt --hs_metrics_per_base_coverage per_base_coverage.txt
nextflow main.nf -profile test,conda --output output/test1
nextflow main.nf -profile test,conda --skip_bqsr --output output/test2
nextflow main.nf -profile test,conda --skip_realignment --output output/test3
nextflow main.nf -profile test,conda --skip_deduplication --output output/test4
nextflow main.nf -profile test,conda --output output/test5 --skip_metrics
nextflow main.nf -profile test,conda --output output/test6 --intervals false
nextflow main.nf -profile test,conda --output output/test7 --hs_metrics_target_coverage target_coverage.txt --hs_metrics_per_base_coverage per_base_coverage.txt
nextflow main.nf -profile test,conda --output output/test8 --hs_metrics_target_coverage target_coverage.txt --hs_metrics_per_base_coverage per_base_coverage.txt --collect_hs_metrics_min_base_quality 10 --collect_hs_metrics_min_mapping_quality 10
32 changes: 18 additions & 14 deletions README.md
@@ -1,5 +1,7 @@
# TRONflow BAM preprocessing pipeline

[![DOI](https://zenodo.org/badge/358400957.svg)](https://zenodo.org/badge/latestdoi/358400957)

Nextflow pipeline for the preprocessing of BAM files based on Picard and GATK.


@@ -32,7 +34,9 @@ Steps:

## References

The bam preprocessing workflow uses some required references (`--reference`, `--dbsnp`, `--known_indels1` and `--known_indels2`).
The bam preprocessing workflow requires the human reference genome (`--reference`).
Base Quality Score Recalibration (BQSR) requires dbSNP to avoid extracting error metrics from polymorphic sites (`--dbsnp`).
Realignment around indels requires a set of known indels (`--known_indels1` and `--known_indels2`).
These resources can be fetched from the GATK bundle https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle.

Optionally, in order to run Picard's CollectHsMetrics, an intervals file needs to be provided (`--intervals`).
@@ -41,12 +45,13 @@ This can be built from a BED file using Picard's BedToIntervalList (https://gatk
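For reference, a minimal sketch of the BedToIntervalList step mentioned above. The file names (`targets.bed`, `ucsc.hg19.dict`, `targets.interval_list`) are illustrative assumptions, and the command is only assembled and echoed here rather than executed:

```shell
# Assemble the Picard BedToIntervalList call (illustrative file names).
BED=targets.bed               # capture-kit targets in BED format
DICT=ucsc.hg19.dict           # sequence dictionary matching the --reference FASTA
OUT=targets.interval_list     # interval list to pass to --intervals

cmd="picard BedToIntervalList --INPUT $BED --SEQUENCE_DICTIONARY $DICT --OUTPUT $OUT"
echo "$cmd"
```

The resulting interval list would then be passed to the pipeline via `--intervals`.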
## How to run it

```
$ nextflow run tron-bioinformatics/tronflow-bam-preprocessing -r v1.1.0 --help
$ nextflow run tron-bioinformatics/tronflow-bam-preprocessing -r v1.2.0 --help
N E X T F L O W ~ version 19.07.0
Launching `main.nf` [intergalactic_shannon] - revision: e707c77d7b
Usage:
main.nf --input_files input_files
Input:
* --input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
Sample type will be added to the BAM header @SN sample name
@@ -55,25 +60,24 @@ Input:
name1 tumor tumor.1.bam
name1 normal normal.1.bam
name2 tumor tumor.2.bam
Optional input:
* --reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
* --dbsnp: path to the dbSNP VCF
* --known_indels1: path to a VCF of known indels
* --known_indels2: path to a second VCF of known indels
**NOTE**: if any of the above parameters is not provided, default hg19 resources under
/projects/data/gatk_bundle/hg19/ will be used
Optional input:
* --dbsnp: path to the dbSNP VCF (required to perform BQSR)
* --known_indels1: path to a VCF of known indels (optional to perform realignment around indels)
* --known_indels2: path to a second VCF of known indels (optional to perform realignment around indels)
* --intervals: path to an intervals file to collect HS metrics from, this can be built with Picard's BedToIntervalList (default: None)
* --hs_metrics_target_coverage: name of output file for target HS metrics (default: None)
* --hs_metrics_per_base_coverage: name of output file for per base HS metrics (default: None)
* --collect_hs_metrics_min_base_quality: minimum base quality for a base to contribute coverage (default: 20)
* --collect_hs_metrics_min_mapping_quality: minimum mapping quality for a read to contribute coverage (default: 20)
* --skip_bqsr: optionally skip BQSR (default: false)
* --skip_realignment: optionally skip realignment (default: false)
* --skip_deduplication: optionally skip deduplication (default: false)
* --skip_metrics: optionally skip metrics (default: false)
* --output: the folder where to publish output (default: ./output)
* --platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
Computational resources:
* --prepare_bam_cpus: (default: 3)
* --prepare_bam_memory: (default: 8g)
@@ -83,11 +87,11 @@ Computational resources:
* --realignment_around_indels_memory: (default: 32g)
* --bqsr_cpus: (default: 3)
* --bqsr_memory: (default: 4g)
Output:
* Preprocessed and indexed BAMs
* Tab-separated values file with the absolute paths to the preprocessed BAMs, preprocessed_bams.txt
Optional output:
* Recalibration report
* Realignment intervals
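Putting the new v1.2.0 options together, a hedged sketch of a full invocation exercising the CollectHsMetrics thresholds added in this release (input, reference, and intervals paths are placeholders; the command is assembled and printed, not executed — see the Makefile test targets for working examples):

```shell
# Build a v1.2.0 invocation wiring the new CollectHsMetrics quality thresholds.
cmd="nextflow run tron-bioinformatics/tronflow-bam-preprocessing -r v1.2.0"
cmd="$cmd --input_files input_files.txt"            # placeholder TSV of sample name, type, BAM path
cmd="$cmd --reference ucsc.hg19.fasta"              # placeholder reference FASTA (with .fai and .dict)
cmd="$cmd --intervals targets.interval_list"        # placeholder interval list for HS metrics
cmd="$cmd --hs_metrics_target_coverage target_coverage.txt"
cmd="$cmd --hs_metrics_per_base_coverage per_base_coverage.txt"
cmd="$cmd --collect_hs_metrics_min_base_quality 10"
cmd="$cmd --collect_hs_metrics_min_mapping_quality 10"
echo "$cmd"
```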
3 changes: 2 additions & 1 deletion environment.yml
@@ -1,10 +1,11 @@
# You can use this file to create a conda environment for this pipeline:
# conda env create -f environment.yml
name: tronflow-bam-preprocessing-1.1.0
name: tronflow-bam-preprocessing-1.2.0
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- openjdk=8.0.282
- bioconda::gatk4=4.2.0.0
- bioconda::gatk=3.8
91 changes: 30 additions & 61 deletions main.nf
@@ -3,10 +3,10 @@
publish_dir = 'output'
params.help= false
params.input_files = false
params.reference = "/projects/data/gatk_bundle/hg19/ucsc.hg19.fasta"
params.dbsnp = "/projects/data/gatk_bundle/hg19/dbsnp_138.hg19.vcf"
params.known_indels1 = "/projects/data/gatk_bundle/hg19/1000G_phase1.indels.hg19.sites.vcf"
params.known_indels2 = "/projects/data/gatk_bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.sorted.vcf"
params.reference = false
params.dbsnp = false
params.known_indels1 = false
params.known_indels2 = false
params.intervals = false
params.hs_metrics_target_coverage = false
params.hs_metrics_per_base_coverage = false
@@ -16,7 +16,10 @@ params.skip_deduplication = false
params.skip_metrics = false
params.output = false
params.platform = "ILLUMINA"
params.collect_hs_metrics_min_base_quality = false
params.collect_hs_metrics_min_mapping_quality = false

// computational resources
params.prepare_bam_cpus = 3
params.prepare_bam_memory = "8g"
params.mark_duplicates_cpus = 16
@@ -29,63 +32,22 @@ params.bqsr_memory = "4g"


def helpMessage() {
log.info"""
Usage:
main.nf --input_files input_files
Input:
* --input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
Sample type will be added to the BAM header @SN sample name
The input file does not have a header!
Example input file:
name1 tumor tumor.1.bam
name1 normal normal.1.bam
name2 tumor tumor.2.bam
Optional input:
* --reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
* --dbsnp: path to the dbSNP VCF
* --known_indels1: path to a VCF of known indels
* --known_indels2: path to a second VCF of known indels
**NOTE**: if any of the above parameters is not provided, default hg19 resources under
/projects/data/gatk_bundle/hg19/ will be used
* --intervals: path to an intervals file to collect HS metrics from, this can be built with Picard's BedToIntervalList (default: None)
* --hs_metrics_target_coverage: name of output file for target HS metrics (default: None)
* --hs_metrics_per_base_coverage: name of output file for per base HS metrics (default: None)
* --skip_bqsr: optionally skip BQSR (default: false)
* --skip_realignment: optionally skip realignment (default: false)
* --skip_deduplication: optionally skip deduplication (default: false)
* --skip_metrics: optionally skip metrics (default: false)
* --output: the folder where to publish output (default: ./output)
* --platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
Computational resources:
* --prepare_bam_cpus: (default: 3)
* --prepare_bam_memory: (default: 8g)
* --mark_duplicates_cpus: (default: 16)
* --mark_duplicates_memory: (default: 64g)
* --realignment_around_indels_cpus: (default: 2)
* --realignment_around_indels_memory: (default: 32g)
* --bqsr_cpus: (default: 3)
* --bqsr_memory: (default: 4g)
Output:
* Preprocessed and indexed BAMs
* Tab-separated values file with the absolute paths to the preprocessed BAMs, preprocessed_bams.txt
Optional output:
* Recalibration report
* Realignment intervals
* Metrics
"""
log.info params.help_message
}

if (params.help) {
helpMessage()
exit 0
}

if (!params.reference) {
exit -1, "--reference is required"
}

if (!params.skip_bqsr && !params.dbsnp) {
exit -1, "--dbsnp is required to perform BQSR"
}

if (params.output) {
publish_dir = params.output
}
@@ -168,7 +130,7 @@ if (!params.skip_deduplication) {
file("${bam.baseName}.dedup_metrics") optional true into deduplication_metrics

script:
dedup_metrics = params.skip_metrics ? "--metrics-file ${bam.baseName}.dedup_metrics" : ""
dedup_metrics = params.skip_metrics ? "": "--metrics-file ${bam.baseName}.dedup_metrics"
"""
mkdir tmp
@@ -211,6 +173,10 @@ if (! params.skip_metrics) {
hs_metrics_per_base_coverage= params.hs_metrics_per_base_coverage ?
"--PER_BASE_COVERAGE ${params.hs_metrics_per_base_coverage}" :
""
minimum_base_quality = params.collect_hs_metrics_min_base_quality ?
"--MINIMUM_BASE_QUALITY ${params.collect_hs_metrics_min_base_quality}" : ""
minimum_mapping_quality = params.collect_hs_metrics_min_mapping_quality ?
"--MINIMUM_MAPPING_QUALITY ${params.collect_hs_metrics_min_mapping_quality}" : ""
"""
mkdir tmp
@@ -220,7 +186,7 @@ if (! params.skip_metrics) {
--OUTPUT ${bam.baseName} \
--TARGET_INTERVALS ${params.intervals} \
--BAIT_INTERVALS ${params.intervals} \
${hs_metrics_target_coverage} ${hs_metrics_per_base_coverage}
${hs_metrics_target_coverage} ${hs_metrics_per_base_coverage} ${minimum_base_quality} ${minimum_mapping_quality}
"""
}
}
@@ -272,26 +238,29 @@ if (!params.skip_realignment) {
set val(name), val(bam_name), val(type), file("${bam.baseName}.realigned.bam"), file("${bam.baseName}.realigned.bai") into realigned_bams
file("${bam.baseName}.RA.intervals") into realignment_intervals

script:
known_indels = (params.known_indels1 ? " --known ${params.known_indels1}" : "") +
    (params.known_indels2 ? " --known ${params.known_indels2}" : "")
known_alleles = (params.known_indels1 ? " --knownAlleles ${params.known_indels1}" : "") +
    (params.known_indels2 ? " --knownAlleles ${params.known_indels2}" : "")
"""
mkdir tmp
gatk3 -Xmx${params.realignment_around_indels_memory} -Djava.io.tmpdir=tmp -T RealignerTargetCreator \
--input_file ${bam} \
--out ${bam.baseName}.RA.intervals \
--reference_sequence ${params.reference} \
--known ${params.known_indels1} \
--known ${params.known_indels2}
${known_indels}
gatk3 -Xmx${params.realignment_around_indels_memory} -Djava.io.tmpdir=tmp -T IndelRealigner \
--input_file ${bam} \
--out ${bam.baseName}.realigned.bam \
--reference_sequence ${params.reference} \
--targetIntervals ${bam.baseName}.RA.intervals \
--knownAlleles ${params.known_indels1} \
--knownAlleles ${params.known_indels2} \
--consensusDeterminationModel USE_SW \
--LODThresholdForCleaning 0.4 \
--maxReadsInMemory 600000
--maxReadsInMemory 600000 \
${known_alleles}
"""
}
}
58 changes: 57 additions & 1 deletion nextflow.config
@@ -55,12 +55,68 @@ dag {
//file = "${params.output}/pipeline_dag.svg"
}

VERSION = '1.2.0'
DOI = 'https://zenodo.org/badge/latestdoi/358400957'

manifest {
name = 'TRON-Bioinformatics/tronflow-bam-preprocessing'
author = 'Pablo Riesgo Ferreiro'
homePage = 'https://github.com/TRON-Bioinformatics/tronflow-bam-preprocessing'
description = 'Picard and GATK BAM preprocessing pipeline'
mainScript = 'main.nf'
nextflowVersion = '>=19.10.0'
version = '1.1.0'
version = VERSION
doi = DOI
}

params.help_message = """
TronFlow bam preprocessing v${VERSION} ${DOI}
Usage:
main.nf --input_files input_files
Input:
* --input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
Sample type will be added to the BAM header @SN sample name
The input file does not have a header!
Example input file:
name1 tumor tumor.1.bam
name1 normal normal.1.bam
name2 tumor tumor.2.bam
* --reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
Optional input:
* --dbsnp: path to the dbSNP VCF (required to perform BQSR)
* --known_indels1: path to a VCF of known indels (optional to perform realignment around indels)
* --known_indels2: path to a second VCF of known indels (optional to perform realignment around indels)
* --intervals: path to an intervals file to collect HS metrics from, this can be built with Picard's BedToIntervalList (default: None)
* --hs_metrics_target_coverage: name of output file for target HS metrics (default: None)
* --hs_metrics_per_base_coverage: name of output file for per base HS metrics (default: None)
* --collect_hs_metrics_min_base_quality: minimum base quality for a base to contribute coverage (default: 20)
* --collect_hs_metrics_min_mapping_quality: minimum mapping quality for a read to contribute coverage (default: 20)
* --skip_bqsr: optionally skip BQSR (default: false)
* --skip_realignment: optionally skip realignment (default: false)
* --skip_deduplication: optionally skip deduplication (default: false)
* --skip_metrics: optionally skip metrics (default: false)
* --output: the folder where to publish output (default: ./output)
* --platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
Computational resources:
* --prepare_bam_cpus: (default: 3)
* --prepare_bam_memory: (default: 8g)
* --mark_duplicates_cpus: (default: 16)
* --mark_duplicates_memory: (default: 64g)
* --realignment_around_indels_cpus: (default: 2)
* --realignment_around_indels_memory: (default: 32g)
* --bqsr_cpus: (default: 3)
* --bqsr_memory: (default: 4g)
Output:
* Preprocessed and indexed BAMs
* Tab-separated values file with the absolute paths to the preprocessed BAMs, preprocessed_bams.txt
Optional output:
* Recalibration report
* Realignment intervals
* Metrics
"""
File renamed without changes.
File renamed without changes.
4 changes: 2 additions & 2 deletions test_data/test_input.txt
@@ -1,2 +1,2 @@
TESTX_H7YRLADXX_S1_L001 tumor test_data/TESTX_H7YRLADXX_S1_L001.bam
TESTX_H7YRLADXX_S1_L002 normal test_data/TESTX_H7YRLADXX_S1_L002.bam
TESTX_S1_L001 tumor test_data/TESTX_S1_L001.bam
TESTX_S1_L002 normal test_data/TESTX_S1_L002.bam
