Skip to content

Commit

Permalink
Merge branch 'develop' into 'master'
Browse files Browse the repository at this point in the history
Release v1.1.0

See merge request tron/tron-bam-preprocessing!6
  • Loading branch information
Pablo Riesgo Ferreiro committed May 4, 2021
2 parents 7b22cb6 + 20062b5 commit 930e294
Show file tree
Hide file tree
Showing 7 changed files with 881 additions and 69 deletions.
11 changes: 7 additions & 4 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,10 @@ clean:
rm -rf .nextflow*

test:
nextflow main.nf -profile test,conda --output output/test1
nextflow main.nf -profile test,conda --skip_bqsr --output output/test2
nextflow main.nf -profile test,conda --skip_realignment --output output/test3
nextflow main.nf -profile test,conda --skip_deduplication --output output/test4
#nextflow main.nf -profile test,conda --output output/test1
#nextflow main.nf -profile test,conda --skip_bqsr --output output/test2
#nextflow main.nf -profile test,conda --skip_realignment --output output/test3
#nextflow main.nf -profile test,conda --skip_deduplication --output output/test4
#nextflow main.nf -profile test,conda --output output/test5 --skip_metrics
#nextflow main.nf -profile test,conda --output output/test6 --intervals false
nextflow main.nf -profile test,conda --output output/test6 --hs_metrics_target_coverage target_coverage.txt --hs_metrics_per_base_coverage per_base_coverage.txt
64 changes: 41 additions & 23 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,23 +23,32 @@ Steps:
* **Clean BAM**. Sets the mapping quality to 0 for all unmapped reads and avoids soft clipping going beyond the reference genome boundaries. Implemented in Picard
* **Reorder chromosomes**. Makes the chromosomes in the BAM follow the same order as the reference genome. Implemented in Picard
* **Add read groups**. GATK requires that some headers are adde to the BAM, also we want to flag somehow the normal and tumor BAMs in the header as some callers, such as Mutect2 require it. Implemented in Picard.
* **Mark duplicates** (optional). Identify the PCR and the optical duplications and marks those reads. This uses the parallelized version on Spark, it is reported to scale linearly up to 16 CPUs.
* **Realignment around indels** (optional). This procedure is important for locus based variant callers, but for any variant caller doing haplotype assembly it is not needed. This is computing intensive as it first finds regions for realignment where there are indication of indels and then it performs a local realignment over those regions. Implemented in GATK3, deprecated in GATK4
* **Base Quality Score Recalibration (BQSR)** (optional). It aims at correcting systematic errors in the sequencer when assigning the base call quality errors, as these scores are used by variant callers it improves variant calling in some situations. Implemented in GATK4
* **Mark duplicates** (optional). Identify the PCR and the optical duplications and marks those reads. This uses the parallelized version on Spark, it is reported to scale linearly up to 16 CPUs.
* **Realignment around indels** (optional). This procedure is important for locus based variant callers, but for any variant caller doing haplotype assembly it is not needed. This is computing intensive as it first finds regions for realignment where there are indication of indels and then it performs a local realignment over those regions. Implemented in GATK3, deprecated in GATK4
* **Base Quality Score Recalibration (BQSR)** (optional). It aims at correcting systematic errors in the sequencer when assigning the base call quality errors, as these scores are used by variant callers it improves variant calling in some situations. Implemented in GATK4
* **Metrics** (optional). A number of metrics are obtained over the BAM file with Picard's CollectMetrics (eg: duplication, insert size, alignment, etc.).

![Pipeline](bam_preprocessing2.png)

## References

The bam preprocessing workflow use some required references (`--reference`, `--dbsnp`, `--known_indels1` and `--known_indels2`).
These resources can be fetched from the GATK bundle https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle.

Optionally, in order to run Picard's CollectHsMetrics an intervals file will need to be provided (`--intervals`).
This can be built from a BED file using Picard's BedToIntervalList (https://gatk.broadinstitute.org/hc/en-us/articles/360036883931-BedToIntervalList-Picard-)

## How to run it

```
$ nextflow run tron-bioinformatics/tronflow-bam-preprocessing -r v1.0.0 --help
$ nextflow run tron-bioinformatics/tronflow-bam-preprocessing -r v1.1.0 --help
N E X T F L O W ~ version 19.07.0
Launching `main.nf` [intergalactic_shannon] - revision: e707c77d7b
Usage:
main.nf --input_files input_files
Input:
* input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
* --input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
Sample type will be added to the BAM header @SN sample name
The input file does not have header!
Example input file:
Expand All @@ -48,23 +57,32 @@ Input:
name2 tumor tumor.2.bam
Optional input:
* reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
* dbsnp: path to the dbSNP VCF
* known_indels1: path to a VCF of known indels
* known_indels2: path to a second VCF of known indels
* NOTE: if any of the above parameters is not provided, default hg19 resources will be used
* skip_bqsr: optionally skip BQSR
* skip_realignment: optionally skip realignment
* skip_deduplication: optionally skip deduplication
* output: the folder where to publish output, if not provided they will be moved to "output" folder inside the workflow folder* prepare_bam_cpus: default 3
* platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
* prepare_bam_memory: default 8g
* mark_duplicates_cpus: default 16
* mark_duplicates_memory: default 64g
* realignment_around_indels_cpus: default 2
* realignment_around_indels_memory: default 32g
* bqsr_cpus: default 3
* bqsr_memory: default 4g
* --reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
* --dbsnp: path to the dbSNP VCF
* --known_indels1: path to a VCF of known indels
* --known_indels2: path to a second VCF of known indels
**NOTE**: if any of the above parameters is not provided, default hg19 resources under
/projects/data/gatk_bundle/hg19/ will be used
* --intervals: path to an intervals file to collect HS metrics from, this can be built with Picard's BedToIntervalList (default: None)
* --hs_metrics_target_coverage: name of output file for target HS metrics (default: None)
* --hs_metrics_per_base_coverage: name of output file for per base HS metrics (default: None)
* --skip_bqsr: optionally skip BQSR (default: false)
* --skip_realignment: optionally skip realignment (default: false)
* --skip_deduplication: optionally skip deduplication (default: false)
* --skip_metrics: optionally skip metrics (default: false)
* --output: the folder where to publish output (default: ./output)
* --platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
Computational resources:
* --prepare_bam_cpus: (default: 3)
* --prepare_bam_memory: (default: 8g)
* --mark_duplicates_cpus: (default: 16)
* --mark_duplicates_memory: (default: 64g)
* --realignment_around_indels_cpus: (default: 2)
* --realignment_around_indels_memory: (default: 32g)
* --bqsr_cpus: (default: 3)
* --bqsr_memory: (default: 4g)
Output:
* Preprocessed and indexed BAMs
Expand All @@ -73,5 +91,5 @@ Optional input:
Optional output:
* Recalibration report
* Realignment intervals
* Duplication metrics
* Metrics
```
2 changes: 1 addition & 1 deletion environment.yml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# You can use this file to create a conda environment for this pipeline:
# conda env create -f environment.yml
name: tronflow-bam-preprocessing-1.0.1
name: tronflow-bam-preprocessing-1.1.0
channels:
- conda-forge
- bioconda
Expand Down
161 changes: 121 additions & 40 deletions main.nf
Original file line number Diff line number Diff line change
Expand Up @@ -7,9 +7,13 @@ params.reference = "/projects/data/gatk_bundle/hg19/ucsc.hg19.fasta"
params.dbsnp = "/projects/data/gatk_bundle/hg19/dbsnp_138.hg19.vcf"
params.known_indels1 = "/projects/data/gatk_bundle/hg19/1000G_phase1.indels.hg19.sites.vcf"
params.known_indels2 = "/projects/data/gatk_bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.sorted.vcf"
params.intervals = false
params.hs_metrics_target_coverage = false
params.hs_metrics_per_base_coverage = false
params.skip_bqsr = false
params.skip_realignment = false
params.skip_deduplication = false
params.skip_metrics = false
params.output = false
params.platform = "ILLUMINA"

Expand All @@ -27,45 +31,53 @@ params.bqsr_memory = "4g"
def helpMessage() {
log.info"""
Usage:
bam_preprocessing.nf --input_files input_files --reference reference.fasta
main.nf --input_files input_files
Input:
* input_files: the path to a tab-separated values file containing in each row the sample name, sample type (tumor or normal) and path to the BAM file
* --input_files: the path to a tab-separated values file containing in each row the sample name, sample type (eg: tumor or normal) and path to the BAM file
Sample type will be added to the BAM header @SN sample name
The input file does not have header!
Example input file:
name1 tumor tumor.1.bam
name1 normal normal.1.bam
name2 tumor tumor.2.bam
name1 tumor tumor.1.bam
name1 normal normal.1.bam
name2 tumor tumor.2.bam
Optional input:
* reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
* dbsnp: path to the dbSNP VCF
* known_indels1: path to a VCF of known indels
* known_indels2: path to a second VCF of known indels
* NOTE: if any of the above parameters is not provided, default hg19 resources will be used
* skip_bqsr: optionally skip BQSR
* skip_realignment: optionally skip realignment
* skip_deduplication: optionally skip deduplication
* output: the folder where to publish output
* platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
* prepare_bam_cpus: default 3
* prepare_bam_memory: default 8g
* mark_duplicates_cpus: default 16
* mark_duplicates_memory: default 64g
* realignment_around_indels_cpus: default 2
* realignment_around_indels_memory: default 32g
* bqsr_cpus: default 3
* bqsr_memory: default 4g
Output:
* Preprocessed and indexed BAM
* --reference: path to the FASTA genome reference (indexes expected *.fai, *.dict)
* --dbsnp: path to the dbSNP VCF
* --known_indels1: path to a VCF of known indels
* --known_indels2: path to a second VCF of known indels
**NOTE**: if any of the above parameters is not provided, default hg19 resources under
/projects/data/gatk_bundle/hg19/ will be used
* --intervals: path to an intervals file to collect HS metrics from, this can be built with Picard's BedToIntervalList (default: None)
* --hs_metrics_target_coverage: name of output file for target HS metrics (default: None)
* --hs_metrics_per_base_coverage: name of output file for per base HS metrics (default: None)
* --skip_bqsr: optionally skip BQSR (default: false)
* --skip_realignment: optionally skip realignment (default: false)
* --skip_deduplication: optionally skip deduplication (default: false)
* --skip_metrics: optionally skip metrics (default: false)
* --output: the folder where to publish output (default: ./output)
* --platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
Computational resources:
* --prepare_bam_cpus: (default: 3)
* --prepare_bam_memory: (default: 8g)
* --mark_duplicates_cpus: (default: 16)
* --mark_duplicates_memory: (default: 64g)
* --realignment_around_indels_cpus: (default: 2)
* --realignment_around_indels_memory: (default: 32g)
* --bqsr_cpus: (default: 3)
* --bqsr_memory: (default: 4g)
Output:
* Preprocessed and indexed BAMs
* Tab-separated values file with the absolute paths to the preprocessed BAMs, preprocessed_bams.txt
Optional output:
* Recalibration report
* Realignment intervals
* Duplication metrics
* Metrics
"""
}

Expand Down Expand Up @@ -103,8 +115,10 @@ process prepareBam {
set name, type, file(bam) from input_files

output:
set val(name), val("${bam.baseName}"), val(type),
file("${bam.baseName}.prepared.bam"), file("${bam.baseName}.prepared.bai") into prepared_bams
set val(name),
val("${bam.baseName}"),
val(type), file("${bam.baseName}.prepared.bam"),
file("${bam.baseName}.prepared.bai") into prepared_bams, prepared_bams_for_metrics, prepared_bams_for_hs_metrics

"""
mkdir tmp
Expand All @@ -131,8 +145,6 @@ process prepareBam {
--RGPL ${params.platform} \
--SORT_ORDER coordinate \
--CREATE_INDEX true
rm -rf tmp
"""
}

Expand All @@ -145,16 +157,18 @@ if (!params.skip_deduplication) {
cpus "${params.mark_duplicates_cpus}"
memory "${params.mark_duplicates_memory}"
tag "${name}"
publishDir "${publish_dir}/${name}", mode: "copy", pattern: "*.dedup_metrics.txt"
publishDir "${publish_dir}/${name}/metrics", mode: "copy", pattern: "*.dedup_metrics"

input:
set name, bam_name, type, file(bam), file(bai) from prepared_bams

output:
set val(name), val(bam_name), val(type),
file("${bam.baseName}.dedup.bam"), file("${bam.baseName}.dedup.bam.bai") into deduplicated_bams
file("${bam.baseName}.dedup_metrics.txt") into deduplication_metrics
file("${bam.baseName}.dedup_metrics") optional true into deduplication_metrics

script:
dedup_metrics = params.skip_metrics ? "--metrics-file ${bam.baseName}.dedup_metrics" : ""
"""
mkdir tmp
Expand All @@ -163,16 +177,87 @@ if (!params.skip_deduplication) {
--input ${bam} \
--output ${bam.baseName}.dedup.bam \
--conf 'spark.executor.cores=${task.cpus}' \
--metrics-file ${bam.baseName}.dedup_metrics.txt
rm -rf tmp
${dedup_metrics}
"""
}
}
else {
deduplicated_bams = prepared_bams
}

if (! params.skip_metrics) {

if (params.intervals) {

process hsMetrics {
cpus 1
memory "2g"
tag "${name}"
publishDir "${publish_dir}/${name}/metrics", mode: "copy"

input:
set name, bam_name, type, file(bam), file(bai) from prepared_bams_for_hs_metrics

output:
file("*_metrics") optional true into txt_hs_metrics
file("*.pdf") optional true into pdf_hs_metrics
file(params.hs_metrics_target_coverage) optional true into target_hs_metrics
file(params.hs_metrics_per_base_coverage) optional true into per_base_hs_metrics

script:
hs_metrics_target_coverage= params.hs_metrics_target_coverage ?
"--PER_TARGET_COVERAGE ${params.hs_metrics_target_coverage} --REFERENCE_SEQUENCE ${params.reference}" :
""
hs_metrics_per_base_coverage= params.hs_metrics_per_base_coverage ?
"--PER_BASE_COVERAGE ${params.hs_metrics_per_base_coverage}" :
""
"""
mkdir tmp
gatk CollectHsMetrics \
--java-options '-Xmx2g -Djava.io.tmpdir=tmp' \
--INPUT ${bam} \
--OUTPUT ${bam.baseName} \
--TARGET_INTERVALS ${params.intervals} \
--BAIT_INTERVALS ${params.intervals} \
${hs_metrics_target_coverage} ${hs_metrics_per_base_coverage}
"""
}
}

process metrics {
cpus 1
memory "2g"
tag "${name}"
publishDir "${publish_dir}/${name}/metrics", mode: "copy"

input:
set name, bam_name, type, file(bam), file(bai) from prepared_bams_for_metrics

output:
file("*_metrics") optional true into txt_metrics
file("*.pdf") optional true into pdf_metrics

"""
mkdir tmp
gatk CollectMultipleMetrics \
--java-options '-Xmx2g -Djava.io.tmpdir=tmp' \
--INPUT ${bam} \
--OUTPUT ${bam.baseName} \
--REFERENCE_SEQUENCE ${params.reference} \
--PROGRAM QualityScoreDistribution \
--PROGRAM MeanQualityByCycle \
--PROGRAM CollectAlignmentSummaryMetrics \
--PROGRAM CollectBaseDistributionByCycle \
--PROGRAM CollectGcBiasMetrics \
--PROGRAM CollectInsertSizeMetrics \
--PROGRAM CollectSequencingArtifactMetrics \
--PROGRAM CollectSequencingArtifactMetrics
"""
}
}

if (!params.skip_realignment) {
process realignmentAroundindels {
cpus "${params.realignment_around_indels_cpus}"
Expand Down Expand Up @@ -207,8 +292,6 @@ if (!params.skip_realignment) {
--consensusDeterminationModel USE_SW \
--LODThresholdForCleaning 0.4 \
--maxReadsInMemory 600000
rm -rf tmp
"""
}
}
Expand Down Expand Up @@ -248,8 +331,6 @@ if (!params.skip_bqsr) {
--output ${bam_name}.preprocessed.bam \
--reference ${params.reference} \
--bqsr-recal-file ${bam_name}.recalibration_report.grp
rm -rf tmp
"""
}
}
Expand Down
5 changes: 4 additions & 1 deletion nextflow.config
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ profiles {
params.bqsr_memory = "3g"
params.known_indels1 = "$baseDir/test_data/1000G_phase1.indels.hg19.sites.minimal.vcf"
params.known_indels2 = "$baseDir/test_data/Mills_and_1000G_gold_standard.indels.hg19.sites.sorted.minimal.vcf"
params.intervals = "$baseDir/test_data/minimal_intervals.intervals"
params.dbsnp = "$baseDir/test_data/dbsnp_138.hg19.minimal.vcf"
}
}
Expand All @@ -35,6 +36,8 @@ env {
// Capture exit codes from upstream processes when piping
process.shell = ['/bin/bash', '-euo', 'pipefail']

cleanup = true

timeline {
enabled = true
//file = "${params.output}/execution_timeline.html"
Expand All @@ -59,5 +62,5 @@ manifest {
description = 'Picard and GATK BAM preprocessing pipeline'
mainScript = 'main.nf'
nextflowVersion = '>=19.10.0'
version = '1.0.1'
version = '1.1.0'
}
Loading

0 comments on commit 930e294

Please sign in to comment.