From af9147ca97f17f8360bbfcec363fe7e1d14e36c6 Mon Sep 17 00:00:00 2001
From: priesgof
Date: Thu, 15 Apr 2021 23:05:21 +0200
Subject: [PATCH] improve documentation

---
 README.md | 19 +++++++++++++------
 main.nf   | 21 ++++++++++++++-------
 2 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index 635f327..2e476f4 100644
--- a/README.md
+++ b/README.md
@@ -12,9 +12,9 @@ GATK has been providing a well known best practices document on BAM preprocessin
 
 ## Objectives
 
-We aim at providing a single implementation of the BAM preprocessing pipeline that can be used across different situations. For this purpose there are some required steps and some optional steps. This is implemented as a Nextflow pipeline to simplify parallelization of execution in the cluster. The default configuration uses reference genome hg19, if another reference is needed the adequate resources must be provided. The reference genome resources for hg19 are installed in /projects/data/gatk_bundle/hg19 and they were downloaded from https://software.broadinstitute.org/gatk/download/bundle
+We aim at providing a single implementation of the BAM preprocessing pipeline that can be used across different situations. For this purpose there are some required steps and some optional steps. This is implemented as a Nextflow pipeline to simplify parallelization of execution in the cluster. The default configuration uses reference genome hg19, if another reference is needed the adequate resources must be provided. The reference genome resources for hg19 were downloaded from https://software.broadinstitute.org/gatk/download/bundle
 
-The input is a configuration file so multiple BAMs can run easily. The output is another tab-separated values file with the absolute paths of the preprocessed and indexed BAMs.
+The input is a tab-separated values file where each line corresponds to one input BAM. The output is another tab-separated values file with the absolute paths of the preprocessed and indexed BAMs.
 
 ## Implementation
 
@@ -22,10 +22,8 @@ Steps:
 
 * **Clean BAM**. Sets the mapping quality to 0 for all unmapped reads and avoids soft clipping going beyond the reference genome boundaries. Implemented in Picard
 * **Reorder chromosomes**. Makes the chromosomes in the BAM follow the same order as the reference genome. Implemented in Picard
-* **Sort by query name**. Ensuring the order by query name allows to find duplicates also in the unpaired and secondary alignment reads. Implemented in Picard
 * **Add read groups**. GATK requires that some headers are adde to the BAM, also we want to flag somehow the normal and tumor BAMs in the header as some callers, such as Mutect2 require it. Implemented in Picard.
- * **Mark duplicates** (optional). Identify the PCR and the optical duplications and marks those reads. Implemented in Picard
- * **Sort by coordinates**. This order is required by all GATK tools. Implemented in Picard
+ * **Mark duplicates** (optional). Identify the PCR and the optical duplications and marks those reads. This uses the parallelized version on Spark, it is reported to scale linearly up to 16 CPUs.
 * **Realignment around indels** (optional). This procedure is important for locus based variant callers, but for any variant caller doing haplotype assembly it is not needed. This is computing intensive as it first finds regions for realignment where there are indication of indels and then it performs a local realignment over those regions. Implemented in GATK3, deprecated in GATK4
 * **Base Quality Score Recalibration (BQSR)** (optional). It aims at correcting systematic errors in the sequencer when assigning the base call quality errors, as these scores are used by variant callers it improves variant calling in some situations. Implemented in GATK4
 
@@ -58,7 +56,15 @@ Optional input:
  * skip_bqsr: optionally skip BQSR
  * skip_realignment: optionally skip realignment
  * skip_deduplication: optionally skip deduplication
- * output: the folder where to publish output, if not provided they will be moved to "output" folder inside the workflow folder
+ * output: the folder where to publish output, if not provided they will be moved to "output" folder inside the workflow folder* prepare_bam_cpus: default 3
+ * platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
+ * prepare_bam_memory: default 8g
+ * mark_duplicates_cpus: default 16
+ * mark_duplicates_memory: default 64g
+ * realignment_around_indels_cpus: default 2
+ * realignment_around_indels_memory: default 32g
+ * bqsr_cpus: default 3
+ * bqsr_memory: default 4g
 
 Output:
  * Preprocessed and indexed BAMs
@@ -67,4 +73,5 @@ Optional input:
 Optional output:
  * Recalibration report
  * Realignment intervals
+ * Duplication metrics
 ```
diff --git a/main.nf b/main.nf
index ae03dde..8a19466 100755
--- a/main.nf
+++ b/main.nf
@@ -3,10 +3,10 @@ publish_dir = 'output'
 
 params.help= false
 params.input_files = false
-params.reference = "/projects/data/gatk_bundle/hg19/ucsc.hg19.fasta" // TODO: remove this hard coded bit
-params.dbsnp = "/projects/data/gatk_bundle/hg19/dbsnp_138.hg19.vcf" // TODO: remove this hard coded bit
-params.known_indels1 = "/projects/data/gatk_bundle/hg19/1000G_phase1.indels.hg19.sites.vcf" // TODO: remove this hard coded bit
-params.known_indels2 = "/projects/data/gatk_bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.sorted.vcf" // TODO: remove this hard coded bit
+params.reference = "/projects/data/gatk_bundle/hg19/ucsc.hg19.fasta"
+params.dbsnp = "/projects/data/gatk_bundle/hg19/dbsnp_138.hg19.vcf"
+params.known_indels1 = "/projects/data/gatk_bundle/hg19/1000G_phase1.indels.hg19.sites.vcf"
+params.known_indels2 = "/projects/data/gatk_bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.sorted.vcf"
 params.skip_bqsr = false
 params.skip_realignment = false
 params.skip_deduplication = false
@@ -15,10 +15,8 @@ params.platform = "ILLUMINA"
 
 params.prepare_bam_cpus = 3
 params.prepare_bam_memory = "8g"
-params.mark_duplicates_cpus = 8
+params.mark_duplicates_cpus = 16
 params.mark_duplicates_memory = "64g"
-params.skip_mark_duplicates_cpus = 1
-params.skip_mark_duplicates_memory = "4g"
 params.realignment_around_indels_cpus = 2
 params.realignment_around_indels_memory = "32g"
 params.bqsr_cpus = 3
@@ -51,6 +49,14 @@ Optional input:
  * skip_deduplication: optionally skip deduplication
  * output: the folder where to publish output
  * platform: the platform to be added to the BAM header. Valid values: [ILLUMINA, SOLID, LS454, HELICOS and PACBIO] (default: ILLUMINA)
+ * prepare_bam_cpus: default 3
+ * prepare_bam_memory: default 8g
+ * mark_duplicates_cpus: default 16
+ * mark_duplicates_memory: default 64g
+ * realignment_around_indels_cpus: default 2
+ * realignment_around_indels_memory: default 32g
+ * bqsr_cpus: default 3
+ * bqsr_memory: default 4g
 
 Output:
  * Preprocessed and indexed BAM
@@ -59,6 +65,7 @@ Output:
 
 Optional output:
  * Recalibration report
  * Realignment intervals
+ * Duplication metrics
 """
 }