Collection of configurations for GATK mitochondrial (MT) small variant calling from short-read WGS alignments.
Currently using the GATK Mitochondria Pipeline 4.5.0.0 WDL. Testing was done with Google Terra and with a local instance of Cromwell with a docker backend.
Best practices were followed in accordance with GATK's best practice workflows
The tool runs as is with Google Terra with the only required configuration being a path to the input BAM or CRAM file.
In summary, this pipeline performs the following tasks:
- Subset the mitochondrial reads from a WGS BAM/CRAM
- Realign the BAM or CRAM against to the MT and the shifted MT genome with BWA
- Carry over BQSR read tags from the WGS BAM/CRAM to the MT BAM
- Call variants with Mutect2 on the both the original mitochondrial BAM/CRAM and shifted mitochondrial BAM/CRAM
- Merge both variant calls into a consensus VCF using a liftover file for the shifted mitochondrial genome
- Apply masking using a blacklists file
- For a local runtime, a cromwell installation is required. See the Cromwell readthedocs
- A backend such as Docker or Singularity is also required to use cromwell.
- The following reference genome is also needed for the annotation step.
- A local GATK installation for running VariantAnnotator
- A samtools installation for getting MT coverage
Outside of the WDL the following annotations are performed
- gnomAD v3.1 sites
- clinvar_20240407
- mitomap_disease
- tAPOGEE_2024.0.1 Note that MITOMAP disease requires reformatting with PicardTools and tAPOGEE needs to be restructured as a VCF to work with GATK VariantAnnotator
For MITOMAP Disease, make the following edits:
- Add the following FORMAT fields to the header with your favorite text editor:
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=FT,Number=.,Type=String,Description="Genotype-level filter">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
For tAPOGEE, make the following edits with your favorite spreadsheet editor:
- For the INFO column, create the following excel formula to populate it accordingly. Fill the formula down.
="Gene_symbol="&E2&";tAPOGEE_score="&F2&";tAPOGEE_unbiased_score="&G2
- Insert new columns for "ID", "QUAL", and "FILTER" in the following order and fill them down with the '.' character.
#CHROM POS ID REF ALT QUAL FILTER INFO
-
Delete the "Gene_symbol", "t-APOGEE_score", and "t-APOGEE_unbiased_score" columns.
-
Add the following VCF header:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20240503
##source=https://mitimpact.css-mendel.it/cdn/tAPOGEE_2024.0.1.txt.zip
##reference=https://www.ncbi.nlm.nih.gov/nuccore/251831106
##contig=<ID=chrM,length=16569,assembly=hg38>
##INFO=<ID=Gene_symbol,Number=1,Type=String,Description="Gene symbol">
##INFO=<ID=tAPOGEE_score,Number=1,Type=Float,Description="tAPOGEE score">
##INFO=<ID=tAPOGEE_unbiased_score,Number=1,Type=Float,Description="tAPOGEE unbiased score">
-
Save as '.txt' file. Make sure it stays tab-delimited.
-
Rename the file extension as .vcf with your file explorer. Use a tool like dos2unix to edit the line endings.
- Download the WDL ZIP folder from Dockstore.
- Unzip it in your current working directory.
- Download the WDL references to a path on your local instance.
- Edit the input JSON template so the file paths match their actual locations.
- Make an input JSON folder for each sample's input JSON using the template provided in this repository.
- Edit the variables for
cromwell
andrefPath
to match your installation sh m2_anno.sh intput_json
whereinput_json
is the path to the input JSON folder- When complete, this will create an
output_vcf
folder where final VCFs and annotated VCFs go.
- Download the
aligned_dna_short_read.tsv
,analyte.tsv
, andparticipant.tsv
from the latest upload set for GREGoR AnVIL (as of Oct 23, 2024 that would be U08). - Run
samtools coverage
to get mitochondrial coverage metrics from the WGS BAMs/CRAMs - For CRAMs this requires the
--reference
argument with the path to the reference FASTA samtools coverage --reference GRCh38.fa -o wgs_metrics/${sample_id}.coverage.txt ${sample_id}.cram
- Use conda to change environments to one with pandas and vcfpy installed.
python3 mito_report.py -p participant.tsv -a analyte.tsv -d aligned_dna_short_read.tsv -c wgs_metrics -i output_vcf -o mito.xlsx
- This will print out an excel table with cohort information and all variant calls compiled.