mitochondrial-calling

Collection of configurations for GATK mitochondrial (MT) small variant calling from short-read WGS alignments.

Currently using the GATK Mitochondria Pipeline 4.5.0.0 WDL. Testing was done with Google Terra and with a local instance of Cromwell with a docker backend.

Best practices were followed in accordance with GATK's best practice workflows

The tool runs as is with Google Terra with the only required configuration being a path to the input BAM or CRAM file.

In summary, this pipeline performs the following tasks:

Subset the mitochondrial reads from a WGS BAM/CRAM
Realign the BAM or CRAM against to the MT and the shifted MT genome with BWA
Carry over BQSR read tags from the WGS BAM/CRAM to the MT BAM
Call variants with Mutect2 on the both the original mitochondrial BAM/CRAM and shifted mitochondrial BAM/CRAM
Merge both variant calls into a consensus VCF using a liftover file for the shifted mitochondrial genome
Apply masking using a blacklists file

Prerequisites

For a local runtime, a cromwell installation is required. See the Cromwell readthedocs
A backend such as Docker or Singularity is also required to use cromwell.
The following reference genome is also needed for the annotation step.
A local GATK installation for running VariantAnnotator
A samtools installation for getting MT coverage

Outside of the WDL the following annotations are performed

gnomAD v3.1 sites
clinvar_20240407
mitomap_disease
tAPOGEE_2024.0.1 Note that MITOMAP disease requires reformatting with PicardTools and tAPOGEE needs to be restructured as a VCF to work with GATK VariantAnnotator

For MITOMAP Disease, make the following edits:

Add the following FORMAT fields to the header with your favorite text editor:

##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">

##FORMAT=<ID=FT,Number=.,Type=String,Description="Genotype-level filter">

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">

For tAPOGEE, make the following edits with your favorite spreadsheet editor:

For the INFO column, create the following excel formula to populate it accordingly. Fill the formula down.

="Gene_symbol="&E2&";tAPOGEE_score="&F2&";tAPOGEE_unbiased_score="&G2

Insert new columns for "ID", "QUAL", and "FILTER" in the following order and fill them down with the '.' character.

#CHROM POS ID REF ALT QUAL FILTER INFO

Delete the "Gene_symbol", "t-APOGEE_score", and "t-APOGEE_unbiased_score" columns.
Add the following VCF header:

##fileformat=VCFv4.2

##FILTER=<ID=PASS,Description="All filters passed">

##fileDate=20240503

##source=https://mitimpact.css-mendel.it/cdn/tAPOGEE_2024.0.1.txt.zip

##reference=https://www.ncbi.nlm.nih.gov/nuccore/251831106

##contig=<ID=chrM,length=16569,assembly=hg38>

##INFO=<ID=Gene_symbol,Number=1,Type=String,Description="Gene symbol">

##INFO=<ID=tAPOGEE_score,Number=1,Type=Float,Description="tAPOGEE score">

##INFO=<ID=tAPOGEE_unbiased_score,Number=1,Type=Float,Description="tAPOGEE unbiased score">

Save as '.txt' file. Make sure it stays tab-delimited.
Rename the file extension as .vcf with your file explorer. Use a tool like dos2unix to edit the line endings.

Running m2_anno.sh in your local cromwell instance

Download the WDL ZIP folder from Dockstore.
Unzip it in your current working directory.
Download the WDL references to a path on your local instance.
Edit the input JSON template so the file paths match their actual locations.
Make an input JSON folder for each sample's input JSON using the template provided in this repository.
Edit the variables for cromwell and refPath to match your installation
sh m2_anno.sh intput_json where input_json is the path to the input JSON folder
When complete, this will create an output_vcf folder where final VCFs and annotated VCFs go.

Running mito_report.py to create an excel report for all samples with participant phenotypes

Download the aligned_dna_short_read.tsv, analyte.tsv, and participant.tsv from the latest upload set for GREGoR AnVIL (as of Oct 23, 2024 that would be U08).
Run samtools coverage to get mitochondrial coverage metrics from the WGS BAMs/CRAMs
For CRAMs this requires the --reference argument with the path to the reference FASTA
samtools coverage --reference GRCh38.fa -o wgs_metrics/${sample_id}.coverage.txt ${sample_id}.cram
Use conda to change environments to one with pandas and vcfpy installed.
python3 mito_report.py -p participant.tsv -a analyte.tsv -d aligned_dna_short_read.tsv -c wgs_metrics -i output_vcf -o mito.xlsx
This will print out an excel table with cohort information and all variant calls compiled.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
LICENSE		LICENSE
README.md		README.md
environment.yaml		environment.yaml
input_template.json		input_template.json
m2_anno.sh		m2_anno.sh
mito_report.py		mito_report.py
template.json		template.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mitochondrial-calling

Prerequisites

Running m2_anno.sh in your local cromwell instance

Running mito_report.py to create an excel report for all samples with participant phenotypes

About

Releases

Packages

Languages

License

UCI-GREGoR/mitochondria_m2_wdl

Folders and files

Latest commit

History

Repository files navigation

mitochondrial-calling

Prerequisites

Running m2_anno.sh in your local cromwell instance

Running mito_report.py to create an excel report for all samples with participant phenotypes

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages