The pipeline has 8 steps. The core logic is in the stepN__*.py files; for many steps there is also a run_stepN__*.sh file with example shell commands for running that step.
steps 1, 2, 3: Do sample selection for each variant - specifically, take genotypes and metadata from the variant callset and generate one .tsv file per-sample. This .tsv file lists the 1,000s to 100,000s of variants for which that sample will provide a read visualization.
These 3 steps are implemented as hail scripts, intended to run on a dataproc cluster.
There are 3 steps instead of 1 because they require different dataproc cluster shapes:
- step1 is more like a typical hail pipeline and can run on auto-scaled preemptible nodes
- step2 mainly does a rekey operation with a large shuffle, so it runs better on higher-mem machines without auto-scaling
- step3 just writes out thousands of .tsvs, which hail currently doesn't parallelize, so this step is split across very many small hail clusters as a work-around
step 4: Generate a single .tsv file with 1 row per sample, listing the .tsv path from step3 as well as the .cram path. This is a python script intended to run on your laptop.
step 5: Actually run GATK HaplotypeCaller with --bamout on each sample to generate a per-sample .bam file with realigned reads. Use -L to restrict HaplotypeCaller to small windows around the variants selected for this sample in steps 1-3. This can be run as a Batch pipeline, or on Terra using a .wdl.
step 6: Similar to step 4 - generates another metadata .tsv, this time with .bamout paths instead of .cram paths.
step 7, 8: Process bamout bams from step 5 to remove all extraneous or sensitive info, and then pancake the bams to create fully-deidentified bams where read data can't be phased across variants. Also, generate sqlite dbs that let the gnomAD browser know how many readviz tracks are available for each variant, and which pancaked bam files contain which tracks.
For each variant, select the ~10 sample ids to use for readviz tracks for each genotype (hom, het, hemi). Output a hail table that lists these sample ids for each variant.
Inputs:
- gs://gnomad/metadata/genomes_v3.1/gnomad_v3.1_sample_qc_metadata.ht - sample metadata used for selecting samples that have crams available and are release=True. Also used to get sample sex to determine which genotypes are hemi.
- gs://gnomad/raw/genomes/3.1/gnomad_v3.1_sparse_unsplit.repartitioned.mt - matrix table with genotypes
Outputs:
- gs://gnomad-readviz/v3_and_v3.1/gnomad_v3_1_readviz_crams.ht - has the schema:

      ----------------------------------------
      Row fields:
          'locus': locus<GRCh38>
          'alleles': array<str>
          'samples_w_het_var': array<struct { S: str, GQ: int32 }>
          'samples_w_hom_var': array<struct { S: str, GQ: int32 }>
          'samples_w_hemi_var': array<struct { S: str, GQ: int32 }>
      ----------------------------------------
      Key: ['locus', 'alleles']
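For reference, here is a minimal hail sketch of this selection step. It assumes split biallelic genotypes and hypothetical metadata field names (release, cram, sex_karyotype); the real step1 script also has to handle the sparse, unsplit input MatrixTable.

```python
import hail as hl

meta_ht = hl.read_table("gs://gnomad/metadata/genomes_v3.1/gnomad_v3.1_sample_qc_metadata.ht")
mt = hl.read_matrix_table("gs://gnomad/raw/genomes/3.1/gnomad_v3.1_sparse_unsplit.repartitioned.mt")
# ... split multi-allelics / densify as needed (not shown) ...

# Keep only release samples that have a cram available (field names are assumptions).
mt = mt.filter_cols(meta_ht[mt.s].release & hl.is_defined(meta_ht[mt.s].cram))
mt = mt.annotate_cols(is_xy=meta_ht[mt.s].sex_karyotype == "XY")

is_hemi_locus = mt.locus.in_x_nonpar() | mt.locus.in_y_nonpar()

def take_samples(keep_entry):
    # Take up to 10 (sample id, GQ) pairs per genotype category, preferring high GQ.
    return hl.agg.filter(keep_entry, hl.agg.take(hl.struct(S=mt.s, GQ=mt.GQ), 10, ordering=-mt.GQ))

ht = mt.annotate_rows(
    samples_w_het_var=take_samples(mt.GT.is_het()),
    samples_w_hom_var=take_samples(mt.GT.is_hom_var() & ~(is_hemi_locus & mt.is_xy)),
    samples_w_hemi_var=take_samples(mt.GT.is_hom_var() & is_hemi_locus & mt.is_xy),
).rows()

ht.select("samples_w_het_var", "samples_w_hom_var", "samples_w_hemi_var").write(
    "gs://gnomad-readviz/v3_and_v3.1/gnomad_v3_1_readviz_crams.ht", overwrite=True
)
```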
To prepare for writing out per-sample .tsvs, rekey the hail table from step1 so that it is keyed by sample instead of by variant.
Inputs:
- gs://gnomad/readviz/genomes_v3/gnomad_v3_readviz_crams.ht - generated by step 1
Outputs:
- gs://gnomad-readviz/v3_and_v3.1/gnomad_v3_1_readviz_crams_exploded_with_key.ht/ - intermediate checkpoint file
- gs://gnomad-readviz/v3_and_v3.1/gnomad_v3_1_readviz_crams_exploded_keyed_by_sample.ht/ - has the schema:

      ----------------------------------------
      Row fields:
          'ref': str
          'alt': str
          'het_or_hom_or_hemi': int32
          'GQ': int32
          'S': str
          'chrom': str
          'pos': int32
      ----------------------------------------
      Key: ['S']
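A minimal sketch of the explode + rekey, assuming the step1 schema shown above (the 1/2/3 integer coding of het_or_hom_or_hemi and the checkpoint placement are assumptions):

```python
import hail as hl

ht = hl.read_table("gs://gnomad-readviz/v3_and_v3.1/gnomad_v3_1_readviz_crams.ht")

# Combine the three per-genotype arrays into one array of (S, GQ, category) structs,
# then explode so there is one row per (variant, sample) pair.
ht = ht.annotate(
    samples=ht.samples_w_het_var.map(lambda s: s.annotate(het_or_hom_or_hemi=1))
    .extend(ht.samples_w_hom_var.map(lambda s: s.annotate(het_or_hom_or_hemi=2)))
    .extend(ht.samples_w_hemi_var.map(lambda s: s.annotate(het_or_hom_or_hemi=3)))
)
ht = ht.explode(ht.samples)
ht = ht.select(
    ref=ht.alleles[0],
    alt=ht.alleles[1],
    het_or_hom_or_hemi=ht.samples.het_or_hom_or_hemi,
    GQ=ht.samples.GQ,
    S=ht.samples.S,
    chrom=ht.locus.contig,
    pos=ht.locus.position,
)
# Intermediate checkpoint (exact placement in the real script may differ).
ht = ht.checkpoint("gs://gnomad-readviz/v3_and_v3.1/gnomad_v3_1_readviz_crams_exploded_with_key.ht/")

# The rekey is the expensive part - it shuffles the whole table, which is why this
# step runs better on higher-mem, non-autoscaled workers.
ht = ht.key_by("S").drop("locus", "alleles")
ht.write(
    "gs://gnomad-readviz/v3_and_v3.1/gnomad_v3_1_readviz_crams_exploded_keyed_by_sample.ht/",
    overwrite=True,
)
```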
Takes the keyed-by-sample hail table output in step 2, and writes out a tsv for each sample, containing the list of variants that will be used for readviz from that sample. Hail currently doesn't parallelize .tsv output, so this step is split across very many small hail clusters as a work-around (see run_step3__export_per_sample_tsvs.sh).
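Roughly, each of those small clusters does something like the following for its shard of samples (sample_ids_in_this_shard and the output bucket layout are hypothetical):

```python
import hail as hl

ht = hl.read_table("gs://gnomad-readviz/v3_and_v3.1/gnomad_v3_1_readviz_crams_exploded_keyed_by_sample.ht/")

for sample_id in sample_ids_in_this_shard:  # hypothetical: the subset of samples assigned to this cluster
    # Filtering on the table key is cheap; export writes one bgzipped .tsv per sample.
    sample_ht = ht.filter(ht.S == sample_id)
    sample_ht.export(f"gs://gnomad-readviz/v3_and_v3.1/per_sample_tsvs/{sample_id}.tsv.bgz")  # hypothetical path
```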
To prepare for downstream steps, generate a single .tsv with 1 row per sample containing the paths of that sample's cram and tsv file (from step3). The column names are:
sample_id, variants_tsv_bgz, cram_path, crai_path
NOTE: The #%% markers in the code allow this to be run as an interactive notebook when using IntelliJ Scientific Mode.
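A hedged sketch of how such a table could be assembled with pandas (the cram_paths lookup, bucket layout, and output filename are hypothetical):

```python
import pandas as pd

# cram_paths: hypothetical {sample_id: cram_path} lookup; how crams are located is not shown here.
rows = [
    {
        "sample_id": sample_id,
        "variants_tsv_bgz": f"gs://gnomad-readviz/v3_and_v3.1/per_sample_tsvs/{sample_id}.tsv.bgz",  # from step3
        "cram_path": cram_path,
        "crai_path": f"{cram_path}.crai",
    }
    for sample_id, cram_path in cram_paths.items()
]

pd.DataFrame(rows, columns=["sample_id", "variants_tsv_bgz", "cram_path", "crai_path"]).to_csv(
    "step4_output__cram_and_tsv_paths.tsv", sep="\t", index=False
)
```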
This is the step that actually runs HaplotypeCaller to generate a bamout for each sample, using an interval list with small windows around the variants selected for that sample. The step5 python file is a Batch pipeline. It uses the docker image built from docker/Dockerfile by running cd docker; make all.
Instead of using this Batch pipeline, step5 can also be done on Terra (for convenience or cost reasons, depending on how Batch and Terra pricing evolves). For the Terra/cromwell pipeline, there's a .wdl in terra/wdl/GetReadVizData.wdl.
Both the Batch pipeline and Terra take the .tsv generated in step4 to determine the pipeline steps - each row in that .tsv will correspond to a parallel job that runs HaplotypeCaller on the raw cram for a single sample, after using that sample's selected variants tsv (generated in step3) to compute the interval list to pass to HaplotypeCaller using -L.
NOTE: it can be important to update the docker image and pipeline to use the same GATK version and HaplotypeCaller arguments that were used to generate the variant callset. I typically check the HaplotypeCaller version in a gvcf header, and double-check with whoever generated the callset that there weren't any unusual non-default HaplotypeCaller args used.
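To illustrate the per-sample work, here is a hedged sketch of building a padded interval list from a sample's step3 tsv and the kind of HaplotypeCaller command each job then runs. The window size, file names, and extra GATK args are assumptions, not the exact pipeline values.

```python
import gzip

PADDING = 200  # bp of padding around each variant (assumption)

def write_interval_list(variants_tsv_bgz: str, intervals_path: str) -> None:
    """Convert a sample's selected-variants tsv (with chrom/pos columns) into a GATK -L intervals file."""
    with gzip.open(variants_tsv_bgz, "rt") as f, open(intervals_path, "w") as out:
        header = f.readline().rstrip("\n").split("\t")
        for line in f:
            fields = dict(zip(header, line.rstrip("\n").split("\t")))
            chrom, pos = fields["chrom"], int(fields["pos"])
            out.write(f"{chrom}:{max(1, pos - PADDING)}-{pos + PADDING}\n")

write_interval_list("sample1.tsv.bgz", "sample1.intervals")

# The command each Batch job (or Terra task) runs - match the GATK version and
# HaplotypeCaller args that were used to generate the callset (see NOTE above).
gatk_command = (
    "gatk HaplotypeCaller "
    "-R Homo_sapiens_assembly38.fasta "
    "-I sample1.cram "
    "-L sample1.intervals "
    "-bamout sample1.bamout.bam "
    "-O sample1.gvcf.gz -ERC GVCF"
)
```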
Like step4, this step interactively generates a single tsv with per-sample file paths to pass into steps 7 & 8. Unlike step4, it contains paths for bamouts generated in step5 instead of the raw crams. The step6 output table contains the following columns:

sample_id, output_bamout_bam, output_bamout_bai, variants_tsv_bgz, exclude_variants_tsv_bgz
Also, if issues are later found in readviz visualizations that require certain samples and/or variants to be removed, this step allows for that filtering. Samples that are excluded from the output table will not end up in the final output produced by step8. In addition, the exclude_variants_tsv_bgz column optionally allows another per-sample .tsv to provide a list of variants that should be discarded. This variant filtering happens in step7 as the bam is being reprocessed to strip metadata, readgroup names, etc.
NOTE: The #%% markers in the code allow this to be run as an interactive notebook when using IntelliJ Scientific Mode.
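A hedged pandas sketch of assembling the step6 table, with hypothetical samples_to_exclude and exclude_variants_tsvs inputs and an assumed bamout bucket layout:

```python
import pandas as pd

df = pd.read_table("step4_output__cram_and_tsv_paths.tsv")  # the step4 output sketched above
df["output_bamout_bam"] = "gs://gnomad-readviz/v3_and_v3.1/bamouts/" + df["sample_id"] + ".bamout.bam"  # assumed layout
df["output_bamout_bai"] = df["output_bamout_bam"] + ".bai"

# Samples dropped here never reach the final step8 output.
df = df[~df["sample_id"].isin(samples_to_exclude)]  # hypothetical set of problematic samples

# Optional per-sample list of variants to discard during step7 reprocessing (NaN when there is none).
df["exclude_variants_tsv_bgz"] = df["sample_id"].map(exclude_variants_tsvs)  # hypothetical dict

df[[
    "sample_id", "output_bamout_bam", "output_bamout_bai",
    "variants_tsv_bgz", "exclude_variants_tsv_bgz",
]].to_csv("step6_output__bamout_and_tsv_paths.tsv", sep="\t", index=False)
```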
This is another Batch pipeline that takes each bamout from step5 and runs the deidentify_bamout.py script on it. The script removes extraneous or sensitive metadata, obfuscates and shortens read names, and discards all read tags except an opaque read group that is unique to each variant and sample.
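A simplified, hedged sketch of that kind of cleanup using pysam (the real deidentify_bamout.py assigns a separate read group per variant and also records sqlite bookkeeping, none of which is shown here):

```python
import pysam

def deidentify(in_bam: str, out_bam: str, read_group_id: str) -> None:
    with pysam.AlignmentFile(in_bam, "rb") as bam_in:
        # Rebuild a minimal header: keep only the sequence dictionary plus one opaque read group.
        header = {
            "SQ": bam_in.header.to_dict()["SQ"],
            "RG": [{"ID": read_group_id, "SM": read_group_id}],
        }
        with pysam.AlignmentFile(out_bam, "wb", header=header) as bam_out:
            for i, read in enumerate(bam_in):
                read.query_name = str(i)                 # obfuscate and shorten read names
                read.set_tags([("RG", read_group_id)])   # drop all other read tags
                bam_out.write(read)

deidentify("sample1.bamout.bam", "sample1.deidentified.bam", read_group_id="rg_0")
```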
This is the last Batch pipeline - it pancakes the single-sample deidentified bams from step7 into bams with many (currently 50) samples, so that it's not possible to tell which variant came from which sample. Also, it combines the per-sample sqlite databases from step7 into 1 database per chromosome. Both the output bams and sqlite databases can be made public, and are directly accessed by the gnomAD browser.
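For illustration, here is a hedged sketch of the kind of per-chromosome sqlite lookup these databases provide (the table and column names are assumptions, not the browser's actual schema):

```python
import sqlite3

conn = sqlite3.connect("readviz_tracks.chr1.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS variant_tracks (
        chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,
        het_or_hom_or_hemi INTEGER,
        combined_bam_path TEXT,     -- which pancaked bam contains this track
        read_group_id TEXT          -- which read group within that bam
    )
""")
# Placeholder row: one readviz track for one variant, pointing at its pancaked bam.
conn.execute(
    "INSERT INTO variant_tracks VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("chr1", 12345, "A", "T", 1, "gs://.../combined_bams/chr1/combined_000.bam", "rg_0"),
)
conn.commit()
conn.close()
```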