The pipeline has 8 steps. The core logic is in the stepN__*.py files; for many steps there is also a run_stepN__*.sh file with example shell commands for running that step.
steps 1, 2, 3: Do sample selection for each variant - specifically, take genotypes and metadata from the variant callset and generate one .tsv file per-sample. This .tsv file lists the 1,000s to 100,000s of variants for which that sample will provide a read visualization.
These 3 steps are implemented as hail scripts, intended to run on a dataproc cluster.
There are 3 steps instead of 1 because they require different dataproc cluster shapes:
- step1 is more like a typical hail pipeline and can run on auto-scaled preemptible nodes
- step2 mainly does a rekey operation with a large shuffle, so it runs better on higher-mem machines without auto-scaling
- step3 just writes out thousands of .tsvs, which hail currently doesn't parallelize, so this step is split across very many small hail clusters as a work-around
step 4: Generate a single .tsv file with 1 row per sample, listing the .tsv path from step3 as well as the .cram path. This is a python script intended to run on your laptop.
step 5: Actually run GATK HaplotypeCaller with --bamout on each sample to generate a per-sample .bam file with realigned reads. Use -L to restrict HaplotypeCaller to small windows around the variants selected for this sample in steps 1-3. This can be run as a Batch pipeline, or on Terra using a .wdl.
step 6: Similar to step 4 - generates another metadata .tsv, this time with .bamout paths instead of .cram paths.
step 7, 8: Process bamout bams from step 5 to remove all extraneous or sensitive info, and then pancake the bams to create fully-deidentified bams where read data can't be phased across variants. Also, generate sqlite dbs that let the gnomAD browser know how many readviz tracks are available for each variant, and which pancaked bam files contain which tracks.
For each variant, select the ~10 sample ids to use for readviz tracks for each genotype (hom, het, hemi). Output a hail table that lists these sample ids for each variant.
Inputs:
- gs://gnomad/metadata/genomes_v3.1/gnomad_v3.1_sample_qc_metadata.ht - sample metadata used for selecting samples that have crams available and are release=True. Also used to get sample sex to determine which genotypes are hemi.
- gs://gnomad/raw/genomes/3.1/gnomad_v3.1_sparse_unsplit.repartitioned.mt - matrix table with genotypes
Outputs:
- gs://gnomad-readviz/v3_and_v3.1/gnomad_v3_1_readviz_crams.ht - has the schema:

      ----------------------------------------
      Row fields:
          'locus': locus<GRCh38>
          'alleles': array<str>
          'samples_w_het_var': array<struct { S: str, GQ: int32 }>
          'samples_w_hom_var': array<struct { S: str, GQ: int32 }>
          'samples_w_hemi_var': array<struct { S: str, GQ: int32 }>
      ----------------------------------------
      Key: ['locus', 'alleles']
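For reference, here is a minimal hail sketch of this selection step. It assumes split biallelic genotypes and hypothetical metadata field names (release, cram, sex_karyotype); the real step1 script also has to handle the sparse, unsplit input MatrixTable.

```python
import hail as hl

meta_ht = hl.read_table("gs://gnomad/metadata/genomes_v3.1/gnomad_v3.1_sample_qc_metadata.ht")
mt = hl.read_matrix_table("gs://gnomad/raw/genomes/3.1/gnomad_v3.1_sparse_unsplit.repartitioned.mt")
# ... split multi-allelics / densify as needed (not shown) ...

# Keep only release samples that have a cram available (field names are assumptions).
mt = mt.filter_cols(meta_ht[mt.s].release & hl.is_defined(meta_ht[mt.s].cram))
mt = mt.annotate_cols(is_xy=meta_ht[mt.s].sex_karyotype == "XY")

is_hemi_locus = mt.locus.in_x_nonpar() | mt.locus.in_y_nonpar()

def take_samples(keep_entry):
    # Take up to 10 (sample id, GQ) pairs per genotype category, preferring high GQ.
    return hl.agg.filter(keep_entry, hl.agg.take(hl.struct(S=mt.s, GQ=mt.GQ), 10, ordering=-mt.GQ))

ht = mt.annotate_rows(
    samples_w_het_var=take_samples(mt.GT.is_het()),
    samples_w_hom_var=take_samples(mt.GT.is_hom_var() & ~(is_hemi_locus & mt.is_xy)),
    samples_w_hemi_var=take_samples(mt.GT.is_hom_var() & is_hemi_locus & mt.is_xy),
).rows()

ht.select("samples_w_het_var", "samples_w_hom_var", "samples_w_hemi_var").write(
    "gs://gnomad-readviz/v3_and_v3.1/gnomad_v3_1_readviz_crams.ht", overwrite=True
)
```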
To prepare for writing out per-sample .tsvs, rekey the hail table from step1 so that it is keyed by sample instead of by variant.
Inputs:
- gs://gnomad/readviz/genomes_v3/gnomad_v3_readviz_crams.ht - generated by step 1
Outputs:
- gs://gnomad-readviz/v3_and_v3.1/gnomad_v3_1_readviz_crams_exploded_with_key.ht/ - intermediate checkpoint file
- gs://gnomad-readviz/v3_and_v3.1/gnomad_v3_1_readviz_crams_exploded_keyed_by_sample.ht/ - has the schema:

      ----------------------------------------
      Row fields:
          'ref': str
          'alt': str
          'het_or_hom_or_hemi': int32
          'GQ': int32
          'S': str
          'chrom': str
          'pos': int32
      ----------------------------------------
      Key: ['S']
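A minimal sketch of the explode + rekey, assuming the step1 schema shown above (the 1/2/3 integer coding of het_or_hom_or_hemi and the checkpoint placement are assumptions):

```python
import hail as hl

ht = hl.read_table("gs://gnomad-readviz/v3_and_v3.1/gnomad_v3_1_readviz_crams.ht")

# Combine the three per-genotype arrays into one array of (S, GQ, category) structs,
# then explode so there is one row per (variant, sample) pair.
ht = ht.annotate(
    samples=ht.samples_w_het_var.map(lambda s: s.annotate(het_or_hom_or_hemi=1))
    .extend(ht.samples_w_hom_var.map(lambda s: s.annotate(het_or_hom_or_hemi=2)))
    .extend(ht.samples_w_hemi_var.map(lambda s: s.annotate(het_or_hom_or_hemi=3)))
)
ht = ht.explode(ht.samples)
ht = ht.select(
    ref=ht.alleles[0],
    alt=ht.alleles[1],
    het_or_hom_or_hemi=ht.samples.het_or_hom_or_hemi,
    GQ=ht.samples.GQ,
    S=ht.samples.S,
    chrom=ht.locus.contig,
    pos=ht.locus.position,
)
# Intermediate checkpoint (exact placement in the real script may differ).
ht = ht.checkpoint("gs://gnomad-readviz/v3_and_v3.1/gnomad_v3_1_readviz_crams_exploded_with_key.ht/")

# The rekey is the expensive part - it shuffles the whole table, which is why this
# step runs better on higher-mem, non-autoscaled workers.
ht = ht.key_by("S").drop("locus", "alleles")
ht.write(
    "gs://gnomad-readviz/v3_and_v3.1/gnomad_v3_1_readviz_crams_exploded_keyed_by_sample.ht/",
    overwrite=True,
)
```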
Takes the keyed-by-sample hail table output in step 2, and writes out a tsv for each sample, containing the list of variants that will be used for readviz from that sample. Hail currently doesn't parallelize .tsv output, so this step is split across very many small hail clusters as a work-around (see run_step3__export_per_sample_tsvs.sh).
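Roughly, each of those small clusters does something like the following for its shard of samples (sample_ids_in_this_shard and the output bucket layout are hypothetical):

```python
import hail as hl

ht = hl.read_table("gs://gnomad-readviz/v3_and_v3.1/gnomad_v3_1_readviz_crams_exploded_keyed_by_sample.ht/")

for sample_id in sample_ids_in_this_shard:  # hypothetical: the subset of samples assigned to this cluster
    # Filtering on the table key is cheap; export writes one bgzipped .tsv per sample.
    sample_ht = ht.filter(ht.S == sample_id)
    sample_ht.export(f"gs://gnomad-readviz/v3_and_v3.1/per_sample_tsvs/{sample_id}.tsv.bgz")  # hypothetical path
```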
To prepare for downstream steps, generate a single .tsv with 1 row per sample containing the paths of that sample's cram and tsv file (from step3). The column names are:
sample_id, variants_tsv_bgz, cram_path, crai_path
NOTE: The #%% markers in the code allow this to be run as an interactive notebook when using IntelliJ Scientific Mode.
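A hedged sketch of how such a table could be assembled with pandas (the cram_paths lookup, bucket layout, and output filename are hypothetical):

```python
import pandas as pd

# cram_paths: hypothetical {sample_id: cram_path} lookup; how crams are located is not shown here.
rows = [
    {
        "sample_id": sample_id,
        "variants_tsv_bgz": f"gs://gnomad-readviz/v3_and_v3.1/per_sample_tsvs/{sample_id}.tsv.bgz",  # from step3
        "cram_path": cram_path,
        "crai_path": f"{cram_path}.crai",
    }
    for sample_id, cram_path in cram_paths.items()
]

pd.DataFrame(rows, columns=["sample_id", "variants_tsv_bgz", "cram_path", "crai_path"]).to_csv(
    "step4_output__cram_and_tsv_paths.tsv", sep="\t", index=False
)
```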
This is the step that actually runs HaplotypeCaller to generate a bamout for each sample, using an interval list with small windows around the variants selected for that sample. The step5 python file is a Batch pipeline. It uses the docker image built from docker/Dockerfile by running cd docker; make all.
Instead of using this Batch pipeline, step5 can also be done on Terra (for convenience or cost reasons, depending on how Batch and Terra pricing evolves). For the Terra/cromwell pipeline, there's a .wdl in terra/wdl/GetReadVizData.wdl.
Both the Batch pipeline and Terra take the .tsv generated in step4 to determine the pipeline steps - each row in that .tsv will correspond to a parallel job that runs HaplotypeCaller on the raw cram for a single sample, after using that sample's selected variants tsv (generated in step3) to compute the interval list to pass to HaplotypeCaller using -L.
NOTE: it can be important to update the docker image and pipeline to use the same GATK version and HaplotypeCaller arguments that were used to generate the variant callset. I typically check the HaplotypeCaller version in a gvcf header, and double-check with whoever generated the callset that there weren't any unusual non-default HaplotypeCaller args used.
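To illustrate the per-sample work, here is a hedged sketch of building a padded interval list from a sample's step3 tsv and the kind of HaplotypeCaller command each job then runs. The window size, file names, and extra GATK args are assumptions, not the exact pipeline values.

```python
import gzip

PADDING = 200  # bp of padding around each variant (assumption)

def write_interval_list(variants_tsv_bgz: str, intervals_path: str) -> None:
    """Convert a sample's selected-variants tsv (with chrom/pos columns) into a GATK -L intervals file."""
    with gzip.open(variants_tsv_bgz, "rt") as f, open(intervals_path, "w") as out:
        header = f.readline().rstrip("\n").split("\t")
        for line in f:
            fields = dict(zip(header, line.rstrip("\n").split("\t")))
            chrom, pos = fields["chrom"], int(fields["pos"])
            out.write(f"{chrom}:{max(1, pos - PADDING)}-{pos + PADDING}\n")

write_interval_list("sample1.tsv.bgz", "sample1.intervals")

# The command each Batch job (or Terra task) runs - match the GATK version and
# HaplotypeCaller args that were used to generate the callset (see NOTE above).
gatk_command = (
    "gatk HaplotypeCaller "
    "-R Homo_sapiens_assembly38.fasta "
    "-I sample1.cram "
    "-L sample1.intervals "
    "-bamout sample1.bamout.bam "
    "-O sample1.gvcf.gz -ERC GVCF"
)
```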
Like step4, this step interactively generates a single tsv with per-sample file paths to pass into steps 7 & 8. Unlike step4, it contains paths for bamouts generated in step5 instead of the raw crams. The step6 output table contains the following columns:

sample_id, output_bamout_bam, output_bamout_bai, variants_tsv_bgz, exclude_variants_tsv_bgz
Also, if issues are later found in readviz visualizations that require certain samples and/or variants to be removed, this step allows for that filtering. Samples that are excluded from the output table will not end up in the final output produced by step8. In addition, the exclude_variants_tsv_bgz column optionally allows another per-sample .tsv to provide a list of variants that should be discarded. This variant filtering happens in step7 as the bam is being reprocessed to strip metadata, readgroup names, etc.
NOTE: The #%% markers in the code allow this to be run as an interactive notebook when using IntelliJ Scientific Mode.
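A hedged pandas sketch of assembling the step6 table, with hypothetical samples_to_exclude and exclude_variants_tsvs inputs and an assumed bamout bucket layout:

```python
import pandas as pd

df = pd.read_table("step4_output__cram_and_tsv_paths.tsv")  # the step4 output sketched above
df["output_bamout_bam"] = "gs://gnomad-readviz/v3_and_v3.1/bamouts/" + df["sample_id"] + ".bamout.bam"  # assumed layout
df["output_bamout_bai"] = df["output_bamout_bam"] + ".bai"

# Samples dropped here never reach the final step8 output.
df = df[~df["sample_id"].isin(samples_to_exclude)]  # hypothetical set of problematic samples

# Optional per-sample list of variants to discard during step7 reprocessing (NaN when there is none).
df["exclude_variants_tsv_bgz"] = df["sample_id"].map(exclude_variants_tsvs)  # hypothetical dict

df[[
    "sample_id", "output_bamout_bam", "output_bamout_bai",
    "variants_tsv_bgz", "exclude_variants_tsv_bgz",
]].to_csv("step6_output__bamout_and_tsv_paths.tsv", sep="\t", index=False)
```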
This is another Batch pipeline that takes each bamout from step5 and runs the deidentify_bamout.py script on it. The script removes extraneous or sensitive metadata, obfuscates and shortens read names, and discards all read tags except an opaque read group that is unique to each variant and sample.
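A simplified, hedged sketch of that kind of cleanup using pysam (the real deidentify_bamout.py assigns a separate read group per variant and also records sqlite bookkeeping, none of which is shown here):

```python
import pysam

def deidentify(in_bam: str, out_bam: str, read_group_id: str) -> None:
    with pysam.AlignmentFile(in_bam, "rb") as bam_in:
        # Rebuild a minimal header: keep only the sequence dictionary plus one opaque read group.
        header = {
            "SQ": bam_in.header.to_dict()["SQ"],
            "RG": [{"ID": read_group_id, "SM": read_group_id}],
        }
        with pysam.AlignmentFile(out_bam, "wb", header=header) as bam_out:
            for i, read in enumerate(bam_in):
                read.query_name = str(i)                 # obfuscate and shorten read names
                read.set_tags([("RG", read_group_id)])   # drop all other read tags
                bam_out.write(read)

deidentify("sample1.bamout.bam", "sample1.deidentified.bam", read_group_id="rg_0")
```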
This is the last Batch pipeline - it pancakes the single-sample deidentified bams from step7 into bams with many (currently 50) samples, so that it's not possible to tell which variant came from which sample. Also, it combines the per-sample sqlite databases from step7 into 1 database per chromosome. Both the output bams and sqlite databases can be made public, and are directly accessed by the gnomAD browser.
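For illustration, here is a hedged sketch of the kind of per-chromosome sqlite lookup these databases provide (the table and column names are assumptions, not the browser's actual schema):

```python
import sqlite3

conn = sqlite3.connect("readviz_tracks.chr1.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS variant_tracks (
        chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,
        het_or_hom_or_hemi INTEGER,
        combined_bam_path TEXT,     -- which pancaked bam contains this track
        read_group_id TEXT          -- which read group within that bam
    )
""")
# Placeholder row: one readviz track for one variant, pointing at its pancaked bam.
conn.execute(
    "INSERT INTO variant_tracks VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("chr1", 12345, "A", "T", 1, "gs://.../combined_bams/chr1/combined_000.bam", "rg_0"),
)
conn.commit()
conn.close()
```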