HPP Production Workflows

This repository holds WDL workflows and Docker build scripts for the production workflows used by the Human Pangenome Reference Consortium (HPRC) for data QC, assembly generation, and assembly QC.

All WDLs and containers created in this repository are licensed under the MIT license. The underlying tools that the WDLs and containers run are likely covered by one or more Free and Open Source Software licenses, but we cannot make any guarantees to that effect.


Repository Organization

Workflows are split across the data_processing, assembly, and (assembly) QC folders, each of which has the following folder structure:

 ├── docker/
 │   └── toolName/
 │       ├── Dockerfile
 │       ├── Makefile
 │       └── scripts/
 │           └── toolName/
 │               └── scriptName.py
 └── wdl/
     ├── tasks/
     │   └── taskName.wdl
     └── workflows/
         └── workFlowName.wdl

The root of each of the data_processing, assembly, and (assembly) QC folders contains a readme with details about the workflows and how to use them. Summaries of the workflows in each area are below.


Workflow Types

Data Processing

The HPRC produces HiFi, ONT, and Illumina Hi-C data. Each data type has a workflow to check data files to ensure they pass QC.

  • HiFi QC Workflow
    • Check for file-level sample swaps with NTSM
    • Calculate coverage (Gbp) and insert size (N50) metrics from fastqs/bams using in-house tooling
    • Check for methylation and kinetics tags (in progress)
  • ONT QC Workflow
    • Check for file-level sample swaps with NTSM
    • Calculate coverage (Gbp) and insert size (N50) metrics from summary files using in-house tooling
  • Hi-C QC Workflow
    • Check for file-level sample swaps with NTSM
    • Calculate total bases for the data file
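The coverage and N50 metrics above can be sketched as follows. This is a minimal illustration of the two calculations, not the HPRC in-house tooling; the read lengths are invented example values.

```python
# Minimal sketch of the coverage (Gbp) and read-length N50 metrics
# computed by the QC workflows. Illustrative only, not the actual tooling.

def n50(lengths):
    """Largest length L such that reads of length >= L hold half the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length
    return 0

def coverage_gbp(lengths):
    """Total bases in the dataset, reported in gigabase pairs."""
    return sum(lengths) / 1e9

read_lengths = [15_000, 18_000, 20_000, 22_000, 25_000]  # hypothetical HiFi reads
print(n50(read_lengths))          # 20000
print(coverage_gbp(read_lengths)) # 0.0001
```

Dividing coverage in Gbp by an assumed genome size gives the familiar fold-coverage number reported in QC summaries.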

Assembly

Assemblies are produced with one of two Hifiasm workflows, both of which use HiFi and ONT ultralong reads. Phasing comes from Illumina Hi-C data in the Hi-C workflow and from parental Illumina data in the trio workflow. The major steps in the assembly workflows are:

  • Yak to create k-mer databases for trio-phased assemblies
  • Cutadapt to filter adapters from HiFi reads
  • Hifiasm run with HiFi and ONT ultralong reads, with trio or Hi-C phasing
  • Yak for sex chromosome assignment in Hi-C phased assemblies

In addition to the Hifiasm workflows, there is an assembly cleanup workflow which:

  • Removes contamination with NCBI's FCS
  • Removes mitochondrial contigs
  • Runs MitoHiFi to assemble mitochondrial contigs
  • Assigns chromosome labels to fasta headers of T2T contigs/scaffolds
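The contig-removal steps above amount to dropping flagged records from an assembly FASTA. The sketch below shows the idea with a hypothetical name set; the production workflow drives NCBI's FCS and MitoHiFi directly rather than using code like this.

```python
# Sketch of the cleanup idea: drop contigs flagged by a contamination screen
# (or identified as mitochondrial) from an assembly FASTA.
# The drop_names set is hypothetical; FCS reports the real names.

def filter_contigs(fasta_lines, drop_names):
    """Yield FASTA lines, skipping records whose header name is in drop_names."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            name = line[1:].split()[0]
            keep = name not in drop_names
        if keep:
            yield line

fasta = [">contig1", "ACGT", ">contig2 putative-mito", "TTTT", ">contig3", "GGCC"]
print(list(filter_contigs(fasta, {"contig2"})))
# ['>contig1', 'ACGT', '>contig3', 'GGCC']
```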

Polishing

Assemblies are polished using a custom pipeline based around DeepPolisher. The polishing workflow WDL can be found at polishing/wdl/workflows/hprc_DeepPolisher.wdl. The major steps in the HPRC assembly polishing pipeline are:

  • Alignment of all HiFi reads to the diploid assembly using minimap2
  • Alignment of all ONT UL reads > 100kb separately to each haplotype assembly using minimap2
  • The PHARAOH pipeline, which ensures optimal HiFi read phasing by leveraging ONT UL information to assign HiFi reads to the correct haplotype in stretches of homozygosity longer than 20kb
  • DeepPolisher, an encoder-only transformer model run on the PHARAOH-corrected HiFi alignments to predict polishing edits in the assemblies
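The ONT UL length gate in the second step can be sketched as below. The read records are hypothetical tuples; in the actual workflow this selection happens on real read files before minimap2 alignment.

```python
# Sketch of the ONT UL length gate described above: only reads longer than
# 100kb are aligned separately to each haplotype. Illustrative only.

MIN_UL_LENGTH = 100_000  # 100kb threshold stated in the pipeline description

def select_ul_reads(reads):
    """Keep (name, sequence) records whose sequence exceeds MIN_UL_LENGTH."""
    return [(name, seq) for name, seq in reads if len(seq) > MIN_UL_LENGTH]

reads = [("short_read", "A" * 50_000), ("ultralong_read", "A" * 150_000)]
print([name for name, _ in select_ul_reads(reads)])  # ['ultralong_read']
```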

QC

Automated Assembly QC

Assembly QC is broken down into two types:

  • standard_qc: these tools are relatively fast to run and provide insight into the completeness, correctness, and contiguity of the assemblies.
  • alignment_based_qc: these tools rely on long-read alignment of a sample's reads to its own assembly. The alignments are then used to identify unexpected variation that indicates misassembly.
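One common alignment-based signal is read depth: windows whose coverage departs sharply from the assembly-wide median are candidate misassembly sites. The sketch below illustrates that idea with invented per-window depths and thresholds; it is not the pipeline's actual tooling.

```python
# Sketch of a depth-based misassembly signal: flag windows whose read depth
# is far from the assembly-wide median. Depths and thresholds are invented.
from statistics import median

def flag_windows(depths, low=0.5, high=2.0):
    """Return indices of windows with depth below low*median or above high*median."""
    m = median(depths)
    return [i for i, d in enumerate(depths)
            if d < low * m or d > high * m]

window_depths = [30, 32, 31, 2, 29, 70, 30]  # hypothetical per-window coverage
print(flag_windows(window_depths))  # [3, 5]
```

Window 3 (near-zero depth) suggests a gap or collapse; window 5 (doubled depth) suggests a falsely duplicated region.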

The following tools are included in the standard_qc pipeline:

The following tools are included in the alignment_based_qc pipeline:


Running WDLs

If you haven't run a WDL before, there are good resources online to get started. You first need to choose a way to run WDLs. Below are a few options:

  • Terra: An online platform with a GUI which can run workflows in Google Cloud or Microsoft Azure.
  • Cromwell: A workflow runner from the Broad Institute with support for Slurm, local compute, and multiple cloud platforms.
  • Toil: A workflow runner from UCSC with support for Slurm, local compute, and multiple cloud platforms.

Running with Cromwell

Before starting, read the Cromwell 5 minute intro.

Once you've done that, download the latest version of cromwell and make it executable. (Replace XY with newest version number)

wget https://github.com/broadinstitute/cromwell/releases/download/XY/cromwell-XY.jar
chmod +x cromwell-XY.jar

And run your WDL:

java -jar cromwell-XY.jar run \
   /path/to/my_workflow.wdl \
   -i my_workflow_inputs.json \
   > run_log.txt

Input files

Each workflow requires an input json. You can create a template using womtool:

java -jar womtool-XY.jar \
    inputs \
    /path/to/my_workflow.wdl \
    > my_workflow_inputs.json
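The womtool template leaves type placeholders (e.g. "String", "File") as values, which must be replaced with real inputs before running. A minimal sketch of filling one in; the workflow and key names here are hypothetical, not from an actual HPRC WDL:

```python
# Sketch: replace womtool's type placeholders with real input values.
# "myWorkflow.*" keys are hypothetical examples, not real HPRC inputs.
import json

template = {
    "myWorkflow.sampleName": "String",
    "myWorkflow.inputReads": "File",
}

filled = dict(template)
filled["myWorkflow.sampleName"] = "HG002"
filled["myWorkflow.inputReads"] = "/data/HG002.hifi.fastq.gz"

print(json.dumps(filled, indent=2))
```

The resulting JSON is what gets passed to Cromwell via the -i flag shown above.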
