This repository contains workflows and data for constructing pangenomes from Saudi population genomic data using the minigraph-cactus pipeline. The workflows are implemented using Common Workflow Language (CWL), enabling reproducible and portable execution across different computing environments.
Minigraph-cactus is a state-of-the-art pipeline that combines the efficiency of minigraph with the accuracy of Progressive Cactus for constructing pangenomes. This hybrid approach offers several advantages:
- Scalable construction of genome graphs from multiple assemblies
- Preservation of structural variants and complex genomic regions
- Efficient handling of large-scale genomic data
- Integration of both reference-based and reference-free approaches
The pipeline first uses minigraph to create initial graphs, followed by Progressive Cactus's sophisticated algorithms for multiple genome alignment, resulting in high-quality pangenome representations.
CWL is a specification for describing analysis workflows and tools. Its key features include:
- Platform independence
- Explicit declaration of dependencies
- Built-in support for Docker/container technologies
- Scalability from single workstations to HPC environments
- Clear separation of tools from the workflow logic
You need to install Docker or Singularity.
# Install required software
pip install cwltool
git clone https://github.com/bio-ontology-research-group/pangenome
cwl-runner variant-calling/main-vg.cwl inputs.yml
Create a YAML file with your inputs:
graph:
class: File
location: reference/ksa_hg38.gbz
reads1:
class: File
location: input/sample_R1.fastq.gz
reads2:
class: File
location: input/sample_R2.fastq.gz
hapl:
class: File
location: reference/ksa_hg38.hapl
ref_prefix: GRCh38
model_type: WGS
ref:
class: File
location: reference/hg38.fa
secondaryFiles:
- class: File
location: reference/hg38.fa.fai
output_cram: aligned_reads.cram
output_vcf: sample.vcf
output_gvcf: sample.gvcf
aligned_reads.cram
: Alignment against GRCh38sample.gvcf
: Genomic VCFsample.vcf
: Variant calls
If you use this pipeline in your research, please cite:
The data is available under CC0
Please create GitHub issues if you encounter any problems.