Minigraph/CACTUS v1.0

The CACTUS pangenome pipeline adds base-level alignments to the minigraph graphs above (so both GRCh38- and CHM13-based graphs are available).

Graphs and associated files are summarized below.

| Description | GRCh38 Graph | CHM13 Graph |
|---|---|---|
| graph | gfa | gfa |
| Decomposed VCF | VCF, VCF index | |
| Pangenie-ready VCF | VCF, VCF index | |
| Raw VCF | VCF, VCF index | VCF, VCF index, VCF(CHM13), VCF(CHM13) index |
| multiple alignment | HAL | HAL |
| sequences clipped out before alignment | masking | masking |
| VG indexes | xg, snarls, trans | xg, snarls, trans |
| Giraffe indexes | dist, min, gg, gbwt, dist(vg<1.44.0), min(vg<1.44.0) | dist, min, gg, gbwt, dist(vg<1.44.0), min(vg<1.44.0) |

The graphs are available in GFA format alongside other graph and index files. Documentation for the associated file formats is grouped as follows:

graph formats: xg/gg
index formats: gbwt/dist/min
snarls format: snarls

VCF Decomposition

The Raw VCF files contain one site for each bubble in the graph, so nested bubbles produce overlapping sites. The nesting relationships are recorded in the PS (parent snarl), LV (level) and AT (allele traversal) tags and must be taken into account when interpreting the VCF. Alternatively, you can use the "Decomposed VCFs", which have been normalized by using vcfbub to "pop" bubbles with alleles larger than 100 kb and vcfwave to realign each alt allele to the reference (script). The "Pangenie-ready VCF" was created using a different decomposition that does not use realignment (description, intermediate files), with the aim of optimizing genotyping performance with Pangenie.
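As a rough illustration of taking the nesting tags into account, the sketch below keeps only top-level sites from a Raw VCF by reading the LV tag from the INFO column. It assumes that LV=0 marks a top-level bubble and that a record without an LV tag can be treated as top level; both are assumptions to check against the VCF header. The function names `parse_info` and `top_level_sites` are hypothetical helpers, not part of any released tool.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'AT=>1>2;LV=0' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue
        # Split on the first '=' only; flag-style entries have no value.
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def top_level_sites(vcf_lines):
    """Yield (chrom, pos, site_id) for records whose LV tag is 0."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        info = parse_info(fields[7])
        # Assumption: a missing LV tag is treated as top level.
        if int(info.get("LV", 0)) == 0:
            yield fields[0], int(fields[1]), fields[2]
```

Nested sites (LV greater than 0) carry a PS tag naming their parent, so the same `parse_info` helper could be used to group child sites under the parent site they overlap.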

Filtered Graphs

The Giraffe short read mapper relies on the graph's snarl decomposition. The versions of the Minigraph/Cactus graphs released here contain some spurious large deletion edges that make this decomposition less efficient, which increases Giraffe runtime. Furthermore, we have found that for calling small variants with the Giraffe-DeepVariant pipeline, accuracy improves if all alleles with frequency < 10% are removed from the graph before indexing. Two filtered versions of each of the two Minigraph/Cactus graphs are available here. The graphs with maxdel.10mb in the name (recommended to speed up general mapping experiments) were created by removing edges that imply deletions > 10 Mb. The graphs with minaf.0.1 in the name (recommended for use with DeepVariant) were created by removing, in addition to those deletion edges, nodes that are covered by fewer than 9 haplotypes. To use vg versions older than v1.44.0 with these graphs, download the .dist.old and .min.old indexes and rename them to .dist and .min (update 4/18/2023: All .dist and .min indexes have been regenerated using a patched vg to fix a speed regression. They remain compatible with vg versions >= v1.44.0).
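The renaming step for older vg versions can be sketched as follows. This is a minimal example that assumes the .dist.old and .min.old files were downloaded into a single directory; the function name `use_old_indexes` is hypothetical. Note that it overwrites any existing .dist or .min file with the same base name, so keep the newer indexes elsewhere if you still need them.

```python
from pathlib import Path


def use_old_indexes(graph_dir):
    """Rename *.dist.old and *.min.old in graph_dir to *.dist and *.min.

    Returns the new file names, sorted by their original names.
    """
    renamed = []
    # sorted() materializes the listing before any file is renamed.
    for old in sorted(Path(graph_dir).iterdir()):
        if old.name.endswith((".dist.old", ".min.old")):
            target = old.with_name(old.name[: -len(".old")])
            # replace() overwrites an existing target on all platforms.
            old.replace(target)
            renamed.append(target.name)
    return renamed
```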

Masked Sequence

Highly repetitive sequence, such as that found in centromeres, was excluded from the Minigraph/Cactus graphs using the following process. dna-brnn was first run with its default parameters and model to identify alpha satellite and hsat 2/3 regions > 100 kb, which were clipped out of the input fasta files. Gaps > 100 kb between minigraph mappings were likewise removed. Any remaining contigs or contig fragments that could not be assigned to a reference chromosome were excluded. Finally, gaps > 10 kb left unaligned after Cactus were removed. Please note that no sequence was removed from the reference genome of either graph. Each removed interval, along with the step that removed it, is listed in the following files:

regions removed from GRCh38-based graph: hprc-v1.0-mc-grch38.clipped-intervals.bed.gz
regions removed from CHM13-based graph: hprc-v1.0-mc-chm13.clipped-intervals.bed.gz
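To get a sense of how much sequence each step removed, the intervals in these files can be totalled per step. The sketch below assumes a tab-separated BED layout with the sequence name, start and end in the first three columns and the name of the removal step in the fourth; the column layout is an assumption, so inspect the file before relying on it. The function name `removed_bases_by_step` is hypothetical.

```python
import gzip
from collections import defaultdict


def removed_bases_by_step(bed_path):
    """Sum interval lengths in a gzipped BED file, grouped by column 4."""
    totals = defaultdict(int)
    with gzip.open(bed_path, "rt") as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank and comment lines
            fields = line.rstrip("\n").split("\t")
            start, end = int(fields[1]), int(fields[2])
            # Assumption: the fourth column names the removal step.
            step = fields[3] if len(fields) > 3 else "unknown"
            # BED intervals are half-open, so the length is end - start.
            totals[step] += end - start
    return dict(totals)
```

For example, `removed_bases_by_step("hprc-v1.0-mc-grch38.clipped-intervals.bed.gz")` would return a dictionary mapping each step label found in the file to the number of bases it removed.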
