The Cactus pangenome pipeline adds base-level alignments to the minigraph graphs above, so both GRCh38- and CHM13-based graphs are available.
The graphs and associated files are summarized below.
Description | GRCh38 Graph | CHM13 Graph |
---|---|---|
Graph | gfa | gfa |
Decomposed VCF | VCF, VCF index | |
Pangenie-ready VCF | VCF, VCF index | |
Raw VCF | VCF, VCF index | VCF, VCF index, VCF (CHM13), VCF (CHM13) index |
Multiple alignment | HAL | HAL |
Sequences clipped out before alignment | masking | masking |
VG indexes | xg, snarls, trans | xg, snarls, trans |
Giraffe indexes | dist, min, gg, gbwt, dist (vg<1.44.0), min (vg<1.44.0) | dist, min, gg, gbwt, dist (vg<1.44.0), min (vg<1.44.0) |
The graphs are available in gfa format alongside other graph and index files. Information about the associated file formats can be found:
- graph formats: xg/gg
- index formats: gbwt/dist/min
- snarls format: snarls
The Raw VCF files contain a site for each bubble in the graph. Nested bubbles result in overlapping sites. The nesting relationships are denoted with the `PS` (parent snarl), `LV` (level) and `AT` (allele traversal) tags and need to be taken into account when interpreting the VCF. Alternatively, you can use the "Decomposed VCFs", which have been normalized by using vcfbub to "pop" bubbles with alleles larger than 100 kb and vcfwave to realign each alt allele to the reference (script). The "Pangenie-ready VCF" was created using a different decomposition that does not use realignment (description, intermediate files), with the aim of optimizing genotyping performance with Pangenie.
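As a rough illustration of why the nesting tags matter, the sketch below keeps only top-level sites (those with `LV=0`) from a Raw VCF so that overlapping nested sites are not counted twice. It is a minimal example, not part of the release tooling: the function names `parse_info` and `top_level_sites` are hypothetical, the example records are made up, and treating a record with no `LV` tag as top-level is an assumption.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'LV=1;PS=>1>6' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        # Flag-style entries (no '=') are stored as True.
        info[key] = value if sep else True
    return info


def top_level_sites(vcf_lines):
    """Yield (chrom, pos, site_id) for records whose LV tag is 0.

    Records with LV > 0 are nested inside the snarl named by their PS tag,
    so they overlap a parent site and are skipped here.
    Assumption: a record without an LV tag is treated as top-level.
    """
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        info = parse_info(fields[7])
        if int(info.get("LV", 0)) == 0:
            yield fields[0], int(fields[1]), fields[2]


# Made-up records for illustration only; not taken from the released VCFs.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t>1>6\tACGT\tA\t60\tPASS\tAT=>1>2>3>5>6,>1>6;LV=0\n",
    "chr1\t101\t>2>5\tC\tG\t60\tPASS\tAT=>2>3>5,>2>4>5;LV=1;PS=>1>6\n",
]

for site in top_level_sites(example):
    print(site)
```

Filtering on `LV` alone discards the variation described by nested sites; an analysis that needs it would instead follow each record's `PS` tag back to its parent, or start from the Decomposed VCFs.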
The Giraffe short read mapper relies on the graph's snarl decomposition. The versions of the Minigraph/Cactus graphs released here contain some spurious large deletion edges that make this decomposition less efficient, which impacts Giraffe runtime. Furthermore, we have found that for calling small variants with the Giraffe-DeepVariant pipeline, accuracy is improved if all alleles with frequency < 10% are removed from the graph before indexing. Two filtered versions of each of the two Minigraph/Cactus graphs are available here. The graphs with `maxdel.10mb` in the name (recommended to speed up general mapping experiments) were created by removing edges that imply deletions > 10 Mb, and the graphs with `minaf.0.1` in the name (recommended when using with DeepVariant) were created by removing, in addition to the deletions, nodes that are covered by fewer than 9 haplotypes. To use vg versions older than v1.44.0 with these graphs, download the `.dist.old` and `.min.old` indexes and rename them to `.dist` and `.min`. (Update 4/18/2023: all `.dist` and `.min` indexes have been regenerated using a patched vg to fix a speed regression. They remain compatible with vg versions >= v1.44.0.)
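The renaming step for older vg versions can be scripted. The following is a minimal sketch, assuming the `.dist.old` and `.min.old` files have already been downloaded into a single local directory; the function name `use_old_indexes` and the directory argument are hypothetical. Note that it overwrites any `.dist` or `.min` file with the same base name in that directory.

```python
from pathlib import Path


def use_old_indexes(graph_dir):
    """Rename every *.dist.old / *.min.old file to *.dist / *.min.

    graph_dir is a hypothetical local directory holding the downloaded
    indexes. Existing files with the target names are overwritten.
    Returns the list of new paths.
    """
    renamed = []
    for old in sorted(Path(graph_dir).iterdir()):
        if old.name.endswith((".dist.old", ".min.old")):
            target = old.with_suffix("")  # strips only the trailing ".old"
            old.replace(target)           # rename, replacing any existing file
            renamed.append(target)
    return renamed
```

Keeping the old-format indexes in their own directory avoids clobbering the regenerated `.dist` and `.min` files that newer vg versions expect.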
Highly repetitive sequence, such as that found in centromeres, was excluded from the Minigraph/Cactus graphs using the following process. dna-brnn was first run with its default parameters and model to identify alpha satellite and HSat2/3 regions >100kb, which were clipped out of the input fasta files. Gaps >100kb between minigraph mappings were likewise removed. Any remaining contigs or contig fragments that could not be assigned to a reference chromosome were excluded. Finally, gaps >10kb left unaligned after Cactus were removed. Please note that no sequence was removed from the reference genome of either graph. Each removed interval, along with the step that removed it, is available:
- regions removed from GRCh38-based graph: hprc-v1.0-mc-grch38.clipped-intervals.bed.gz
- regions removed from CHM13-based graph: hprc-v1.0-mc-chm13.clipped-intervals.bed.gz
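To see how much sequence each clipping step removed, the interval files can be summarized with a short script. This is only a sketch: it assumes the files are tab-separated BED with the removal step recorded as a label in the fourth column, which should be checked against the actual files, and the function name `removed_bases_by_step` is hypothetical.

```python
import gzip
from collections import defaultdict


def removed_bases_by_step(bed_gz_path):
    """Sum interval lengths per label in column 4 of a gzipped BED file.

    Assumed columns (tab-separated): contig, start, end, step label.
    BED coordinates are 0-based and half-open, so length = end - start.
    Rows without a fourth column are grouped under "unlabeled".
    """
    totals = defaultdict(int)
    with gzip.open(bed_gz_path, "rt") as handle:
        for line in handle:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            fields = line.rstrip("\n").split("\t")
            start, end = int(fields[1]), int(fields[2])
            step = fields[3] if len(fields) > 3 else "unlabeled"
            totals[step] += end - start
    return dict(totals)
```

For example, `removed_bases_by_step("hprc-v1.0-mc-grch38.clipped-intervals.bed.gz")` would return a dictionary mapping each step label to the total number of bases it removed, provided the column layout matches the assumption above.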