Skip to content

Latest commit

 

History

History
30 lines (26 loc) · 1.77 KB

README.md

File metadata and controls

30 lines (26 loc) · 1.77 KB

A tandem repeat (TR) catalog generated from high-quality long-read human genome assemblies

This repository keeps the analysis scripts that were used to generated the TR catalog from public diploid long-read human genome assemblies from the following data soucres:

  1. Human Pangenome Reference Consortium (HPRC)
  2. Human Genome Structural Variation Consortium (HGSVC2)
  3. 1000G ONT Sequencing Consortium

Workflow

workflow

Mapping of TRs from assemblies to the reference genome

Catalog

v1 v2

  • haplotype names separated by semi-colons are shown in first header line preceded by '#'
  • column descriptions:
Column Description
chrom chromosome
start start coordinate
end end coordinate
motif consensus repeat motif
copy_numbers copy numbers in haplotypes separated by semi-colons ('-' for missing genotypes)
sizes sizes (bp) in haplotypes separated by semi-colons ('-' for missing genotypes)
motifs motifs in haplotypes separated by semi-colons ('-' for missing genotypes)
max_change maximum change (of all haplotypes) in size (bp) substracted from reference genome size
num_samples number of samples with genotype
num_calls number of haplotypes with genotype
motif_frequency number of haplotypes associated with each motif observed e.g. CAG(10);CAA(2)
feature gene element overlapped. Format: gene|transcript|, where = exon#|intron#|utr5|utr3|cds|promoter|exon_bound (exon boundary)