[workplan] SV index and read kmerizing #10

edawson · 2020-10-12T18:02:24Z

Outline of discussion at team meeting this morning regarding design / implementation / work distribution

The brief description:

Load each SV into a set of indexes.
Assign each read to one (or few) SV(s).
Tally the number of reads that support each SV and output a call (either presence/absence or genotype) for each SV.

Detailed description:

Generate ComposedSequences from input SVs
a. Attach flanking sequence from reference genome to "alt" SV allele (Alt allele for INS, empty seq for DEL)
b. Create a PositionedSequence structure, which contains the chrom / pos and ComposedSeq
c. Return a sequence of PositionedSequence structures
d. (NEEDED) return the hash/identifier of the SV
Kmerize the ComposedSequences
a. Write a function kmerize_composed_sequence(PositionedSequence) -> seq(kmers)
b. (stretchgoal) Write a function kmerize_composed_sequence_with_position(PositionedSequence) -> tuple(seq(kmers), seq(position))
Create a SVIndex, which contains information for SVs / SV kmers and allows fast lookups.
a. Create datastructures:
- CountTable [kmer] -> count
- hash_table [kmer] -> seq(SV)
- (stretchgoal) (genotyping) hash_table [SV] -> vcf::Variant
- (stretchgoal) (genotyping) hash_table[SV] -> seq(Alleles)
- (stretchgoal) hash_table[Position] -> seq(SV)
b. Load kmers from Step 2 into index.
For each kmer in SV:
- CountTable[kmer] ++
- hash_table[kmer].add(SV)
c. (stretchgoal) (genotyping) Load SV alleles into hash_tables.
d. Serialize these tables to disk.
Remove reference kmers
a. Stream over the reference genome. For each kmer:
- get the count of that kmer in the SVIndex
- IFF the kmer's count is > 0, it has been observed in an SV. Mask SV kmers observed in the reference by setting the count to INTMAX.
  b. Verify that the following conditions are met:
  1. Querying the SVIndex::CountTable for a kmer that is in an SV but NOT in the ref returns a value v (0 < v < INTMAX)
  2. Querying the SVIndex::CountTable for a kmer that is in an SV AND in the ref returns a value of INTMAX
  3. Querying the SVIndex::CountTable for a kmer that is not in an SV returns a value of 0.
Classify each read to 0, 1, or multiple SVs.
a. Implement a Read datastructure that can track compatible SVs for that read.
e.g., CountTable[SV] -> count
b. For each kmer in a read:
a. Query the SVIndex::KmerCountTable. If the kmer's frequency is 0, INTMAX, or V > threshold_V, ignore the kmer (i.e., continue the loop)
b. Query the SVIndex::KmerSVTable. For each SV in the returned seq(SV), increment the read's [SV] -> count
c. Create a function that returns the classification of the read (e.g., classify() -> seq(SV) or classify() -> SV)
- Some proposed classification strategies include just taking the max of the read's compatible SV counts, or filtering counts above some min.
Track read supports per-sample
a. Implement a CountTable from SV -> readcounts
b. For each read, increment the ReadSupportCountTable[SV] for each of the SVs returned by classify().
c. Output presence / absence of a given SV based on read supports.
- e.g., Read in the SV VCF. For each SV, look up its counts in the ReadSupportsCountTable. Annotate the VCF variant's INFO field with a flag or value.
(stretchgoal) Implement genotyping
a. Rather than track CountTable (SV) -> readcounts, track CountTable (Allele) -> readcounts. This will require adding structures to track kmers from specific alleles.

History

edawson added braindump documentation Improvements or additions to documentation labels Oct 12, 2020

edawson pushed a commit that referenced this issue Oct 18, 2021

Merge pull request #10 from fritzsedlazeck/history

26b0f1e

History

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[workplan] SV index and read kmerizing #10

[workplan] SV index and read kmerizing #10

edawson commented Oct 12, 2020

[workplan] SV index and read kmerizing #10

[workplan] SV index and read kmerizing #10

Comments

edawson commented Oct 12, 2020