[indexing] Position Aware SV Equivalence Classes #9

edawson · 2020-10-12T03:09:25Z

Using #8 , I'm imagining one possible indexing/query scheme below (loosely based on Kallisto's equivalence classes):

Maintain two hash tables for each SV:
SV -> set(allele sequences)
SV -> count(supporting reads)

For each PositionedSequence, generate the kmers of the sequence, tracking the global position (i.e., chrom and position) for each kmer.

Insert each kmer into the following hash tables:
kmer -> set(SV)
kmer -> set(position)
kmer -> frequency(kmer)

Place the position of each kmer into an IntervalTree, or a hash table of floor(variant.position % 10 kilobases):
position (+/- slop) -> set(SV)

This provides a query index for all SVs in the VCF. The number of kmers grows with the number/size/complexity of the input SVs.

Now iterate over a set of query reads. For every read, maintain:
a set of compatible positions
a counting set of compatible SVs

For each kmer in a given read:

Retrieve the frequency of the kmer; if it is less than a set threshold, proceed, otherwise continue to the next kmer.
Retrieve the kmer's set of compatible SVs. Add these to the read's counting set.
Retrieve the kmer's set of compatible positions. Add these to the read's position set.

Each read now has a set of compatible SVs / positions based on frequency-filtered kmers. One could stop here and proceed (as kallisto does) and try to estimate the most probable SV.

However, there are probably still many erroneous SV matches. Further filter the compatible SVs as follows:

Filter the compatible SVs by frequency. In the simplest form, take the max of the counting set as the matched SV. Alternatively, trim any SVs in the compatible set which have fewer than some number of counts (essentially a filter on the minimum number of matched kmers to that SV).
Filter the compatible SVs based on position. Compute the number of overlapping positions, allowing slop, for each position in the compatible set. Remove any positions which do not overlap with at least one other position. Next, generate a set of compatible SVs from these positions. Compute the intersection of this set and the compatible SV set of the read. This step will likely fail completely when SVs overlap significantly.

If a read's compatible SV set contains only a single SV, or an SV in its compatible set reaches a given threshold of supporting counts, Increment the read support counter for the SV to reflect that the read supports that SV.

Edit 1: I think it's maybe possible to use exact kmer matching or a kmer frequency histogram between the alleles of the compatible SV and the region of the read containing compatible kmers as well, but I haven't really worked that out yet, and pinning down the "start" of the read region could be difficult.

Edit 2: I think in this indexing scheme we are only dependent on the number of kmers in each SV, not the total number in the reference. It might still be useful to filter very common kmers.

…empty-med argmed: handle case where seq was empty after filtering

edawson added the braindump label Oct 12, 2020

brentp added a commit to brentp/nibsv that referenced this issue Oct 14, 2021

Merge pull request collaborativebioinformatics#9 from fritzsedlazeck/…

b1b9e06

…empty-med argmed: handle case where seq was empty after filtering

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[indexing] Position Aware SV Equivalence Classes #9

[indexing] Position Aware SV Equivalence Classes #9

edawson commented Oct 12, 2020 •

edited

Loading

[indexing] Position Aware SV Equivalence Classes #9

[indexing] Position Aware SV Equivalence Classes #9

Comments

edawson commented Oct 12, 2020 • edited Loading

edawson commented Oct 12, 2020 •

edited

Loading