Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[indexing] Position Aware SV Equivalence Classes #9

Open
edawson opened this issue Oct 12, 2020 · 0 comments
Open

[indexing] Position Aware SV Equivalence Classes #9

edawson opened this issue Oct 12, 2020 · 0 comments

Comments

@edawson
Copy link
Collaborator

edawson commented Oct 12, 2020

Using #8 , I'm imagining one possible indexing/query scheme below (loosely based on Kallisto's equivalence classes):

Maintain two hash tables for each SV:
SV -> set(allele sequences)
SV -> count(supporting reads)

For each PositionedSequence, generate the kmers of the sequence, tracking the global position (i.e., chrom and position) for each kmer.

Insert each kmer into the following hash tables:
kmer -> set(SV)
kmer -> set(position)
kmer -> frequency(kmer)

Place the position of each kmer into an IntervalTree, or a hash table of floor(variant.position % 10 kilobases):
position (+/- slop) -> set(SV)

This provides a query index for all SVs in the VCF. The number of kmers grows with the number/size/complexity of the input SVs.

Now iterate over a set of query reads. For every read, maintain:
a set of compatible positions
a counting set of compatible SVs

For each kmer in a given read:

  1. Retrieve the frequency of the kmer; if it is less than a set threshold, proceed, otherwise continue to the next kmer.
  2. Retrieve the kmer's set of compatible SVs. Add these to the read's counting set.
  3. Retrieve the kmer's set of compatible positions. Add these to the read's position set.

Each read now has a set of compatible SVs / positions based on frequency-filtered kmers. One could stop here and proceed (as kallisto does) and try to estimate the most probable SV.

However, there are probably still many erroneous SV matches. Further filter the compatible SVs as follows:

  1. Filter the compatible SVs by frequency. In the simplest form, take the max of the counting set as the matched SV. Alternatively, trim any SVs in the compatible set which have fewer than some number of counts (essentially a filter on the minimum number of matched kmers to that SV).
  2. Filter the compatible SVs based on position. Compute the number of overlapping positions, allowing slop, for each position in the compatible set. Remove any positions which do not overlap with at least one other position. Next, generate a set of compatible SVs from these positions. Compute the intersection of this set and the compatible SV set of the read. This step will likely fail completely when SVs overlap significantly.

If a read's compatible SV set contains only a single SV, or an SV in its compatible set reaches a given threshold of supporting counts, Increment the read support counter for the SV to reflect that the read supports that SV.

Edit 1: I think it's maybe possible to use exact kmer matching or a kmer frequency histogram between the alleles of the compatible SV and the region of the read containing compatible kmers as well, but I haven't really worked that out yet, and pinning down the "start" of the read region could be difficult.

Edit 2: I think in this indexing scheme we are only dependent on the number of kmers in each SV, not the total number in the reference. It might still be useful to filter very common kmers.

brentp added a commit to brentp/nibsv that referenced this issue Oct 14, 2021
…empty-med

argmed: handle case where seq was empty after filtering
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant