Using #8, I'm imagining one possible indexing/query scheme below (loosely based on kallisto's equivalence classes):
Maintain two hash tables keyed by SV:
SV -> set(allele sequences)
SV -> count(supporting reads)
For each PositionedSequence, generate the kmers of the sequence, tracking the global position (i.e., chrom and position) for each kmer.
Insert each kmer into the following hash tables:
kmer -> set(SV)
kmer -> set(position)
kmer -> frequency(kmer)
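A rough sketch of the three kmer tables above. It assumes each PositionedSequence can be treated as a hypothetical `(sv_id, chrom, start, sequence)` tuple, which may not match the actual class from #8; `kmers` and `build_kmer_index` are illustrative names, not existing functions.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield (offset, kmer) for every k-length window of seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_kmer_index(positioned_seqs, k):
    """Build the kmer -> SVs, kmer -> positions and kmer -> frequency tables.

    positioned_seqs is assumed to be an iterable of
    (sv_id, chrom, start, sequence) tuples (a hypothetical layout).
    """
    kmer_to_svs = defaultdict(set)
    kmer_to_positions = defaultdict(set)
    kmer_freq = Counter()
    for sv_id, chrom, start, seq in positioned_seqs:
        for offset, kmer in kmers(seq, k):
            kmer_to_svs[kmer].add(sv_id)
            # Global position = start of the sequence plus offset of the kmer.
            kmer_to_positions[kmer].add((chrom, start + offset))
            kmer_freq[kmer] += 1
    return kmer_to_svs, kmer_to_positions, kmer_freq
```

Here the frequency is counted only over the indexed allele sequences, not over the whole reference.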
Place the position of each kmer into an IntervalTree, or a hash table keyed by floor(variant.position / 10 kilobases):
position (+/- slop) -> set(SV)
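The hash-table variant of the position index might look like the sketch below. It reads the binning as integer division of the position by the bin size, so that nearby positions share a key, and it handles slop by registering an SV in every bin its padded interval touches. Both choices are assumptions, and `add_position` / `svs_near` are hypothetical names.

```python
from collections import defaultdict

BIN_SIZE = 10_000  # 10 kb bins, as suggested in the text


def add_position(index, chrom, pos, sv_id, slop=0, bin_size=BIN_SIZE):
    """Register sv_id in every bin overlapped by [pos - slop, pos + slop]."""
    lo = max(0, pos - slop) // bin_size
    hi = (pos + slop) // bin_size
    for b in range(lo, hi + 1):
        index[(chrom, b)].add(sv_id)


def svs_near(index, chrom, pos, bin_size=BIN_SIZE):
    """Return the SVs registered in the bin that contains pos."""
    return index.get((chrom, pos // bin_size), set())


index = defaultdict(set)
add_position(index, "chr1", 9_950, "sv1", slop=100)  # spans bins 0 and 1
```

A lookup is a single dictionary access, at the cost of coarse resolution: any position in the same 10 kb bin matches. An interval tree would give exact overlap queries instead.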
This provides a query index for all SVs in the VCF. The number of kmers grows with the number/size/complexity of the input SVs.
Now iterate over a set of query reads. For every read, maintain:
a set of compatible positions
a counting set of compatible SVs
For each kmer in a given read:
Retrieve the frequency of the kmer; if it is below a set threshold, proceed; otherwise skip to the next kmer.
Retrieve the kmer's set of compatible SVs. Add these to the read's counting set.
Retrieve the kmer's set of compatible positions. Add these to the read's position set.
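The per-read loop above could be sketched as follows. The three tables are assumed to be plain dict-like mappings built during indexing, `max_freq` is a hypothetical name for the frequency threshold, and skipping kmers that are absent from the index is an assumption the text does not spell out.

```python
from collections import Counter


def query_read(read_seq, k, kmer_to_svs, kmer_to_positions, kmer_freq, max_freq):
    """Collect the compatible SV counts and positions for one read."""
    sv_counts = Counter()  # counting set of compatible SVs
    positions = set()      # set of compatible positions
    for i in range(len(read_seq) - k + 1):
        kmer = read_seq[i:i + k]
        freq = kmer_freq.get(kmer, 0)
        # Skip kmers not in the index and kmers at or above the threshold.
        if freq == 0 or freq >= max_freq:
            continue
        sv_counts.update(kmer_to_svs.get(kmer, ()))
        positions.update(kmer_to_positions.get(kmer, ()))
    return sv_counts, positions
```

Each entry of `sv_counts` is then the number of retained read kmers that matched that SV.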
Each read now has a set of compatible SVs / positions based on frequency-filtered kmers. One could stop here and, as kallisto does, try to estimate the most probable SV.
However, there are probably still many erroneous SV matches. Further filter the compatible SVs as follows:
Filter the compatible SVs by frequency. In the simplest form, take the max of the counting set as the matched SV. Alternatively, trim any SVs in the compatible set which have fewer than some number of counts (essentially a filter on the minimum number of matched kmers to that SV).
Filter the compatible SVs based on position. Compute the number of overlapping positions, allowing slop, for each position in the compatible set. Remove any positions which do not overlap with at least one other position. Next, generate a set of compatible SVs from these positions. Compute the intersection of this set and the compatible SV set of the read. This step will likely fail completely when SVs overlap significantly.
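A minimal sketch of the two filters, under some simplifying assumptions: positions are `(chrom, pos)` tuples, "overlap" means two positions on the same chromosome lie within `slop` of each other, and `position_to_svs` is a hypothetical exact-position lookup standing in for the slop-aware index. All function names are illustrative.

```python
def filter_by_count(sv_counts, min_kmers):
    """Keep SVs matched by at least min_kmers kmers of the read."""
    return {sv for sv, n in sv_counts.items() if n >= min_kmers}


def filter_by_position(positions, slop):
    """Keep positions within slop of at least one other position."""
    by_chrom = {}
    for chrom, pos in positions:
        by_chrom.setdefault(chrom, []).append(pos)
    kept = set()
    for chrom, plist in by_chrom.items():
        plist.sort()
        for i, p in enumerate(plist):
            # In sorted order, the nearest other position is a neighbour.
            near_prev = i > 0 and p - plist[i - 1] <= slop
            near_next = i + 1 < len(plist) and plist[i + 1] - p <= slop
            if near_prev or near_next:
                kept.add((chrom, p))
    return kept


def svs_from_positions(kept_positions, position_to_svs, compatible_svs):
    """Intersect SVs implied by the kept positions with the read's SV set."""
    implied = set()
    for p in kept_positions:
        implied |= position_to_svs.get(p, set())
    return implied & compatible_svs
```

The simplest form of the count filter, taking the max of the counting set, would be `max(sv_counts, key=sv_counts.get)` for a non-empty `sv_counts`.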
If a read's compatible SV set contains only a single SV, or an SV in its compatible set reaches a given threshold of supporting counts, increment the read support counter for the SV to reflect that the read supports that SV.
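The final assignment step might be sketched like this. The text does not say what to do when several SVs reach the threshold; this sketch credits every one of them, which is an assumption. `record_support` and `min_count` are hypothetical names.

```python
from collections import Counter


def record_support(compatible_svs, sv_counts, min_count, support):
    """Increment support for the SV(s) this read is taken to support.

    Returns the set of SVs that were credited (possibly empty).
    """
    if len(compatible_svs) == 1:
        # A unique compatible SV is accepted regardless of its count.
        supported = set(compatible_svs)
    else:
        # Assumption: every SV reaching the threshold is credited.
        supported = {sv for sv in compatible_svs
                     if sv_counts.get(sv, 0) >= min_count}
    for sv in supported:
        support[sv] += 1
    return supported


support = Counter()  # SV -> count(supporting reads)
record_support({"sv1", "sv2"}, {"sv1": 4, "sv2": 1}, 3, support)
```

Reads with no SV passing either condition leave the support table unchanged.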
Edit 1: I think it may also be possible to use exact kmer matching, or a kmer frequency histogram, between the alleles of the compatible SV and the region of the read containing compatible kmers, but I haven't really worked that out yet, and pinning down the "start" of the read region could be difficult.
Edit 2: I think in this indexing scheme we depend only on the number of kmers in each SV, not the total number in the reference. It might still be useful to filter very common kmers.