Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[workplan] SV index and read kmerizing #10

Open
edawson opened this issue Oct 12, 2020 · 0 comments
Open

[workplan] SV index and read kmerizing #10

edawson opened this issue Oct 12, 2020 · 0 comments
Labels
braindump documentation Improvements or additions to documentation

Comments

@edawson
Copy link
Collaborator

edawson commented Oct 12, 2020

Outline of discussion at team meeting this morning regarding design / implementation / work distribution

The brief description:

  1. Load each SV into a set of indexes.
  2. Assign each read to one (or few) SV(s).
  3. Tally the number of reads that support each SV and output a call (either presence/absence or genotype) for each SV.

Detailed description:

  1. Generate ComposedSequences from input SVs
    a. Attach flanking sequence from reference genome to "alt" SV allele (Alt allele for INS, empty seq for DEL)
    b. Create a PositionedSequence structure, which contains the chrom / pos and ComposedSeq
    c. Return a sequence of PositionedSequence structures
    d. (NEEDED) return the hash/identifier of the SV

  2. Kmerize the ComposedSequences
    a. Write a function kmerize_composed_sequence(PositionedSequence) -> seq(kmers)
    b. (stretchgoal) Write a function kmerize_composed_sequence_with_position(PositionedSequence) -> tuple(seq(kmers), seq(position))

  3. Create a SVIndex, which contains information for SVs / SV kmers and allows fast lookups.
    a. Create datastructures:
    - CountTable [kmer] -> count
    - hash_table [kmer] -> seq(SV)
    - (stretchgoal) (genotyping) hash_table [SV] -> vcf::Variant
    - (stretchgoal) (genotyping) hash_table[SV] -> seq(Alleles)
    - (stretchgoal) hash_table[Position] -> seq(SV)
    b. Load kmers from Step 2 into index.
    For each kmer in SV:
    - CountTable[kmer] ++
    - hash_table[kmer].add(SV)
    c. (stretchgoal) (genotyping) Load SV alleles into hash_tables.
    d. Serialize these tables to disk.

  4. Remove reference kmers
    a. Stream over the reference genome. For each kmer:

    • get the count of that kmer in the SVIndex
    • IFF the kmer's count is > 0, it has been observed in an SV. Mask SV kmers observed in the reference by setting the count to INTMAX.
      b. Verify that the following conditions are met:
      1. Querying the SVIndex::CountTable for a kmer that is in an SV but NOT in the ref returns a value v (0 < v < INTMAX)
      2. Querying the SVIndex::CountTable for a kmer that is in an SV AND in the ref returns a value of INTMAX
      3. Querying the SVIndex::CountTable for a kmer that is not in an SV returns a value of 0.
  5. Classify each read to 0, 1, or multiple SVs.
    a. Implement a Read datastructure that can track compatible SVs for that read.
    e.g., CountTable[SV] -> count
    b. For each kmer in a read:
    a. Query the SVIndex::KmerCountTable. If the kmer's frequency is 0, INTMAX, or V > threshold_V, ignore the kmer (i.e., continue the loop)
    b. Query the SVIndex::KmerSVTable. For each SV in the returned seq(SV), increment the read's [SV] -> count
    c. Create a function that returns the classification of the read (e.g., classify() -> seq(SV) or classify() -> SV)

    • Some proposed classification strategies include just taking the max of the read's compatible SV counts, or filtering counts above some min.
  6. Track read supports per-sample
    a. Implement a CountTable from SV -> readcounts
    b. For each read, increment the ReadSupportCountTable[SV] for each of the SVs returned by classify().
    c. Output presence / absence of a given SV based on read supports.
    - e.g., Read in the SV VCF. For each SV, look up its counts in the ReadSupportsCountTable. Annotate the VCF variant's INFO field with a flag or value.

  7. (stretchgoal) Implement genotyping
    a. Rather than track CountTable (SV) -> readcounts, track CountTable (Allele) -> readcounts. This will require adding structures to track kmers from specific alleles.

@edawson edawson added braindump documentation Improvements or additions to documentation labels Oct 12, 2020
edawson pushed a commit that referenced this issue Oct 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
braindump documentation Improvements or additions to documentation
Projects
None yet
Development

No branches or pull requests

1 participant