Outline of discussion at team meeting this morning regarding design / implementation / work distribution
The brief description:
1. Load each SV into a set of indexes.
2. Assign each read to one (or a few) SVs.
3. Tally the number of reads that support each SV and output a call (either presence/absence or genotype) for each SV.
Detailed description:
1. Generate ComposedSequences from input SVs
a. Attach flanking sequence from reference genome to "alt" SV allele (Alt allele for INS, empty seq for DEL)
b. Create a PositionedSequence structure, which contains the chrom / pos and ComposedSeq
c. Return a sequence of PositionedSequence structures
d. (NEEDED) return the hash/identifier of the SV
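A minimal sketch of this step, assuming the reference is already loaded as an in-memory dict of chromosome name to sequence. The field names (`ref_len`, `sv_id`), the 0-based coordinates, and the `flank` parameter are illustrative assumptions, not decisions from the meeting.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator


@dataclass(frozen=True)
class SV:
    chrom: str
    pos: int      # 0-based start of the event on the reference (assumption)
    ref_len: int  # reference bases removed by the event (0 for a pure INS)
    alt: str      # alt allele: inserted sequence for INS, "" for DEL
    sv_id: str    # hash/identifier of the SV (step d)


@dataclass(frozen=True)
class PositionedSequence:
    chrom: str
    pos: int            # reference coordinate where the composed sequence starts
    composed_seq: str   # left flank + alt allele + right flank
    sv_id: str


def compose_sequences(svs: Iterable[SV],
                      reference: Dict[str, str],
                      flank: int) -> Iterator[PositionedSequence]:
    """Attach reference flanks to each SV's alt allele."""
    for sv in svs:
        chrom_seq = reference[sv.chrom]
        left_start = max(0, sv.pos - flank)
        left = chrom_seq[left_start:sv.pos]
        right_start = sv.pos + sv.ref_len
        right = chrom_seq[right_start:right_start + flank]
        yield PositionedSequence(sv.chrom, left_start, left + sv.alt + right, sv.sv_id)
```

Carrying `sv_id` inside the returned structure is one way to satisfy step d; returning it separately would work just as well.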
2. Kmerize the ComposedSequences
a. Write a function kmerize_composed_sequence(PositionedSequence) -> seq(kmers)
b. (stretchgoal) Write a function kmerize_composed_sequence_with_position(PositionedSequence) -> tuple(seq(kmers), seq(position))
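The two functions could look roughly like the sketch below. The explicit `k` parameter and the `PositionedSequence` fields are assumptions; the notes do not fix a kmer size or say whether kmers should be canonicalized (reverse-complement aware), so this version takes forward-strand kmers only. The reported positions are the start coordinate plus the offset into the composed sequence, which only matches true reference coordinates up to the breakpoint.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PositionedSequence:
    chrom: str
    pos: int
    composed_seq: str
    sv_id: str


def kmerize_composed_sequence(ps: PositionedSequence, k: int) -> List[str]:
    """Return every overlapping kmer of length k, in order."""
    seq = ps.composed_seq.upper()
    # range() is empty when the sequence is shorter than k, so the result is [].
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmerize_composed_sequence_with_position(
        ps: PositionedSequence, k: int) -> Tuple[List[str], List[int]]:
    """Stretch-goal variant: also return a start position for each kmer."""
    kmers = kmerize_composed_sequence(ps, k)
    positions = [ps.pos + offset for offset in range(len(kmers))]
    return kmers, positions
```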
3. Create an SVIndex, which contains information for SVs / SV kmers and allows fast lookups.
a. Create data structures:
- CountTable [kmer] -> count
- hash_table [kmer] -> seq(SV)
- (stretchgoal) (genotyping) hash_table [SV] -> vcf::Variant
- (stretchgoal) (genotyping) hash_table[SV] -> seq(Alleles)
- (stretchgoal) hash_table[Position] -> seq(SV)
b. Load kmers from Step 2 into index.
For each kmer in SV:
- CountTable[kmer] ++
- hash_table[kmer].add(SV)
c. (stretchgoal) (genotyping) Load SV alleles into hash_tables.
d. Serialize these tables to disk.
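As a rough illustration of the two required tables and the load loop, here is a sketch built on `collections.Counter` and a dict of sets. Using a set for the kmer-to-SV mapping and `pickle` for serialization are assumptions made for brevity; a production index would likely use a more compact on-disk format.

```python
import pickle
from collections import Counter, defaultdict
from typing import Iterable


class SVIndex:
    def __init__(self) -> None:
        self.count_table: Counter = Counter()             # [kmer] -> count
        self.kmer_to_svs: defaultdict = defaultdict(set)  # [kmer] -> SV ids

    def add_sv(self, sv_id: str, kmers: Iterable[str]) -> None:
        """Load one SV's kmers into both tables."""
        for kmer in kmers:
            self.count_table[kmer] += 1
            self.kmer_to_svs[kmer].add(sv_id)

    def save(self, path: str) -> None:
        """Serialize both tables to disk as plain dicts."""
        payload = {
            "counts": dict(self.count_table),
            "kmer_to_svs": {k: sorted(v) for k, v in self.kmer_to_svs.items()},
        }
        with open(path, "wb") as handle:
            pickle.dump(payload, handle)

    @classmethod
    def load(cls, path: str) -> "SVIndex":
        with open(path, "rb") as handle:
            payload = pickle.load(handle)
        index = cls()
        index.count_table.update(payload["counts"])
        for kmer, sv_ids in payload["kmer_to_svs"].items():
            index.kmer_to_svs[kmer].update(sv_ids)
        return index
```

Note that `count_table[kmer]` on a `Counter` returns 0 for unseen kmers without inserting them, which matches the "not in an SV returns 0" behaviour the later steps rely on.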
4. Remove reference kmers
a. Stream over the reference genome. For each kmer:
get the count of that kmer in the SVIndex
If the kmer's count is > 0, it has been observed in an SV. Mask such kmers (SV kmers that also occur in the reference) by setting their count to INTMAX.
b. Verify that the following conditions are met:
Querying the SVIndex::CountTable for a kmer that is in an SV but NOT in the ref returns a value v (0 < v < INTMAX)
Querying the SVIndex::CountTable for a kmer that is in an SV AND in the ref returns a value of INTMAX
Querying the SVIndex::CountTable for a kmer that is not in an SV returns a value of 0.
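The masking pass and the three checks above can be sketched as follows. The count table is modelled as a plain dict, the reference as an iterable of sequences, and `INTMAX` as the 32-bit signed maximum; all three are assumptions, as is the helper name `verify_masking`.

```python
from typing import Dict, Iterable

INTMAX = 2**31 - 1  # assumed sentinel value; the notes only say "INTMAX"


def mask_reference_kmers(count_table: Dict[str, int],
                         reference_seqs: Iterable[str],
                         k: int) -> int:
    """Set the count of every SV kmer seen in the reference to INTMAX.

    Returns the number of distinct kmers that were masked.
    """
    masked = 0
    for seq in reference_seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            count = count_table.get(kmer, 0)  # .get() avoids inserting new keys
            if 0 < count < INTMAX:
                count_table[kmer] = INTMAX
                masked += 1
    return masked


def verify_masking(count_table: Dict[str, int],
                   sv_only: Iterable[str],
                   sv_and_ref: Iterable[str],
                   non_sv: Iterable[str]) -> bool:
    """Check the three conditions listed in step b."""
    return (all(0 < count_table.get(kmer, 0) < INTMAX for kmer in sv_only)
            and all(count_table.get(kmer, 0) == INTMAX for kmer in sv_and_ref)
            and all(count_table.get(kmer, 0) == 0 for kmer in non_sv))
```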
5. Classify each read to 0, 1, or multiple SVs.
a. Implement a Read data structure that can track compatible SVs for that read.
e.g., CountTable[SV] -> count
b. For each kmer in a read:
i. Query the SVIndex::KmerCountTable (the [kmer] -> count table). If the kmer's count v is 0, INTMAX, or v > threshold_v, ignore the kmer (i.e., continue the loop)
ii. Query the SVIndex::KmerSVTable (the [kmer] -> seq(SV) table). For each SV in the returned seq(SV), increment the read's [SV] -> count
c. Create a function that returns the classification of the read (e.g., classify() -> seq(SV) or classify() -> SV)
Some proposed classification strategies include taking the SV with the maximum count among the read's compatible SVs, or keeping only SVs whose count exceeds some minimum.
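One possible shape for the Read structure is sketched below. It takes the index tables as plain dicts, and its `classify()` combines the two proposed strategies (return the top-scoring SVs, but only if they reach `min_count`); that combination, the `tally` method name, and the `INTMAX` value are assumptions rather than agreed design.

```python
from collections import Counter
from typing import Dict, Iterable, List

INTMAX = 2**31 - 1  # assumed sentinel for masked (reference) kmers


class Read:
    def __init__(self, sequence: str) -> None:
        self.sequence = sequence.upper()
        self.sv_counts: Counter = Counter()  # [SV] -> number of supporting kmers

    def tally(self,
              kmer_counts: Dict[str, int],
              kmer_to_svs: Dict[str, Iterable[str]],
              k: int,
              threshold_v: int) -> None:
        """Count, per SV, how many informative kmers of this read hit it."""
        for i in range(len(self.sequence) - k + 1):
            kmer = self.sequence[i:i + k]
            v = kmer_counts.get(kmer, 0)
            # Skip kmers that are absent, masked, or too widely shared.
            if v == 0 or v == INTMAX or v > threshold_v:
                continue
            for sv_id in kmer_to_svs.get(kmer, ()):
                self.sv_counts[sv_id] += 1

    def classify(self, min_count: int = 1) -> List[str]:
        """Return the SV(s) with the highest count, if it reaches min_count."""
        if not self.sv_counts:
            return []
        best = max(self.sv_counts.values())
        if best < min_count:
            return []
        return sorted(sv for sv, c in self.sv_counts.items() if c == best)
```

Returning a list keeps both `classify() -> seq(SV)` and `classify() -> SV` within reach: a caller that wants a single SV can require the list to have length one.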
6. Track read support per sample
a. Implement a CountTable from SV -> readcounts
b. For each read, increment the ReadSupportCountTable[SV] for each of the SVs returned by classify().
c. Output presence / absence of a given SV based on read supports.
- e.g., Read in the SV VCF. For each SV, look up its count in the ReadSupportCountTable. Annotate the VCF variant's INFO field with a flag or value.
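To make the tally and annotation concrete, here is a sketch that works on raw tab-separated VCF lines rather than a VCF library. It assumes the SV identifier lives in the VCF ID column; the INFO keys `READSUPPORT` and `SVPRESENT` and the `min_reads` cutoff are hypothetical names chosen for illustration (a real output would also need matching `##INFO` header lines).

```python
from collections import Counter
from typing import Dict, Iterable


def tally_read_support(classifications: Iterable[Iterable[str]]) -> Counter:
    """Build [SV] -> read count from each read's classify() result."""
    support: Counter = Counter()
    for sv_ids in classifications:
        for sv_id in sv_ids:
            support[sv_id] += 1
    return support


def annotate_vcf_line(line: str, support: Dict[str, int], min_reads: int = 1) -> str:
    """Append read-support annotations to the INFO column of one VCF line."""
    line = line.rstrip("\n")
    if line.startswith("#"):
        return line  # header lines pass through unchanged
    fields = line.split("\t")
    n_reads = support.get(fields[2], 0)   # column 3 is ID
    tags = [f"READSUPPORT={n_reads}"]     # hypothetical INFO key
    if n_reads >= min_reads:
        tags.append("SVPRESENT")          # hypothetical presence flag
    existing = [] if fields[7] in (".", "") else [fields[7]]  # column 8 is INFO
    fields[7] = ";".join(existing + tags)
    return "\t".join(fields)
```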
7. (stretchgoal) Implement genotyping
a. Rather than track CountTable (SV) -> readcounts, track CountTable (Allele) -> readcounts. This will require adding structures to track kmers from specific alleles.
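A possible starting point for the allele-level table is sketched below: counts are keyed by `(SV, allele)` instead of by SV alone. The genotype call from the alt-read fraction, including the `het_low` / `het_high` cutoffs, is a naive placeholder and was not discussed at the meeting.

```python
from collections import Counter
from typing import Iterable, Tuple


def tally_allele_support(assignments: Iterable[Tuple[str, str]]) -> Counter:
    """Build [(SV, allele)] -> read count; allele is "ref" or "alt" here."""
    support: Counter = Counter()
    for sv_id, allele in assignments:
        support[(sv_id, allele)] += 1
    return support


def call_genotype(support: Counter, sv_id: str,
                  het_low: float = 0.2, het_high: float = 0.8) -> str:
    """Naive genotype from the fraction of reads supporting the alt allele."""
    ref_reads = support[(sv_id, "ref")]
    alt_reads = support[(sv_id, "alt")]
    total = ref_reads + alt_reads
    if total == 0:
        return "./."  # no informative reads
    alt_fraction = alt_reads / total
    if alt_fraction < het_low:
        return "0/0"
    if alt_fraction > het_high:
        return "1/1"
    return "0/1"
```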