-
Notifications
You must be signed in to change notification settings - Fork 28
How variants are chosen for the consensus sequence
This page goes into more detail on how Trycycler produces a consensus sequence. Specifically, when faced with multiple different variants of a sequence, how does it choose which one is best?
Take this hypothetical MSA as an input to Trycycler consensus:
GGAGGAGCTTTT-CGCCGCAGTCAACGAA-TAGCGTCTGAAAACGTGTATCATGGGTCTTGCCTCGAAAAACACT
GGAGGAGCTTTTTCGCCGCAGTCAAC--A-TAGCGTCTGAAAACGTGTATCATAAATCTTGCCTCGAAAACCACT
GGAGGAGCTTTTTCGCCGCAGTCAAC--ATTAGCGTCTGAAAACGTGTATCATAAATCTTGCCTCGAAAAGCACT
GGAGGAGCTTTTTCGCCGCAGTCAAC--A-TAGCGTCTGAAAACGTGTATCATGGGTCTTGCCTCGAAAACCACT
GGAGGAGCTTTT-CGCCGCAGTCAAC--A-TAGCGTCTGAAAACGTGTATCATATATCTTGCCTCGAAAAGCACT
Trycycler first divides the MSA into 'same' and 'different' chunks:
GGAGGAGCTTTT - CGCCGCAGTCAAC GAA- TAGCGTCTGAAAACGTGTATCAT GGG TCTTGCCTCGAAAA A CACT
GGAGGAGCTTTT T CGCCGCAGTCAAC --A- TAGCGTCTGAAAACGTGTATCAT AAA TCTTGCCTCGAAAA C CACT
GGAGGAGCTTTT T CGCCGCAGTCAAC --AT TAGCGTCTGAAAACGTGTATCAT AAA TCTTGCCTCGAAAA G CACT
GGAGGAGCTTTT T CGCCGCAGTCAAC --A- TAGCGTCTGAAAACGTGTATCAT GGG TCTTGCCTCGAAAA C CACT
GGAGGAGCTTTT - CGCCGCAGTCAAC --A- TAGCGTCTGAAAACGTGTATCAT ATA TCTTGCCTCGAAAA G CACT
You can also think of it like a variation graph:
↗ - ↘ ↗ GAA- ↘ ↗ GGG ↘ ↗ A ↘
↗ T ↘ ↗ --A- ↘ ↗ AAA ↘ ↗ C ↘
GGAGGAGCTTTT → T → CGCCGCAGTCAAC → --AT → TAGCGTCTGAAAACGTGTATCAT → AAA → TCTTGCCTCGAAAA → G → CACT
↘ T ↗ ↘ --A- ↗ ↘ GGG ↗ ↘ C ↗
↘ - ↗ ↘ --A- ↗ ↘ ATA ↗ ↘ G ↗
A decision must now be made for each 'different' chunk: which variant should go in the consensus?
The first thing Trycycler uses to choose variants is minimum total Hamming distance to the other variants.
Let's work through the first point of variation, where there are five options (-
, T
, T
, T
, -
) and two unique options (-
, T
). For each unique option, we get the sum of its distances to each of the options. Since our sequences are already aligned (they came from an MSA), we can use Hamming distance which makes it easier.
Here are the values in a table, where the different options are in columns, the unique options are in rows, the values show Hamming distances, and the total Hamming distance for each unique option is in the rightmost column:
- |
T |
T |
T |
- |
total | |
---|---|---|---|---|---|---|
- |
0 | 1 | 1 | 1 | 0 | 3 |
T |
1 | 0 | 0 | 0 | 1 | 2 |
Since T
's total Hamming distance of 2 is lower than -
's total distance of 3, the preferred variant is T
.
You can see that for simple cases like this, choosing the variant with the minimum total Hamming distance is equivalent to choosing the most common variant.
Let's apply the same logic to the second point of variation which has three unique options:
GAA- |
--A- |
--AT |
--A- |
--A- |
total | |
---|---|---|---|---|---|---|
GAA- |
0 | 2 | 3 | 2 | 2 | 9 |
--A- |
2 | 0 | 1 | 0 | 0 | 3 |
--AT |
3 | 1 | 0 | 1 | 1 | 6 |
The minimum total Hamming distance is for the --A-
variant, which once again happens to be the most common option. So far so good!
Let's do it again on the third point of variation:
GGG |
AAA |
AAA |
GGG |
ATA |
total | |
---|---|---|---|---|---|---|
GGG |
0 | 3 | 3 | 0 | 3 | 9 |
AAA |
3 | 0 | 0 | 3 | 1 | 7 |
ATA |
3 | 1 | 1 | 3 | 0 | 8 |
For this variant, AAA
is the preferred option. Unlike the previous two cases, we couldn't come to the same conclusion by just choosing the most common option. Both AAA
and GGG
tie as most common (two instances), but minimum total Hamming distance favours AAA
because it's closer to the ATA
option.
Once more for the last variant:
A |
C |
G |
C |
G |
total | |
---|---|---|---|---|---|---|
A |
0 | 1 | 1 | 1 | 1 | 4 |
C |
1 | 0 | 1 | 0 | 1 | 3 |
G |
1 | 1 | 0 | 1 | 0 | 3 |
Uh oh – we've got a tie! The minimum total Hamming distance rules out A
, but both C
and G
are viable options.
In cases of a minimum total Hamming distance tie, Trycycler will align the reads to each possibility and choose the option with the highest total read alignment score. Read alignment scores are calculated using a simple scheme: match score = 1, mismatch penalty = 1, gap open/extension penalty = 1.
Using different chunk #4 as an example, we'll first align reads to the C
version of the sequence:
read 1: GAAAAGCA-T score= 6
read 2: CGAAAAGCAC score= 8
read 3: GCCTCG-AAACCACT score=13
read 4: TG-CTCGAAAAGCA score=10
------------------------------------------------------------------------
GGAGGAGCTTTTTCGCCGCAGTCAACATAGCGTCTGAAAACGTGTATCATAAATCTTGCCTCGAAAACCACT
↑
And then we'll align reads to the G
version of the sequence:
read 1: GAAAAGCA-T score= 8
read 2: CGAAAAGCAC score=10
read 3: GCCTCG-AAACCACT score=11
read 4: TG-CTCGAAAAGCA score=12
------------------------------------------------------------------------
GGAGGAGCTTTTTCGCCGCAGTCAACATAGCGTCTGAAAACGTGTATCATAAATCTTGCCTCGAAAAGCACT
↑
The C
version got a total read alignment score of 37 and the G
version got a total read alignment score of 41. So the G
variant is in better agreement with the reads and that's the one that Trycycler will choose for the consensus.
Trycycler's final consensus sequence is simply the sequence which includes the chosen option for each different chunk:
GGAGGAGCTTTTTCGCCGCAGTCAACATAGCGTCTGAAAACGTGTATCATAAATCTTGCCTCGAAAAGCACT
- Home
- Software requirements
- Installation
-
How to run Trycycler
- Quick start
- Step 1: Generating assemblies
- Step 2: Clustering contigs
- Step 3: Reconciling contigs
- Step 4: Multiple sequence alignment
- Step 5: Partitioning reads
- Step 6: Generating a consensus
- Step 7: Polishing after Trycycler
- Illustrated pipeline overview
- Demo datasets
- Implementation details
- FAQ and miscellaneous tips
- Other pages
- Guide to bacterial genome assembly (choose your own adventure)
- Accuracy vs depth