Skip to content

How variants are chosen for the consensus sequence

Ryan Wick edited this page Jan 5, 2021 · 15 revisions

This page goes into more detail on how Trycycler produces a consensus sequence. Specifically, when faced with multiple different variants of a sequence, how does it choose which one is best?

Breaking the sequence into chunks

Take this hypothetical MSA as an input to Trycycler consensus:

GGAGGAGCTTTT-CGCCGCAGTCAACGAA-TAGCGTCTGAAAACGTGTATCATGGGTCTTGCCTCGAAAAACACT
GGAGGAGCTTTTTCGCCGCAGTCAAC--A-TAGCGTCTGAAAACGTGTATCATAAATCTTGCCTCGAAAACCACT
GGAGGAGCTTTTTCGCCGCAGTCAAC--ATTAGCGTCTGAAAACGTGTATCATAAATCTTGCCTCGAAAAGCACT
GGAGGAGCTTTTTCGCCGCAGTCAAC--A-TAGCGTCTGAAAACGTGTATCATGGGTCTTGCCTCGAAAACCACT
GGAGGAGCTTTT-CGCCGCAGTCAAC--A-TAGCGTCTGAAAACGTGTATCATATATCTTGCCTCGAAAAGCACT

Trycycler first divides the MSA into 'same' and 'different' chunks:

GGAGGAGCTTTT   -   CGCCGCAGTCAAC   GAA-   TAGCGTCTGAAAACGTGTATCAT   GGG   TCTTGCCTCGAAAA   A   CACT
GGAGGAGCTTTT   T   CGCCGCAGTCAAC   --A-   TAGCGTCTGAAAACGTGTATCAT   AAA   TCTTGCCTCGAAAA   C   CACT
GGAGGAGCTTTT   T   CGCCGCAGTCAAC   --AT   TAGCGTCTGAAAACGTGTATCAT   AAA   TCTTGCCTCGAAAA   G   CACT
GGAGGAGCTTTT   T   CGCCGCAGTCAAC   --A-   TAGCGTCTGAAAACGTGTATCAT   GGG   TCTTGCCTCGAAAA   C   CACT
GGAGGAGCTTTT   -   CGCCGCAGTCAAC   --A-   TAGCGTCTGAAAACGTGTATCAT   ATA   TCTTGCCTCGAAAA   G   CACT

You can also think of it like a variation graph:

             ↗ - ↘               ↗ GAA- ↘                         ↗ GGG ↘                ↗ A ↘     
             ↗ T ↘               ↗ --A- ↘                         ↗ AAA ↘                ↗ C ↘     
GGAGGAGCTTTT → T → CGCCGCAGTCAAC → --AT → TAGCGTCTGAAAACGTGTATCAT → AAA → TCTTGCCTCGAAAA → G → CACT
             ↘ T ↗               ↘ --A- ↗                         ↘ GGG ↗                ↘ C ↗     
             ↘ - ↗               ↘ --A- ↗                         ↘ ATA ↗                ↘ G ↗     

A decision must now be made for each 'different' chunk: which variant should go in the consensus?

Minimum total Hamming distance

The first thing Trycycler uses to choose variants is minimum total Hamming distance to the other variants.

Different chunk #1

Let's work through the first point of variation, where there are five options (-, T, T, T, -) and two unique options (-, T). For each unique option, we get the sum of its distances to each of the options. Since our sequences are already aligned (they came from an MSA), we can use Hamming distance which makes it easier.

Here are the values in a table, where the different options are in columns, the unique options are in rows, the values show Hamming distances, and the total Hamming distance for each unique option is in the rightmost column:

- T T T - total
- 0 1 1 1 0 3
T 1 0 0 0 1 2

Since T's total Hamming distance of 2 is lower than -'s total distance of 3, the preferred variant is T.

You can see that for simple cases like this, choosing the variant with the minimum total Hamming distance is equivalent to choosing the most common variant.

Different chunk #2

Let's apply the same logic to the second point of variation which has three unique options:

GAA- --A- --AT --A- --A- total
GAA- 0 2 3 2 2 9
--A- 2 0 1 0 0 3
--AT 3 1 0 1 1 6

The minimum total Hamming distance is for the --A- variant, which once again happens to be the most common option. So far so good!

Different chunk #3

Let's do it again on the third point of variation:

GGG AAA AAA GGG ATA total
GGG 0 3 3 0 3 9
AAA 3 0 0 3 1 7
ATA 3 1 1 3 0 8

For this variant, AAA is the preferred option. Unlike the previous two cases, we couldn't come to the same conclusion by just choosing the most common option. Both AAA and GGG tie as most common (two instances), but minimum total Hamming distance favours AAA because it's closer to the ATA option.

Different chunk #4

Once more for the last variant:

A C G C G total
A 0 1 1 1 1 4
C 1 0 1 0 1 3
G 1 1 0 1 0 3

Uh oh – we've got a tie! The minimum total Hamming distance rules out A, but both C and G are viable options.

Read alignment

In cases of a minimum total Hamming distance tie, Trycycler will align the reads to each possibility and choose the option with the highest total read alignment score. Read alignment scores are calculated using a simple scheme: match score = 1, mismatch penalty = 1, gap open/extension penalty = 1.

Using different chunk #4 as an example, we'll first align reads to the C version of the sequence:

                                                read 1:       GAAAAGCA-T  score= 6
                                                read 2:      CGAAAAGCAC   score= 8
                                                read 3:  GCCTCG-AAACCACT  score=13
                                                read 4: TG-CTCGAAAAGCA    score=10
------------------------------------------------------------------------
GGAGGAGCTTTTTCGCCGCAGTCAACATAGCGTCTGAAAACGTGTATCATAAATCTTGCCTCGAAAACCACT
                                                                   ↑

And then we'll align reads to the G version of the sequence:

                                                read 1:       GAAAAGCA-T  score= 8
                                                read 2:      CGAAAAGCAC   score=10
                                                read 3:  GCCTCG-AAACCACT  score=11
                                                read 4: TG-CTCGAAAAGCA    score=12
------------------------------------------------------------------------
GGAGGAGCTTTTTCGCCGCAGTCAACATAGCGTCTGAAAACGTGTATCATAAATCTTGCCTCGAAAAGCACT
                                                                   ↑

The C version got a total read alignment score of 37 and the G version got a total read alignment score of 41. So the G variant is in better agreement with the reads and that's the one that Trycycler will choose for the consensus.

Final consensus

Trycycler's final consensus sequence is simply the sequence which includes the chosen option for each different chunk:

GGAGGAGCTTTTTCGCCGCAGTCAACATAGCGTCTGAAAACGTGTATCATAAATCTTGCCTCGAAAAGCACT
Clone this wiki locally