Skip to content

Good dataset analysis

Ryan Wick edited this page Jul 13, 2020 · 14 revisions

Trycycler cluster

To begin, I ran Trycycler cluster on the assemblies:

trycycler cluster --reads reads.fastq.gz --assemblies assemblies/*.fasta --out_dir trycycler

Which produced this output. Four clusters were made, the first of which seems to be the the chromosome (10 contigs, one from each input assembly) and the second cluster (six contigs) seems to be the plasmid. Clusters 3 and 4 are singletons with only one contig each, both from assembly B.

Looking at the tree helps to see what happened here:

Good dataset tree

Clusters 3 and 4 are both near cluster 2, indicating that they share a lot of k-mers with cluster 2. I would therefore conclude that they are fragmented/incomplete versions of the plasmid, i.e. assembly B botched the plasmid sequence.

I would like to keep clusters 1 and 2 but not 3 and 4, so I rename the bad ones to get them out of the way:

mv trycycler/cluster_003 trycycler/bad_cluster_003
mv trycycler/cluster_004 trycycler/bad_cluster_004

Prefixing the clusters with bad_ will allow me to later glob for all good clusters with trycycler/cluster_*. Alternatively, you could just delete the bad cluster directories.

Trycycler reconcile – cluster 1

I then ran Trycycler reconcile on the first (chromosomal) cluster:

trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_001

Which produced this output. The initial check and circularisation steps did well, but there was a problem with the final alignments. Contig F_utg000001c had a markedly lower identity to the other contigs, indicating that something is terribly wrong in its sequence. I can also see that the circularised version of F_utg000001c is about 25 kbp longer than the other sequences, so perhaps it contains some sort of erroneous duplication.

I then removed this bad sequence:

mv trycycler/cluster_001/1_contigs/F_utg000001c.fasta trycycler/cluster_001/1_contigs/F_utg000001c.fasta.bad

Adding .bad to the end of the sequence name excludes it, as Trycycler grabs all *.fasta files in the 1_contigs directory. Alternatively, you could just delete the file.

I then ran Trycycler reconcile again:

trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_001

Which produced this output. This time it completed without error and everything looks good!

Trycycler reconcile – cluster 2

I then ran Trycycler reconcile on the second (plasmid) cluster:

trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_002

Which produced this output. The command stopped with an error almost immediately because one of the sequences (D_contig_2) is much longer than the rest – about twice as long. This makes me suspect a common mistake that long-read assemblers make with small plasmids: doubling the sequence in a single contig.

At this point we could just remove D_contig_2.fasta (by renaming it to end with .bad like we did in cluster 1), but since we only have six sequences in this cluster, I'm going to try to repair the sequence instead.

To do so, I open the sequence in a text editor and randomly select a piece of sequence near the start (avoiding the region very close the start of the contig, as sometimes contig sequences are less reliable near their start/end). I ended up using the 30-mer CCAAACTCTACAAAAGAGTTTCAATCGATC but plenty of other sequences would of course work too. Searching for this sequence reveals two hits: one near the start (where I originally got the sequence) and another in the second half of the contig. The spacing between these two is ~7.5 kbp, which happens to be the length of the other contigs in cluster 2. This all supports my suspicion that D_contig_2 is a doubled version of the plasmid. I therefore fix the sequence by deleting everything before the first hit (1476 bp) and everything after the second hit (including the second hit itself, 5994 bp). This resulted in a fixed D_contig_2 sequence of 7471 bp.

I then tried running Trycycler reconcile again:

trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_002

Which produced this output, and this time it finished!

You can see in the output that during the circularisation step, D_contig_2 was deemed already circular – consistent with our manual intervention for that sequence. This cluster's final alignment check looks okay. Two sequences (G_utg000002c and I_utg000002c) are a bit messier than the rest, but they aren't too bad so I'll leave them in. Whether I leave or remove slightly-worse-than-the-rest contigs depends on how many sequences I have in the cluster. E.g. if there were 12 sequences in the cluster and two looked a bit messy, I would exclude them (and then re-run Trycycler reconcile) because 10 sequences is still plenty. But in this case we only have six sequences in the cluster and removing them would take us down to only four, so I'm inclined to leave them in.

Trycycler MSA, partition and consensus

The hard work is now behind us! Time to run Trycycler MSA on each cluster:

trycycler msa --cluster_dir trycycler/cluster_001
trycycler msa --cluster_dir trycycler/cluster_002

(cluster 1 MSA output and cluster 2 MSA output)

In case you're curious, you can peak inside the MSA for cluster 2 and see why those two contigs (G_utg000002c and I_utg000002c) had worse alignments: they each contain a modest insertion. I_utg000002c's insertion is ~1150 bp into the alignment and G_utg000002c's insertion is ~6400 bp in.

Then Trycycler partition:

trycycler partition --reads reads.fastq.gz --cluster_dirs trycycler/cluster_*

(partition output)

Then Trycycler consensus on each cluster:

trycycler consensus --cluster_dir trycycler/cluster_001
trycycler consensus --cluster_dir trycycler/cluster_002

(cluster 1 consensus output and cluster 2 consensus output)

All of these steps ran without any problem!

I then concatenated each cluster's consensus into a single FASTA for the genome:

cat trycycler/cluster_*/7_final_consensus.fasta > assembly.fasta

Final thoughts

This example involved a bit of manual intervention and judgement, but we were still able to get a clean assembly. Comparing the resulting consensus assembly back to the reference reveals plenty of homopolymer-length errors (created by systematic errors in the input reads) but other than that, the assembly looks great!

Clone this wiki locally