Skip to content

Good dataset analysis

Ryan Wick edited this page Jul 6, 2020 · 14 revisions

Trycycler cluster

To begin, I ran Trycycler cluster on the assemblies:

trycycler cluster --reads reads.fastq.gz --assemblies assemblies/*.fasta --out_dir trycycler

Which produced this output.

Four clusters were made, the first of which seems to be the the chromosome (10 contigs, one from each input assembly) and the second cluster (six contigs) seems to be the plasmid. Clusters 3 and 4 are singletons with only one contig each, both from assembly B.

Looking at the tree helps to see what happened here. Clusters 3 and 4 are both near cluster 2, implying that they share a lot of k-mers in common with the plasmid. I would therefore conclude that they are fragmented/incomplete versions of the plasmid, i.e. assembly B botched the plasmid sequence.

I would like to keep clusters 1 and 2 but not 3 and 4, so I rename the bad ones to get them out of the way:

mv trycycler/cluster_003 trycycler/bad_cluster_003
mv trycycler/cluster_004 trycycler/bad_cluster_004

Prefixing the clusters with bad_ will allow me to later glob for all good clusters with trycycler/cluster_*. Alternatively, you could just delete the directories.

Trycycler reconcile – cluster 1

I then ran Trycycler reconcile on the first (chromosomal) cluster:

trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_001

Which produced this output.

The initial check and circularisation steps did well, but there was a problem with the final alignments. Contig F_utg000001c had a markedly lower identity to the other contigs, indicating that something is terribly wrong in its sequence. I can also see that the circularised version of F_utg000001c is about 25 kbp longer than the other sequences, so perhaps it contains some sort of erroneous duplication.

I then removed this bad sequence:

mv trycycler/cluster_001/1_contigs/F_utg000001c.fasta trycycler/cluster_001/1_contigs/F_utg000001c.fasta.bad

Adding .bad to the end of the sequence name excludes it, as Trycycler grabs all *.fasta files in the 1_contigs directory. Alternatively, you could just delete the file.

I then ran Trycycler reconcile again:

trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_001

Which produced this output. This time it completed without error and everything looks good!

Trycycler reconcile – cluster 2

I then ran Trycycler reconcile on the second (plasmid) cluster:

trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_002

Which produced this output.

The command stops with an error almost immediately because one of the sequences (D_contig_2) is much longer than the rest – about twice as long. This makes me suspect a common mistake that assemblers make with small plasmids: doubling the sequence in a single contig.

At this point we could just remove D_contig_2.fasta (by renaming it to end with .bad like we did in cluster 1), but since we only have six sequences in this cluster, I'm going to try to repair the sequence instead.

To do so, I open the sequence in a text editor and randomly select a piece of sequence near the start (avoiding the region very close the start of the contig, as sometimes contig sequences are less reliable near their start/end). I ended up using the 30-mer CCAAACTCTACAAAAGAGTTTCAATCGATC but plenty of other sequences would of course work too. Searching for this sequence reveals two hits: near the start where originally got the sequence and another hit in the second half of the contig. The spacing between these two is ~7.5 kbp, which happens to be the length of the other contigs in cluster 2. This all supports my suspicion that D_contig_2 is a doubled version of the plasmid. I therefore fix the sequence by deleting everything before the first hit (1476 bp) and everything after the second hit (including the second hit itself, 5994 bp). This resulted in a fixed D_contig_2 sequence of 7471 bp.

I then tried running Trycycler reconcile again:

trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_002

Which produced this output, and this time it finished!

You can see in the output that during the circularisation step, D_contig_2 was deemed already circular – consistent with our manual intervention for that sequence.

This cluster's final alignment check looks okay. One sequence (G_utg000002c) is a bit messier than the rest, but it's not too bad so I'll leave it in. Whether I leave or remove slightly-worse-than-the-rest contig depends on how many sequences I have in the cluster. E.g. if there were 12 sequences in the cluster and one looks a bit messy, I would exclude it (and then re-run Trycycler reconcile) because 11 sequences is still plenty. But in this case we only have six sequences in the cluster, so I'm more inclined to leave it in.

Clone this wiki locally