Skip to content

Clustering contigs

Ryan Wick edited this page Jan 11, 2021 · 50 revisions

Requirements

Before this step, you'll need multiple separate assemblies for your isolate (see Generating assemblies). I will assume they can be found here: assemblies/*.fasta.

You'll also need the long-read set for your isolate. I'd recommend using the entire long-read set here, perhaps with some light QC, but not one of the subsampled sets you used for assembly. However, if you have an enormous number of reads, some additional QC filtering might be prudent for the sake of speed performance. I'll assume the reads are in a file named reads.fastq.

Concept

The goal of this step is to cluster the contigs of your input assemblies into per-replicon groups. It also serves to exclude any spurious, incomplete or badly misassembled contigs.

For example, let's assume you're assembling a genome with one chromosome and two plasmids. Ideally, each of your input assemblies would have three contigs (one for each of the replicons), but that may not be the case. Perhaps one of the assemblies is missing a plasmid. Perhaps one of the assemblies has a spurious extra contig. Or perhaps one of the assemblies has accidentally assembled the two plasmids together into a single contig. This step will help you to identify three contig clusters (one per replicon) and exclude any problematic contigs.

If your contigs do not form clear clusters, that indicates that the input assemblies are inconsistent and unreliable. If you find yourself in this situation (struggling to identify which clusters are good and which are bad), then you probably need to get better long-read data (longer and/or deeper) and try again.

Running Trycycler cluster

trycycler cluster --assemblies assemblies/*.fasta --reads reads.fastq --out_dir trycycler

This command will carry out complete-linkage clustering of all contig sequences based on their Mash distance.

This step also assigns letters to each of the input assemblies. This is mainly for brevity, so a long filename like e_coli_sample_45_flye_assembly_1_read_subset_1.fasta can be subsequently referred to as A in Trycycler's output.

Most of Trycycler cluster's time is spent aligning reads to assemblies (to assign contig depths). Larger read sets will therefore be slower, and using more threads will make it faster. Assuming a typical-sized read set and a dozen or so threads, I'd expect it to finish in ~10 minutes or less.

Settings

In addition to the required arguments (input assemblies, a long-read set and an output directory), there are a few parameters you can tweak:

  • --min_contig_len: contigs shorter than this are thrown out on the assumption that they are either incomplete or spurious. The default value is 1000, as plasmids smaller than that are very rare. If you are assembling a genome with exceptionally small plasmids (e.g. Variovorax), you should reduce this value as appropriate.
  • --min_contig_depth: this controls how Trycycler filters out contigs with a low read depth. It is a multiple of the mean read depth for the assembly. For example, if an assembly has a mean depth of 90× and this setting is 0.1 (the default), then any contig with <9× depth will be removed.
  • --distance: this is the Mash distance threshold used when defining clusters, and the default threshold is 0.01. Smaller thresholds (e.g. 0.005) can result in a larger number of tighter clusters. Larger thresholds (e.g. 0.02) can result in a smaller number of looser clusters.
  • --threads: this is how many threads Trycycler will use for read alignment. It will only affect the speed performance, so you'll probably want to use as many threads as you have available.

Output

Trycycler cluster will create an output directory (trycycler in the example above) which contains the following:

  • contigs.phylip: a matrix of the Mash distances between all contigs in PHYLIP format.
  • contigs.newick: a FastME tree of the contigs built from the distance matrix. This can be visualised in a phylogenetic tree viewer such as FigTree, Dendroscope or Archaeopteryx.
  • One directory for each of the clusters: cluster_001, cluster_002, etc. These directories will each contain a subdirectory named 1_contigs which includes the FASTA files for the contigs in the cluster.

Choose your clusters

After running Trycycler cluster, it is up to you to choose which of the clusters are good and which are bad. This can be somewhat subjective, so there is not an exact procedure for you to follow (if there was I would have automated the decision in Trycycler). You'll have to rely on your scientific acumen!

Generally speaking, a good cluster contains many contigs (ideally one from each assembly) which are all very close to each other and have realistic read depths. A bad cluster contains a small number of contigs (maybe just one) which might have low read depths. The tree can be useful in making these decisions, though interpret it with a grain of salt, as the contig sequences are not necessarily related in a tree-like manner.

If you have prior knowledge about what your genome should look like, that information can be quite useful in deciding which clusters are good. E.g. if you happened to know that your genome contains a 150 kbp plasmid, then you can expect one of your good clusters to have contigs of about that size.

You might also decide at this point that the default value for --distance (0.01) was not quite right. E.g. if your tree contains two very close clusters that you think should actually be one cluster, you can run Trycycler cluster again with a larger distance threshold.

Another thing to keep in mind: contamination can happen. I most often see this occur with cross-barcode contamination, where a contig in one assembly actually belongs to a different genome from the same multiplexed sequencing run. Read more about this on Trycycler's FAQ.

If and when you decide that a particular cluster is bad, you should delete its entire directory (e.g. cluster_002) at this point. Alternatively, you can rename the directory (e.g. changing it to bad_cluster_002) so we can later glob for good clusters with cluster_*. You can also delete individual contigs from a cluster (e.g. trycycler/cluster_002/1_contigs/A_contig_3.fasta) that you do not want to use. You should be left with only the good clusters before proceeding to the next step (Trycycler reconcile).

Generating more assemblies (if needed)

If some of your input assemblies look quite bad (i.e. their contigs do not neatly cluster with contigs from other assemblies), you might want to:

  • Throw out the bad assemblies (delete their FASTA files).
  • Generate some fresh input assemblies using different read subsets and/or assemblers.
  • Run trycycler cluster again and hopefully get better clusters.

It's up to you as to whether it's worth going down this road. As a rough guide, if I had 12 input assemblies and one looked quite bad, I would just choose my clusters and move on. But if I had 12 input assemblies and seven were bad, I would throw out the bad ones, generate some new assemblies and try clustering again.

Clustering examples

Since this step involves human judgement, I'll show some examples so you know what kind of things to expect.

Example 1

Clustering example 1 This is a tidy clustering example. Every contig falls into one of three tight clusters. Cluster 1 contains long sequences and so probably is the chromosome. Clusters 2 and 3 have shorter sequences and probably are plasmids.

If you look closely, you'll notice some odd contig lengths in cluster 3: two ~6 kbp, one ~9 kbp and two ~12 kbp. Despite these length differences, they have clustered tightly together (i.e. they have very similar k-mers according to Mash). The most likely explanation is that there were some circularisation issues with this plasmid – an unfortunately common problem with small plasmids in long-read assemblies. Perhaps the real plasmid size is ~6 kbp, but some assemblies have duplicated it entirely (to ~12 kbp) and one assembly partially duplicated it (to ~9 kbp). This cluster is okay for now, but the length differences will need to be rectified in the next step (Trycycler reconcile).

Example 2

Clustering example 2 Clusters 1 and 3 look pretty solid in this example. However, one contig in cluster 3 (G_ctg3) has a bizarre length compared to the rest of the cluster and will need to be fixed/excluded in the next step (Trycycler reconcile).

Cluster 2 looks okay, though there are two closely-related singleton clusters: 4 and 6. Since clusters 4 and 6 only have one sequence each, my best guess is that they represent fragments or misassemblies of the cluster 2 plasmid and should therefore be excluded. Cluster 5 is distant from everything else and has a low read depth, so it should certainly be excluded.

I would therefore consider clusters 1, 2 and 3 to the be the good clusters in this example. Clusters 4, 5 and 6 should be deleted:
rm -r trycycler/cluster_004
rm -r trycycler/cluster_005
rm -r trycycler/cluster_006

Example 3

Clustering example 3 Cluster 1 in this example looks like the chromosome and is straightforward.

Cluster 2 has a couple of oddities: two of its contigs (E_Utg1 and F_Utg0) didn't group with the others as nicely in the tree, and two others (G_ctg2 and H_ctg2) seem a bit shorter than the rest. You could delete the strange contigs now and reduce the cluster to the four other contigs (A_contig_2, B_contig_3, C_utg000003c and D_utg000003c). Or you could just proceed and see if they cause problems later when running Trycycler reconcile (and delete them then if they do).

Cluster 3 looks good. Its closeness to cluster 2 on the tree probably indicates that the cluster 2 plasmid and cluster 3 plasmid share some sequence in common.

Cluster 4 only has two contigs, but it looks like there are two other contigs (A_contig_6 and B_contig_6) which are close but didn't quite make it into the same cluster. It might be worth rerunning Trycycler cluster with a slightly larger --distance value so those combine into a single cluster of four contigs.

Clusters 5 and 6 look okay, though it's a shame that they don't have more contigs. This isn't too surprising though – small plasmids are sometimes tough for long-read assemblers.

Example 4

Clustering example 4 Cluster 1 in this example looks good. Cluster 2 is close to cluster 1 but contains a shorter contig (probably an incomplete assembly of the chromosome) and can be thrown out:
rm -r trycycler/cluster_002
A similar situation has happened with clusters 3 and 7: cluster 3 looks good and cluster 7 is an outgroup with a single shorter contig. I would toss out cluster 7:
rm -r trycycler/cluster_007
Cluster 4 looks good except that the small plasmid duplication problem has shown up again: two of the contigs are double the length of the other two. These will need to be manually fixed up in the next step (Trycycler reconcile).

Clusters 5 and 6 are interesting. My first guess is that they represent the same plasmid but were just a bit too divergent to cluster together (and again had a length duplication). It would be worth investigating a bit to see if this is indeed the case (e.g. BLASTing those contig sequences to known plasmids). If it is, I would try running Trycycler cluster again with a slightly larger --distance value to get those two clusters to combine into one.

Example 5

Clustering example 5 Most of the clusters in this example are clean: 1, 2, 3, 4, 5 and 6 all look pretty good. Cluster 7 looks spurious and I would throw it out. Cluster 8 looks like a not-quite-right assembly of the cluster 6 plasmid, and I'd throw it out as well:
rm -r trycycler/cluster_007
rm -r trycycler/cluster_008
Cluster 9 is a tricky one. It could be a very small plasmid which only showed up in two of the eight input assemblies, in which case it would be legit. Or it could be that assemblies A and B both created the same spurious nonsense. A closer look at those sequences (e.g. BLASTing to known plasmid sequences) would be warranted to decide which is the case.

Example 6

Clustering example 6 This is a particularly messy example! The chromosome looks to be in cluster 1, though it's unfortunate that only four of the eight assemblies (A, B, C and D) seem to have assembled it cleanly. The other four (E, F, G and H) have created shorter sequences that fell into clusters 2 and 3.

The rest of the contigs are confusing. While there are some comprehensible parts (cluster 6 looks nice), there is a lot of strangeness going on. For example, cluster 7 has tons of representatives, often multiple ones from the same assembly (especially D). It's not clear what's going on here, but I don't like it.

A case like this raises a lot of questions, so I would not proceed with the Trycycler pipeline. Perhaps you could get cleaner assemblies by tweaking the assembler parameters or by QC filtering the input reads. Or it may be that the reads are insufficient, and longer/deeper sequencing will be required to run Trycycler on this genome.
Clone this wiki locally