-
Notifications
You must be signed in to change notification settings - Fork 28
Clustering contigs
Before this step, you'll need multiple separate assemblies for your isolate (see Generating assemblies). I will assume they can be found here: assemblies/*.fasta
.
You'll also need the long-read set for your isolate. I'd recommend using the entire long-read set here, perhaps with some light QC, but not one of the subsampled sets you used for assembly. However, if you have an enormous number of reads, some additional QC filtering might be prudent for the sake of speed performance. I'll assume the reads are in a file named reads.fastq
.
The goal of this step is to cluster the contigs of your input assemblies into per-replicon groups. It also serves to exclude any spurious, incomplete or badly misassembled contigs.
For example, let's assume you're assembling a genome with one chromosome and two plasmids. Ideally, each of your input assemblies would have three contigs (one for each of the replicons), but that may not be the case. Perhaps one of the assemblies is missing a plasmid. Perhaps one of the assemblies has a spurious extra contig. Or perhaps one of the assemblies has accidentally assembled the two plasmids together into a single contig. This step will help you to identify three contig clusters (one per replicon) and exclude any problematic contigs.
If your contigs do not form clear clusters, that indicates that the input assemblies are inconsistent and unreliable. If you find yourself in this situation (struggling to identify which clusters are good and which are bad), then you probably need to get better long-read data (longer and/or deeper) and try again.
trycycler cluster --assemblies assemblies/*.fasta --reads reads.fastq --out_dir trycycler
This command will carry out complete-linkage clustering of all contig sequences based on their Mash distance.
This step also assigns letters to each of the input assemblies. This is mainly for brevity, so a long filename like e_coli_sample_45_flye_assembly_1_read_subset_1.fasta
can be subsequently referred to as A
in Trycycler's output.
Most of Trycycler cluster's time is spent aligning reads to assemblies (to assign contig depths). Larger read sets will therefore be slower, and using more threads will make it faster. Assuming a typical-sized read set and a dozen or so threads, I'd expect it to finish in ~10 minutes or less.
In addition to the required arguments (input assemblies, a long-read set and an output directory), there are a few parameters you can tweak:
-
--min_contig_len
: contigs shorter than this are thrown out on the assumption that they are either incomplete or spurious. The default value is 1000, as plasmids smaller than that are very rare. If you are assembling a genome with exceptionally small plasmids (e.g. Variovorax), you should reduce this value as appropriate. -
--min_contig_depth
: this controls how Trycycler filters out contigs with a low read depth. It is a multiple of the mean read depth for the assembly. For example, if an assembly has a mean depth of 90× and this setting is 0.1 (the default), then any contig with <9× depth will be removed. -
--distance
: this is the Mash distance threshold used when defining clusters, and the default threshold is 0.01. Smaller thresholds (e.g. 0.005) can result in a larger number of tighter clusters. Larger thresholds (e.g. 0.02) can result in a smaller number of looser clusters. -
--threads
: this is how many threads Trycycler will use for read alignment. It will only affect the speed performance, so you'll probably want to use as many threads as you have available.
Trycycler cluster will create an output directory (trycycler
in the example above) which contains the following:
-
contigs.phylip
: a matrix of the Mash distances between all contigs in PHYLIP format. -
contigs.newick
: a FastME tree of the contigs built from the distance matrix. This can be visualised in a phylogenetic tree viewer such as FigTree, Dendroscope or Archaeopteryx. - One directory for each of the clusters:
cluster_001
,cluster_002
, etc. These directories will each contain a subdirectory named1_contigs
which includes the FASTA files for the contigs in the cluster.
After running Trycycler cluster, it is up to you to choose which of the clusters are good and which are bad. This can be somewhat subjective, so there is not an exact procedure for you to follow (if there was I would have automated the decision in Trycycler). You'll have to rely on your scientific acumen!
Generally speaking, a good cluster contains many contigs (ideally one from each assembly) which are all very close to each other and have realistic read depths. A bad cluster contains a small number of contigs (maybe just one) which might have low read depths. The tree can be useful in making these decisions, though interpret it with a grain of salt, as the contig sequences are not necessarily related in a tree-like manner.
If you have prior knowledge about what your genome should look like, that information can be quite useful in deciding which clusters are good. E.g. if you happened to know that your genome contains a 150 kbp plasmid, then you can expect one of your good clusters to have contigs of about that size.
You might also decide at this point that the default value for --distance
(0.01) was not quite right. E.g. if your tree contains two very close clusters that you think should actually be one cluster, you can run Trycycler cluster again with a larger distance threshold.
Another thing to keep in mind: contamination can happen. I most often see this occur with cross-barcode contamination, where a contig in one assembly actually belongs to a different genome from the same multiplexed sequencing run. Read more about this on Trycycler's FAQ.
If and when you decide that a particular cluster is bad, you should delete its entire directory (e.g. cluster_002
) at this point. Alternatively, you can rename the directory (e.g. changing it to bad_cluster_002
) so we can later glob for good clusters with cluster_*
. You can also delete individual contigs from a cluster (e.g. trycycler/cluster_002/1_contigs/A_contig_3.fasta
) that you do not want to use. You should be left with only the good clusters before proceeding to the next step (Trycycler reconcile).
If some of your input assemblies look quite bad (i.e. their contigs do not neatly cluster with contigs from other assemblies), you might want to:
- Throw out the bad assemblies (delete their FASTA files).
- Generate some fresh input assemblies using different read subsets and/or assemblers.
- Run
trycycler cluster
again and hopefully get better clusters.
It's up to you as to whether it's worth going down this road. As a rough guide, if I had 12 input assemblies and one looked quite bad, I would just choose my clusters and move on. But if I had 12 input assemblies and seven were bad, I would throw out the bad ones, generate some new assemblies and try clustering again.
Since this step involves human judgement, I'll show some examples so you know what kind of things to expect.
![]() If you look closely, you'll notice some odd contig lengths in cluster 3: two ~6 kbp, one ~9 kbp and two ~12 kbp. Despite these length differences, they have clustered tightly together (i.e. they have very similar k-mers according to Mash). The most likely explanation is that there were some circularisation issues with this plasmid – an unfortunately common problem with small plasmids in long-read assemblies. Perhaps the real plasmid size is ~6 kbp, but some assemblies have duplicated it entirely (to ~12 kbp) and one assembly partially duplicated it (to ~9 kbp). This cluster is okay for now, but the length differences will need to be rectified in the next step (Trycycler reconcile). |
![]() Cluster 2 looks okay, though there are two closely-related singleton clusters: 4 and 6. Since clusters 4 and 6 only have one sequence each, my best guess is that they represent fragments or misassemblies of the cluster 2 plasmid and should therefore be excluded. Cluster 5 is distant from everything else and has a low read depth, so it should certainly be excluded. I would therefore consider clusters 1, 2 and 3 to the be the good clusters in this example. Clusters 4, 5 and 6 should be deleted:
|
![]() Cluster 2 has a couple of oddities: two of its contigs (E_Utg1 and F_Utg0) didn't group with the others as nicely in the tree, and two others (G_ctg2 and H_ctg2) seem a bit shorter than the rest. You could delete the strange contigs now and reduce the cluster to the four other contigs (A_contig_2, B_contig_3, C_utg000003c and D_utg000003c). Or you could just proceed and see if they cause problems later when running Trycycler reconcile (and delete them then if they do). Cluster 3 looks good. Its closeness to cluster 2 on the tree probably indicates that the cluster 2 plasmid and cluster 3 plasmid share some sequence in common. Cluster 4 only has two contigs, but it looks like there are two other contigs (A_contig_6 and B_contig_6) which are close but didn't quite make it into the same cluster. It might be worth rerunning Trycycler cluster with a slightly larger --distance value so those combine into a single cluster of four contigs.Clusters 5 and 6 look okay, though it's a shame that they don't have more contigs. This isn't too surprising though – small plasmids are sometimes tough for long-read assemblers. |
![]()
Clusters 5 and 6 are interesting. My first guess is that they represent the same plasmid but were just a bit too divergent to cluster together (and again had a length duplication). It would be worth investigating a bit to see if this is indeed the case (e.g. BLASTing those contig sequences to known plasmids). If it is, I would try running Trycycler cluster again with a slightly larger --distance value to get those two clusters to combine into one.
|
![]()
|
![]() The rest of the contigs are confusing. While there are some comprehensible parts (cluster 6 looks nice), there is a lot of strangeness going on. For example, cluster 7 has tons of representatives, often multiple ones from the same assembly (especially D). It's not clear what's going on here, but I don't like it. A case like this raises a lot of questions, so I would not proceed with the Trycycler pipeline. Perhaps you could get cleaner assemblies by tweaking the assembler parameters or by QC filtering the input reads. Or it may be that the reads are insufficient, and longer/deeper sequencing will be required to run Trycycler on this genome. |
- Home
- Software requirements
- Installation
-
How to run Trycycler
- Quick start
- Step 1: Generating assemblies
- Step 2: Clustering contigs
- Step 3: Reconciling contigs
- Step 4: Multiple sequence alignment
- Step 5: Partitioning reads
- Step 6: Generating a consensus
- Step 7: Polishing after Trycycler
- Illustrated pipeline overview
- Demo datasets
- Implementation details
- FAQ and miscellaneous tips
- Other pages
- Guide to bacterial genome assembly (choose your own adventure)
- Accuracy vs depth