-
Notifications
You must be signed in to change notification settings - Fork 28
Good dataset analysis
To begin, I ran Trycycler cluster on the assemblies:
trycycler cluster --reads reads.fastq.gz --assemblies assemblies/*.fasta --out_dir trycycler
Which produced this output. Four clusters were made, the first of which seems to be the the chromosome (10 contigs, one from each input assembly) and the second cluster (six contigs) seems to be the plasmid. Clusters 3 and 4 are singletons with only one contig each, both from assembly B.
Looking at the tree helps to see what happened here:
Clusters 3 and 4 are both near cluster 2, indicating that they share a lot of k-mers with cluster 2. I would therefore conclude that they are fragmented/incomplete versions of the plasmid, i.e. assembly B botched the plasmid sequence.
I would like to keep clusters 1 and 2 but not 3 and 4, so I rename the bad ones to get them out of the way:
mv trycycler/cluster_003 trycycler/bad_cluster_003
mv trycycler/cluster_004 trycycler/bad_cluster_004
Prefixing the clusters with bad_
will allow me to later glob for all good clusters with trycycler/cluster_*
. Alternatively, you could just delete the bad cluster directories.
I then ran Trycycler reconcile on the first (chromosomal) cluster:
trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_001
Which produced this output. The initial check and circularisation steps did well, but there was a problem with the final alignments. Contig F_utg000001c
had a markedly lower identity to the other contigs, indicating that something is terribly wrong in its sequence. I can also see that the circularised version of F_utg000001c
is about 25 kbp longer than the other sequences, so perhaps it contains some sort of erroneous duplication.
I then removed this bad sequence:
mv trycycler/cluster_001/1_contigs/F_utg000001c.fasta trycycler/cluster_001/1_contigs/F_utg000001c.fasta.bad
Adding .bad
to the end of the sequence name excludes it, as Trycycler grabs all *.fasta
files in the 1_contigs
directory. Alternatively, you could just delete the file.
I then ran Trycycler reconcile again:
trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_001
Which produced this output. This time it completed without error and everything looks good!
I then ran Trycycler reconcile on the second (plasmid) cluster:
trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_002
Which produced this output. The command stopped with an error almost immediately because one of the sequences (D_contig_2
) is much longer than the rest – about twice as long. This makes me suspect a common mistake that long-read assemblers make with small plasmids: doubling the sequence in a single contig.
At this point we could just remove D_contig_2.fasta
(by renaming it to end with .bad
like we did in cluster 1), but since we only have six sequences in this cluster, I'm going to try to repair the sequence instead.
To do so, I open the sequence in a text editor and randomly select a piece of sequence near the start (avoiding the region very close the start of the contig, as sometimes contig sequences are less reliable near their start/end). I ended up using the 30-mer CCAAACTCTACAAAAGAGTTTCAATCGATC
but plenty of other sequences would of course work too. Searching for this sequence reveals two hits: one near the start (where I originally got the sequence) and another in the second half of the contig. The spacing between these two is ~7.5 kbp, which happens to be the length of the other contigs in cluster 2. This all supports my suspicion that D_contig_2
is a doubled version of the plasmid. I therefore fix the sequence by deleting everything before the first hit (1476 bp) and everything after the second hit (including the second hit itself, 5994 bp). This resulted in a fixed D_contig_2
sequence of 7471 bp.
I then tried running Trycycler reconcile again:
trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_002
Which produced this output, and this time it finished!
You can see in the output that during the circularisation step, D_contig_2
was deemed already circular – consistent with our manual intervention for that sequence. This cluster's final alignment check looks okay. Two sequences (G_utg000002c
and I_utg000002c
) are a bit messier than the rest, but they aren't too bad so I'll leave them in. Whether I leave or remove slightly-worse-than-the-rest contigs depends on how many sequences I have in the cluster. E.g. if there were 12 sequences in the cluster and two looked a bit messy, I would exclude them (and then re-run Trycycler reconcile) because 10 sequences is still plenty. But in this case we only have six sequences in the cluster and removing them would take us down to only four, so I'm inclined to leave them in.
The hard work is now behind us! Time to run Trycycler MSA on each cluster:
trycycler msa --cluster_dir trycycler/cluster_001
trycycler msa --cluster_dir trycycler/cluster_002
(cluster 1 MSA output and cluster 2 MSA output)
In case you're curious, you can peak inside the MSA for cluster 2 and see why those two contigs (G_utg000002c
and I_utg000002c
) had worse alignments: they each contain a modest insertion. I_utg000002c
's insertion is ~1150 bp into the alignment and G_utg000002c
's insertion is ~6400 bp in.
Then Trycycler partition:
trycycler partition --reads reads.fastq.gz --cluster_dirs trycycler/cluster_*
Then Trycycler consensus on each cluster:
trycycler consensus --cluster_dir trycycler/cluster_001
trycycler consensus --cluster_dir trycycler/cluster_002
(cluster 1 consensus output and cluster 2 consensus output)
All of these steps ran without any problem!
I then concatenated each cluster's consensus into a single FASTA for the genome:
cat trycycler/cluster_*/7_final_consensus.fasta > assembly.fasta
This example involved a bit of manual intervention and judgement, but we were still able to get a clean assembly. Comparing the resulting consensus assembly back to the reference reveals plenty of homopolymer-length errors (created by systematic errors in the input reads) but other than that, the assembly looks great!
- Home
- Software requirements
- Installation
-
How to run Trycycler
- Quick start
- Step 1: Generating assemblies
- Step 2: Clustering contigs
- Step 3: Reconciling contigs
- Step 4: Multiple sequence alignment
- Step 5: Partitioning reads
- Step 6: Generating a consensus
- Step 7: Polishing after Trycycler
- Illustrated pipeline overview
- Demo datasets
- Implementation details
- FAQ and miscellaneous tips
- Other pages
- Guide to bacterial genome assembly (choose your own adventure)
- Accuracy vs depth