-
Notifications
You must be signed in to change notification settings - Fork 28
Good dataset analysis
To begin, I ran Trycycler cluster on the assemblies:
trycycler cluster --reads reads.fastq.gz --assemblies assemblies/*.fasta --out_dir trycycler
Which produced this output.
Four clusters were made, the first of which seems to be the the chromosome (10 contigs, one from each input assembly) and the second cluster (six contigs) seems to be the plasmid. Clusters 3 and 4 are singletons with only one contig each, both from assembly B.
Looking at the tree helps to see what happened here. Clusters 3 and 4 are both near cluster 2, implying that they share a lot of k-mers in common with the plasmid. I would therefore conclude that they are fragmented/incomplete versions of the plasmid, i.e. assembly B botched the plasmid sequence.
I would like to keep clusters 1 and 2 but not 3 and 4, so I rename the bad ones to get them out of the way:
mv trycycler/cluster_003 trycycler/bad_cluster_003
mv trycycler/cluster_004 trycycler/bad_cluster_004
Prefixing the clusters with bad_
will allow me to later glob for all good clusters with trycycler/cluster_*
. Alternatively, you could just delete the directories.
I then ran Trycycler reconcile on the first (chromosomal) cluster:
trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_001
Which produced this output.
The initial check and circularisation steps did well, but there was a problem with the final alignments. Contig F_utg000001c
had a markedly lower identity to the other contigs, indicating that something is terribly wrong in its sequence. I can also see that the circularised version of F_utg000001c
is about 25 kbp longer than the other sequences, so perhaps it contains some sort of erroneous duplication.
I then removed this bad sequence:
mv trycycler/cluster_001/1_contigs/F_utg000001c.fasta trycycler/cluster_001/1_contigs/F_utg000001c.fasta.bad
Adding .bad
to the end of the sequence name excludes it, as Trycycler grabs all *.fasta
files in the 1_contigs
directory. Alternatively, you could just delete the file.
I then ran Trycycler reconcile again:
trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_001
Which produced this output. This time it completed without error and everything looks good!
I then ran Trycycler reconcile on the second (plasmid) cluster:
trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_002
Which produced this output.
The command stops with an error almost immediately because one of the sequences (D_contig_2
) is much longer than the rest – about twice as long. This makes me suspect a common mistake that assemblers make with small plasmids: doubling the sequence in a single contig.
At this point we could just remove D_contig_2.fasta
(by renaming it to end with .bad
like we did in cluster 1), but since we only have six sequences in this cluster, I'm going to try to repair the sequence instead.
To do so, I open the sequence in a text editor and randomly select a piece of sequence near the start (avoiding the region very close the start of the contig, as sometimes contig sequences are less reliable near their start/end). I ended up using the 30-mer CCAAACTCTACAAAAGAGTTTCAATCGATC
but plenty of other sequences would of course work too. Searching for this sequence reveals two hits: near the start where originally got the sequence and another hit in the second half of the contig. The spacing between these two is ~7.5 kbp, which happens to be the length of the other contigs in cluster 2. This all supports my suspicion that D_contig_2
is a doubled version of the plasmid. I therefore fix the sequence by deleting everything before the first hit (1476 bp) and everything after the second hit (including the second hit itself, 5994 bp). This resulted in a fixed D_contig_2
sequence of 7471 bp.
I then tried running Trycycler reconcile again:
trycycler reconcile --reads reads.fastq.gz --cluster_dir trycycler/cluster_002
Which produced this output, and this time it finished!
You can see in the output that during the circularisation step, D_contig_2
was deemed already circular – consistent with our manual intervention for that sequence.
This cluster's final alignment check looks okay. One sequence (G_utg000002c
) is a bit messier than the rest, but it's not too bad so I'll leave it in. Whether I leave or remove slightly-worse-than-the-rest contig depends on how many sequences I have in the cluster. E.g. if there were 12 sequences in the cluster and one looks a bit messy, I would exclude it (and then re-run Trycycler reconcile) because 11 sequences is still plenty. But in this case we only have six sequences in the cluster, so I'm more inclined to leave it in.
- Home
- Software requirements
- Installation
-
How to run Trycycler
- Quick start
- Step 1: Generating assemblies
- Step 2: Clustering contigs
- Step 3: Reconciling contigs
- Step 4: Multiple sequence alignment
- Step 5: Partitioning reads
- Step 6: Generating a consensus
- Step 7: Polishing after Trycycler
- Illustrated pipeline overview
- Demo datasets
- Implementation details
- FAQ and miscellaneous tips
- Other pages
- Guide to bacterial genome assembly (choose your own adventure)
- Accuracy vs depth