-
Notifications
You must be signed in to change notification settings - Fork 28
Polishing after Trycycler
Assuming your long reads are from an Oxford Nanopore sequencer, you can run Medaka on Trycycler's consensus sequences to further increase their accuracy. Medaka uses FASTQ reads as input (as opposed to raw-signal FAST5 reads) which makes it easy to run on Trycycler's clusters with partitioned reads. And last time I checked, Medaka gave the best results for Nanopore-only assemblies.
The commands could look something like this, using a Bash loop to run Medaka on each cluster:
for c in trycycler/cluster_*; do
medaka_consensus -i "$c"/4_reads.fastq -d "$c"/7_final_consensus.fasta -o "$c"/medaka -m r941_min_sup_g507 -t 12
mv "$c"/medaka/consensus.fasta "$c"/8_medaka.fasta
rm -r "$c"/medaka "$c"/*.fai "$c"/*.mmi # clean up
done
Note that you should change the model parameter (-m
) to whatever is most appropriate for your basecalling. Also note that medaka_consensus
is not the same thing as medaka consensus
(underscore vs space) – the former is a convenience script which does the entire process (including read alignment) while the latter is a subcommand of Medaka which only does the polishing step.
You can then combine the Medaka-polished sequences into a single FASTA file:
cat trycycler/cluster_*/8_medaka.fasta > trycycler/consensus.fasta
If you also have Illumina reads, then you can use short-read polishers to further improve your assembly. My two favourite short-read polishers are Polypolish and Pypolca, because they seem to be very safe options – i.e. they are very unlikely to introduce errors. Check out the Polypolish manuscript for lots more detail on how the various short-read polishers perform.
The following instructions show how I do short-read polishing of a Trycycler assembly.
- fastp for Illumina read QC
- BWA for Illumina read alignment
- Polypolish for assembly polishing
- Pypolca for more assembly polishing
- other polishing tools (optional)
-
consensus.fasta
: the Trycycler assembly to be polished (post-Medaka if possible). This polishing is done on the entire genome (i.e. not per-cluster). -
reads_1.fastq.gz
,reads_2.fastq.gz
: paired-end Illumina reads
fastp --in1 reads_1.fastq.gz --in2 reads_2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
This produces three output Illumina read files: 1.fastq.gz
and 2.fastq.gz
(the paired reads) and u.fastq.gz
(the unpaired reads that were orphaned by the QC process). The u.fastq.gz
file should contain very few reads, and so I usually discard it for simplicity. If that's not the case (i.e. if your u.fastq.gz
file is big) then something might be wrong with your reads!
Check out the Polypolish repo for more info, but here's a short version:
bwa index consensus.fasta
bwa mem -t 16 -a consensus.fasta reads_1.fastq.gz > alignments_1.sam
bwa mem -t 16 -a consensus.fasta reads_2.fastq.gz > alignments_2.sam
polypolish filter --in1 alignments_1.sam --in2 alignments_2.sam --out1 filtered_1.sam --out2 filtered_2.sam
polypolish polish consensus.fasta filtered_1.sam filtered_2.sam > polypolish.fasta
Check out the pypolca repo for more info, but here's a short version:
pypolca run --careful -a polypolish.fasta -1 reads_1.fastq.gz -2 reads_2.fastq.gz -t 16 -o pypolca
cp pypolca/pypolca_corrected.fasta polypolish_pypolca.fasta
After one round of Polypolish and one round of Pypolca, your assembly should be in very good shape! However, there may still be a few lingering errors. You can try running additional rounds of Polypolish or Pypolca to see if they make any more changes.
If you're feeling adventurous and thorough, you can also try other polishing tools, like NextPolish, ntEdit, HyPo and Pilon. However, I've seen these tools add errors during polishing as well as fix errors, so be cautious.
ALE scores are well correlated with assembly accuracy, so you can run ALE before and after a polishing round. If the ALE score gets larger, any changes made are likely to be improvements. Note that ALE scores are negative so 'larger' means closer to zero.
One final note: I've seen cases where Pilon makes a large change (e.g. collapsing a repetitive region) that leads to increase in ALE score even though the change was in error. So ALE scores aren't perfect. I'd therefore recommend you be very sceptical of any large changes made by any polisher. After all, the whole point of Trycycler is to make an assembly that's free of large-scale errors.
- Home
- Software requirements
- Installation
-
How to run Trycycler
- Quick start
- Step 1: Generating assemblies
- Step 2: Clustering contigs
- Step 3: Reconciling contigs
- Step 4: Multiple sequence alignment
- Step 5: Partitioning reads
- Step 6: Generating a consensus
- Step 7: Polishing after Trycycler
- Illustrated pipeline overview
- Demo datasets
- Implementation details
- FAQ and miscellaneous tips
- Other pages
- Guide to bacterial genome assembly (choose your own adventure)
- Accuracy vs depth