Polishing after Trycycler

Medaka

Assuming your long reads are from an Oxford Nanopore sequencer, you can run Medaka on Trycycler's consensus sequences to further increase their accuracy. Medaka uses FASTQ reads as input (as opposed to raw-signal FAST5 reads) which makes it easy to run on Trycycler's clusters with partitioned reads. And last time I checked, Medaka gave the best results for Nanopore-only assemblies.

The commands could look something like this, using a Bash loop to run Medaka on each cluster:

for c in trycycler/cluster_*; do
    medaka_consensus -i "$c"/4_reads.fastq -d "$c"/7_final_consensus.fasta -o "$c"/medaka -m r941_min_sup_g507 -t 12
    mv "$c"/medaka/consensus.fasta "$c"/8_medaka.fasta
    rm -r "$c"/medaka "$c"/*.fai "$c"/*.mmi  # clean up
done

Note that you should change the model parameter (-m) to whatever is most appropriate for your basecalling.

You can then combine the Medaka-polished sequences into a single FASTA file:

cat trycycler/cluster_*/8_medaka.fasta > trycycler/consensus.fasta

Short-read polishing

If you also have Illumina reads, then you can use short-read polishers to further improve your assembly. The following instructions show how I do short-read polishing of a Trycycler assembly.

Software requirements

fastp for Illumina read QC
BWA for Illumina read alignment
Samtools for BAM processing
Polypolish for assembly polishing
POLCA for more assembly polishing

Input files

consensus.fasta: the Trycycler assembly to be polished (post-Medaka if possible). This polishing is done on the entire genome (i.e. not per-cluster).
reads_1.fastq.gz, reads_2.fastq.gz: paired Illumina reads

Step 1: read QC

fastp --in1 reads_1.fastq.gz --in2 reads_2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz

This produces three output Illumina read files: 1.fastq.gz and 2.fastq.gz (the paired reads) and u.fastq.gz (the unpaired reads that were orphaned by the QC process). The u.fastq.gz file should contain very few reads, and so I usually discard it for simplicity. If that's not the case (i.e. if your u.fastq.gz file is big) then something might be wrong with your read set!

Step 2: Polypolish

Check out the Polypolish page more info, but here's a short version:

bwa index consensus.fasta
bwa mem -t 16 -a consensus.fasta 1.fastq.gz > alignments_1.sam
bwa mem -t 16 -a consensus.fasta 2.fastq.gz > alignments_2.sam
polypolish consensus.fasta alignments_1.sam alignments_2.sam > polypolish.fasta
rm alignments_1.sam alignments_2.sam

Step 3: POLCA

Check out the POLCA page more info, but here's a short version:

polca.sh -a polypolish.fasta -r "1.fastq.gz 2.fastq.gz" -t 16 -m 1G
mv *.PolcaCorrected.fa polypolish_polca.fasta

Step 4: (optional) use other polishers

Home
Software requirements
Installation
How to run Trycycler
- Quick start
- Step 1: Generating assemblies
- Step 2: Clustering contigs
- Step 3: Reconciling contigs
- Step 4: Multiple sequence alignment
- Step 5: Partitioning reads
- Step 6: Generating a consensus
- Step 7: Polishing after Trycycler
Illustrated pipeline overview
Demo datasets
Implementation details
FAQ and miscellaneous tips
Other pages
- Guide to bacterial genome assembly (choose your own adventure)
- Accuracy vs depth

Provide feedback

Saved searches

Use saved searches to filter your results more quickly