-
Notifications
You must be signed in to change notification settings - Fork 28
Polishing after Trycycler
Assuming your long reads are from an Oxford Nanopore sequencer, you can run Medaka on Trycycler's consensus sequences to further increase their accuracy. Medaka uses FASTQ reads as input (as opposed to raw-signal FAST5 reads) which makes it easy to run on Trycycler's clusters with partitioned reads. And last time I checked, Medaka gave the best results for Nanopore-only assemblies.
The commands could look something like this, using a Bash loop to run Medaka on each cluster:
for c in trycycler/cluster_*; do
medaka_consensus -i "$c"/4_reads.fastq -d "$c"/7_final_consensus.fasta -o "$c"/medaka -m r941_min_sup_g507 -t 12
mv "$c"/medaka/consensus.fasta "$c"/8_medaka.fasta
rm -r "$c"/medaka "$c"/*.fai "$c"/*.mmi # clean up
done
Note that you should change the model parameter (-m
) to whatever is most appropriate for your basecalling.
You can then combine the Medaka-polished sequences into a single FASTA file:
cat trycycler/cluster_*/8_medaka.fasta > trycycler/consensus.fasta
If you also have Illumina reads, then you could use Pilon to polish more. The following instructions show how I do Pilon-polishing of a Trycycler assembly.
- fastp for Illumina read QC
- Bowtie2 for Illumina read alignment
- Samtools for BAM processing
- Pilon for assembly polishing
-
consensus.fasta
: the Trycycler assembly to be polished (post-Medaka if possible). This polishing is done on the entire genome (i.e. not per-cluster). -
reads_1.fastq.gz
,reads_2.fastq.gz
: paired Illumina reads
fastp --in1 reads_1.fastq.gz --in2 reads_2.fastq.gz --out1 1.fastq.gz --out2 2.fastq.gz --unpaired1 u.fastq.gz --unpaired2 u.fastq.gz
This produces three output Illumina read files: 1.fastq.gz
and 2.fastq.gz
(the paired reads) and u.fastq.gz
(the unpaired reads that were orphaned by the QC process).
Pilon likes to know whether reads aligned in a proper pair or not, so it's important to give Bowtie2 appropriate fragment length parameters.
Paired-end insert sizes (a.k.a. fragment lengths) vary between read sets, so we can do an initial alignment to get an estimate of the typical insert range:
bowtie2-build consensus.fasta consensus.fasta
bowtie2 -1 1.fastq.gz -2 2.fastq.gz -x consensus.fasta --fast --threads 16 -I 0 -X 1000 -S insert_size_test.sam
Then open a Python interpreter (run python3
) and paste in this code to get the 1st and 99th percentiles of the insert size:
import math
def get_percentile(sorted_list, percentile):
rank = int(math.ceil(percentile / 100.0 * len(sorted_list)))
if rank == 0:
return sorted_list[0]
return sorted_list[rank - 1]
insert_sizes = []
with open('insert_size_test.sam', 'rt') as sam:
for sam_line in sam:
try:
sam_parts = sam_line.split('\t')
sam_flags = int(sam_parts[1])
if sam_flags & 2: # read mapped in proper pair
insert_size = int(sam_parts[8])
if insert_size > 0:
insert_sizes.append(insert_size)
except (ValueError, IndexError):
pass
insert_sizes = sorted(insert_sizes)
insert_size_1st = get_percentile(insert_sizes, 1.0)
insert_size_99th = get_percentile(insert_sizes, 99.0)
print(insert_size_1st, insert_size_99th)
If you're feeling impatient, you can run this Python code before Bowtie2 finishes, and then kill Bowtie2. I.e. you don't need to align all the reads to get a good estimate of the insert size range. Letting it run for a couple minutes is probably enough to get a sufficient number of alignments to work with.
Now that you've got your values, you can clean up:
rm *.bt2 insert_size_test.sam
To make it easier to reuse this code, we can set some Bash variables before we start:
before=consensus
after=round_1
threads=16 # set to the number of alignment threads you want to use
insert_min=198 # set to the insert size 1st percentile
insert_max=663 # set to the insert size 99th percentile
We then perform the read alignments, creating two separate BAMs: one for paired reads and one for unpaired reads:
bowtie2-build "$before".fasta "$before".fasta
bowtie2 -1 1.fastq.gz -2 2.fastq.gz -x "$before".fasta --threads "$threads" -I "$insert_min" -X "$insert_max" --local --very-sensitive-local | samtools sort > illumina_alignments.bam
bowtie2 -U u.fastq.gz -x "$before".fasta --threads "$threads" --local --very-sensitive-local | samtools sort > illumina_alignments_u.bam
samtools index illumina_alignments.bam
samtools index illumina_alignments_u.bam
We then run Pilon:
pilon --genome "$before".fasta --frags illumina_alignments.bam --unpaired illumina_alignments_u.bam --output "$after" --changes
And clean up:
rm *.bam *.bam.bai *.bt2
sed -i 's/_pilon//' "$after".fasta # remove "_pilon" from the FASTA headers
You can then do another round of Pilon by changing the $before
and $after
variables:
before=round_1
after=round_2
And then run the code from step 3 again.
The second round will hopefully produce many fewer changes than the first round (perhaps none at all). You can check this by looking in the *.changes
file that Pilon makes.
How many rounds of Pilon you run is up to you. You can continue to do polish until Pilon stops making any changes. Assuming your input assembly and read sets are good, this will hopefully happen after just a round or two.
- Home
- Software requirements
- Installation
-
How to run Trycycler
- Quick start
- Step 1: Generating assemblies
- Step 2: Clustering contigs
- Step 3: Reconciling contigs
- Step 4: Multiple sequence alignment
- Step 5: Partitioning reads
- Step 6: Generating a consensus
- Step 7: Polishing after Trycycler
- Illustrated pipeline overview
- Demo datasets
- Implementation details
- FAQ and miscellaneous tips
- Other pages
- Guide to bacterial genome assembly (choose your own adventure)
- Accuracy vs depth