-
Notifications
You must be signed in to change notification settings - Fork 28
Generating assemblies
There is no single prescribed way to prepare your assemblies. The goal is to have multiple assemblies (~12 is nice) which are reasonably independent of each other. To achieve this independence, you can use different assemblers and/or different subsets of your reads.
This page describes how I prepare my assemblies, in case you want to copy me. It's not the only way to do it, but I hope it gives you a good starting point.
I'm assuming you have long-read sequencing for your isolate which is of decent depth. In the commands below, I assume these are in a file named input_reads.fastq.gz
. Ideally your reads are at least 100×, but more is better. If you have a read set that's 200× depth or more, that's wonderful! If you have only have 50× depth or less, running Trycycler may be more difficult – you've been warned!
I'm also assuming that your reads are sufficiently long. Practically, this means longer than the longest repeat in the genome, as this will allow most assemblies to reach completion. For many bacterial genomes, the longest repeat will be the ribosomal RNA operon which is ~5.5 kbp, in which case a read N50 of 10 kbp is sufficient. However, some genomes have much longer repeats and will require longer reads. Overly-long reads rarely cause a problem, so you should aim for the longest reads you reasonably can.
It will also be handy to know the approximate size of your genome. If you're assembling something novel (i.e. you don't know the genome size), you could do a quick initial assembly at this point to get an estimate.
Finally, you'll need a couple tools installed:
As I first step, I usually do a bit of light QC of the reads. By "light" I mean that I only want to toss out very bad reads. I would rather keep mediocre reads because high read depth will help with the next step.
Something like this should work:
filtlong --min_length 1000 --keep_percent 95 input_reads.fastq.gz > reads.fastq
This will discard short reads (less than 1 kbp) and very bad reads (the worst 5%).
If you have a very deep read set, you might be able to do more aggressive QC here, but small plasmids can be a problem. Filtlong prefers longer reads to shorter reads, so if you filter aggressively with Filtlong, you might lose your small plasmid reads. I would therefore recommend that any aggressive QC filtering be done on read quality more than read length.
Now we want to subsample the QC'd reads to make multiple different read sets. This is handled by the Trycycler subsample command:
trycycler subsample --reads reads.fastq --out_dir read_subsets
Trycycler subsample tries to make maximally-independent read subsets of an appropriate depth for your genome. Read more about it on this page: How read subsampling works.
It has the following settings:
-
--count
: the number of subsampled read sets to output. The default is 12, a good number for most cases. -
--genome_size
: an estimate of the isolate's total genome size. If you don't provide this, Trycycler subsample will run miniasm to quickly assemble your read set to get a size. If you know your genome size, you can provide a value (e.g.--genome_size 5.5m
) and Trycycler subsample will run faster because it won't run miniasm. This value is used to calculate read depths and doesn't need to be exact, e.g. 10% error is fine. -
--min_read_depth
: this is the minimum allowed read depth and also controls the subsampled depths (see How read subsampling works for more information). -
--threads
: this is how many threads Trycycler will use for the miniasm assembly. It will only affect the speed performance, so you'll probably want to use as many threads as you have available. If you provided a genome size estimate, then miniasm isn't run and this setting has no effect.
When Trycycler subsample is done, you'll find your read subsets here:
read_subsets/sample_*.fastq
There are many different assemblers you can use, but I like Flye, Miniasm+Minipolish and Raven. They all do well with bacterial genomes, are reasonably fast and are easy to run. Check out the Extra-thorough assembly section below for other assemblers.
With 12 read sets (the default for Trycycler subsample) and three assemblers, it works to do four assemblies with each:
threads=16 # change as appropriate for your system
mkdir assemblies
flye --nano-hq read_subsets/sample_01.fastq --threads "$threads" --out-dir assembly_01 && cp assembly_01/assembly.fasta assemblies/assembly_01.fasta && cp assembly_01/assembly_graph.gfa assemblies/assembly_01.gfa && rm -r assembly_01
miniasm_and_minipolish.sh read_subsets/sample_02.fastq "$threads" > assemblies/assembly_02.gfa && any2fasta assemblies/assembly_02.gfa > assemblies/assembly_02.fasta
raven --threads "$threads" --disable-checkpoints --graphical-fragment-assembly assemblies/assembly_03.gfa read_subsets/sample_03.fastq > assemblies/assembly_03.fasta
flye --nano-hq read_subsets/sample_04.fastq --threads "$threads" --out-dir assembly_04 && cp assembly_04/assembly.fasta assemblies/assembly_04.fasta && cp assembly_04/assembly_graph.gfa assemblies/assembly_04.gfa && rm -r assembly_04
miniasm_and_minipolish.sh read_subsets/sample_05.fastq "$threads" > assemblies/assembly_05.gfa && any2fasta assemblies/assembly_05.gfa > assemblies/assembly_05.fasta
raven --threads "$threads" --disable-checkpoints --graphical-fragment-assembly assemblies/assembly_06.gfa read_subsets/sample_06.fastq > assemblies/assembly_06.fasta
flye --nano-hq read_subsets/sample_07.fastq --threads "$threads" --out-dir assembly_07 && cp assembly_07/assembly.fasta assemblies/assembly_07.fasta && cp assembly_07/assembly_graph.gfa assemblies/assembly_07.gfa && rm -r assembly_07
miniasm_and_minipolish.sh read_subsets/sample_08.fastq "$threads" > assemblies/assembly_08.gfa && any2fasta assemblies/assembly_08.gfa > assemblies/assembly_08.fasta
raven --threads "$threads" --disable-checkpoints --graphical-fragment-assembly assemblies/assembly_09.gfa read_subsets/sample_09.fastq > assemblies/assembly_09.fasta
flye --nano-hq read_subsets/sample_10.fastq --threads "$threads" --out-dir assembly_10 && cp assembly_10/assembly.fasta assemblies/assembly_10.fasta && cp assembly_10/assembly_graph.gfa assemblies/assembly_10.gfa && rm -r assembly_10
miniasm_and_minipolish.sh read_subsets/sample_11.fastq "$threads" > assemblies/assembly_11.gfa && any2fasta assemblies/assembly_11.gfa > assemblies/assembly_11.fasta
raven --threads "$threads" --disable-checkpoints --graphical-fragment-assembly assemblies/assembly_12.gfa read_subsets/sample_12.fastq > assemblies/assembly_12.fasta
If all went well, you will now have 12 assemblies in FASTA format (assembly_01.fasta
to assembly_12.fasta
) in a directory named assemblies
. These should all be reasonably similar (they are of the same genome) and you're now ready to run Trycycler to make a consensus assembly! The assemblies are also available in GFA format which can be useful for manual curation (see the next section).
Now that you have your input assemblies, you no longer need the subsampled reads and can clean them up:
rm -r read_subsets
Before continuing to the next step, you might want to manually curate your assemblies, i.e. look at them and judge which look good and which look bad, so you can throw out the bad ones before continuing. 'Bad' in this context means fragmented. My preference is to look at them in Bandage. This is especially useful for assemblers that give you assemblies in GFA graph format, as contig circularity (obvious in Bandage) is a decent indicator of completeness.
This manual curation step can be especially useful if you've got a tough-to-assemble genome and a lot of your input assemblies are fragmented. Giving lots of bad assemblies to Trycycler will result in very messy clusters, so cleaning things up at this point can help. If you had to throw out a lot of your input assemblies (e.g. you made 12 assemblies and threw out 8 of them), then you should consider starting over with a larger number of read subsets (e.g. run Trycycler subsample with --count 24
).
If you want to be very thorough during this assembly step, you can use more read subsets and more assemblers. In addition to Flye, Miniasm+Minipolish and Raven (used above), I like Canu, NECAT and NextDenovo/NextPolish. These six assemblers were the best performing tools in my Benchmarking of long-read assemblers for prokaryote whole genome sequencing paper.
If we produce 24 subsampled read sets, we can do four assemblies with each of the six assemblers:
# Change as appropriate for your system and genome:
threads=16
nextdenovo_dir="/path/to/NextDenovo"
nextpolish_dir="/path/to/NextPolish"
genome_size="5500000"
trycycler subsample --reads reads.fastq --out_dir read_subsets --count 24 --genome_size "$genome_size"
mkdir assemblies
for i in 01 07 13 19; do
canu -p canu -d canu_temp -fast genomeSize="$genome_size" useGrid=false maxThreads="$threads" -nanopore read_subsets/sample_"$i".fastq
canu_trim.py canu_temp/canu.contigs.fasta > assemblies/assembly_"$i".fasta
rm -rf canu_temp
done
for i in 02 08 14 20; do
flye --nano-hq read_subsets/sample_"$i".fastq --threads "$threads" --out-dir flye_temp
cp flye_temp/assembly.fasta assemblies/assembly_"$i".fasta
cp flye_temp/assembly_graph.gfa assemblies/assembly_"$i".gfa
rm -r flye_temp
done
for i in 03 09 15 21; do
miniasm_and_minipolish.sh read_subsets/sample_"$i".fastq "$threads" > assemblies/assembly_"$i".gfa
any2fasta assemblies/assembly_"$i".gfa > assemblies/assembly_"$i".fasta
done
for i in 04 10 16 22; do
necat.pl config config.txt
realpath read_subsets/sample_"$i".fastq > read_list.txt
sed -i "s/PROJECT=/PROJECT=necat/" config.txt
sed -i "s/ONT_READ_LIST=/ONT_READ_LIST=read_list.txt/" config.txt
sed -i "s/GENOME_SIZE=/GENOME_SIZE="$genome_size"/" config.txt
sed -i "s/THREADS=4/THREADS="$threads"/" config.txt
necat.pl bridge config.txt
cp necat/6-bridge_contigs/polished_contigs.fasta assemblies/assembly_"$i".fasta
rm -r necat config.txt read_list.txt
done
for i in 05 11 17 23; do
echo read_subsets/sample_"$i".fastq > input.fofn
cp "$nextdenovo_dir"/doc/run.cfg nextdenovo_run.cfg
sed -i "s/genome_size = 1g/genome_size = "$genome_size"/" nextdenovo_run.cfg
sed -i "s/parallel_jobs = 20/parallel_jobs = 1/" nextdenovo_run.cfg
sed -i "s/read_type = clr/read_type = ont/" nextdenovo_run.cfg
sed -i "s/pa_correction = 3/pa_correction = 1/" nextdenovo_run.cfg
sed -i "s/correction_options = -p 15/correction_options = -p "$threads"/" nextdenovo_run.cfg
sed -i "s/-t 8/-t "$threads"/" nextdenovo_run.cfg
"$nextdenovo_dir"/nextDenovo nextdenovo_run.cfg
cp 01_rundir/03.ctg_graph/nd.asm.fasta nextdenovo_temp.fasta
rm -r 01_rundir nextdenovo_run.cfg input.fofn
echo read_subsets/sample_"$i".fastq > lgs.fofn
cat "$nextpolish_dir"/doc/run.cfg | grep -v "sgs" | grep -v "hifi" > nextpolish_run.cfg
sed -i "s/parallel_jobs = 6/parallel_jobs = 1/" nextpolish_run.cfg
sed -i "s/multithread_jobs = 5/multithread_jobs = "$threads"/" nextpolish_run.cfg
sed -i "s|genome = ./raw.genome.fasta|genome = nextdenovo_temp.fasta|" nextpolish_run.cfg
sed -i "s|-x map-ont|-x map-ont -t "$threads"|" nextpolish_run.cfg
"$nextpolish_dir"/nextPolish nextpolish_run.cfg
cp 01_rundir/genome.nextpolish.fasta assemblies/assembly_"$i".fasta
rm -r 01_rundir pid*.log.info nextpolish_run.cfg lgs.fofn nextdenovo_temp.fasta
done
for i in 06 12 18 24; do
raven --threads "$threads" --disable-checkpoints --graphical-fragment-assembly assemblies/assembly_"$i".gfa read_subsets/sample_"$i".fastq > assemblies/assembly_"$i".fasta
done
Some notes:
- Canu, NECAT and NextDenovo require a genome size estimate to run.
- Canu often produces large amounts of overlap in circular contigs, but its contig headers indicate the necessary trimming to remove this overlap. The
canu_trim.py
script (in Trycycler's repo) will conduct this trimming and is used in the assembly code above. - NextDenovo/NextPolish also produces large amounts of overlap in circular contigs, but unlike Canu, it doesn't indicate how to trim. You will therefore likely have to Manually fix contig overlap during the Reconciling contigs step.
- Canu does a good job with assembly, but it is usually slower than other assemblers, even with its
-fast
option. Be prepared to wait a few hours. - NECAT and NextDenovo/NextPolish rely on config files, making them cumbersome to run. Sorry for all the lines in the assembly code above!
- La Jolla Assembler (LJA) might be worth a try as well if you are assembling HiFi reads. I haven't tried it myself, but it was recommended by Thomas Roder.
- Home
- Software requirements
- Installation
-
How to run Trycycler
- Quick start
- Step 1: Generating assemblies
- Step 2: Clustering contigs
- Step 3: Reconciling contigs
- Step 4: Multiple sequence alignment
- Step 5: Partitioning reads
- Step 6: Generating a consensus
- Step 7: Polishing after Trycycler
- Illustrated pipeline overview
- Demo datasets
- Implementation details
- FAQ and miscellaneous tips
- Other pages
- Guide to bacterial genome assembly (choose your own adventure)
- Accuracy vs depth