Long-reads assembly tutorial

Be sure to launch your VM with -X option, so we can visualize assembly graphs with Bandage.

ssh -X -A ubuntu@xxx.xxx.xxx.xxx

If this is not the case yet remember to activate the correct conda env:

conda activate LongReads

We are going to work in the following directory:

mkdir -p ~/data/mydatalocal/LongReads
cd ~/data/mydatalocal/LongReads

Copy the pre-runs folder in your working directory:

cp -r ~/repos/Ebame/tmp/preruns ~/data/mydatalocal/LongReads

Dataset

There are 3 Zymo mock samples, available at:

ls -lh ~/data/public/teachdata/ebame/metagenomics-assembly/

As previously we can use the command seqkit stats to assess these samples. From the stats, try to guess which one is the Hifi, ONT_R9 and ONT_R10.

Solution

seqkit stats --all ~/data/public/teachdata/ebame/metagenomics-assembly/SRR13128014_subreads.fastq.gz

Pre-runs for seqkit are located here: ~/data/mydatalocal/LongReads/preruns/datasets/

Assembly

Mostly 3 software have been developed for assembling long-read metagenomic datasets.

metaFlye
hifiasm-meta
metaMDBG

In this tutorial, we are going to run metaflye and metaMDBG separatly, then use a method to merge their results.

As usual, try to craft your own command line to run the software.

Run metaflye

Let's run metaflye first.

Use the following commands to see usage information for metaflye:

flye -h

Solution

ONT:
flye --nano-hq ~/data/public/teachdata/ebame/metagenomics-assembly/SRR17913199_1.fastq.gz --out-dir ~/data/mydatalocal/LongReads/metaflye_asm_ont --threads 4 --meta

Hifi:
flye --pacbio-hifi ~/data/public/teachdata/ebame/metagenomics-assembly/SRR13128014_subreads.fastq.gz --out-dir ~/data/mydatalocal/LongReads/metaflye_asm_hifi --threads 4 --meta

Assembly takes a lot of time, so instead lets comment on the pre-run version.

#Metaflye results on ONT
ls ~/data/mydatalocal/LongReads/preruns/assembly/SRR17913199_ONT_Q20/metaflye/

#Metaflye results on HiFi
ls ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaflye/

--> look at assembly statistics

--> look at actual size of longer contigs

Use Bandage to compare the Zymo ONT and HiFi assemblies.

#Decompress gfa file first, then run Bandage
gzip -d ~/data/mydatalocal/LongReads/preruns/assembly/SRR17913199_ONT_Q20/metaflye/assembly_graph.gfa.gz
gzip -d ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaflye/assembly_graph.gfa.gz

#ONT graph
Bandage load ~/data/mydatalocal/LongReads/preruns/assembly/SRR17913199_ONT_Q20/metaflye/assembly_graph.gfa

#Hifi graph
Bandage load ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaflye/assembly_graph.gfa

If you have issues with Bandage

Fix1

Did you use -X or -Y when connecting to the VM? If not, please disconect and retype ssh with that flag:

ssh -X ubuntu@xxx.xxx.xxx.xxx

Fix2

If you have Bandage on your laptop, use the scp command to download the gfa file on your laptop:

scp ubuntu@xxx.xxx.xxx.xxx:~/data/mydatalocal/HiFi/prerun_asm/asm.p_ctg.gfa	.

This will copy the file to the directory you executed that command from. Also to be clear this command should not be run on the vm. This is a command for your laptop to request that file from the distant server. So it should be run on a terminal before you connect to the vm.

Fix3

Try and follow explanation on how to forward display from this google doc: https://docs.google.com/document/d/1VPnL-5mXXQimkXQNiQagPhgzRn8j1JBHCLV42r8-Wqc/edit#

This Zymo mock community contains 5 ecoli strains, let's try to find them using the blast feature.

Click "Create/view Blast search" button
Click "Build Blast database"
Click "Load from fasta file" -> Select a reference genome in ~/data/mydatalocal/LongReads/references/
Click "Run blast search"
Recommanded: try to tune the blast filter (alignment length and identity)

Run metaMDBG

Now, let's try to run metaMDBG on the zymo mock communities.

metaMDBG asm -h

Solution

Hifi:
metaMDBG asm --in-hifi ~/data/public/teachdata/ebame/metagenomics-assembly/SRR13128014_subreads.fastq.gz --out-dir ~/data/mydatalocal/LongReads/metaMDBG_asm_hifi --threads 4

ONT:
metaMDBG asm --in-ont ~/data/public/teachdata/ebame/metagenomics-assembly/SRR17913199_1.fastq.gz --out-dir ~/data/mydatalocal/LongReads/metaMDBG_asm_ont --threads 4

Let's wait for metaMDBG to finish a few multi-k iterations.

With the command "metaMDBG gfa", we can generate the assembly graph corresponding to each multi-k iteration. Lower-k graph will have more connectivity, while higher-k graph will have more repeats solved but also more fragmentation. It is interesting to work on lower-k graph if you have external source of data that could solve long repeat (for instance, HiC, ultra long reads, binning metrics).

Let' try to generate an assembly graph with a low k value. The following command shows the available values for k and their corresponding size in bps.

metaMDBG gfa --assembly-dir ~/data/mydatalocal/LongReads/metaMDBG_asm_ont --k 0

Choose a value for k and wait for metaMDBG to generate the graph.

Solution

metaMDBG gfa ~/data/mydatalocal/LongReads/metaMDBG_asm_ont/ --k 10

Visualize the assembly graph with Bandage

Solution

Bandage load ~/data/mydatalocal/LongReads/metaMDBG_asm_ont/assemblyGraph_k10_1813bps.gfa

Now let's check the final metaMDBG assembly results:

#Show metaMDBG output files
ls -lh ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaMDBG/

Let's run Bandage and check how metaMDBG handles the ecoli strains:

#Decompress gfa file first, then run Bandage
gzip -d ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaMDBG/assemblyGraph_k105_20813bps.gfa.gz
Bandage load ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaMDBG/assemblyGraph_k105_20813bps.gfa

MAGs?

Let's focus here only on circular contigs.

Extract circular contigs

For metaMDBG, we can check if a contig is circular by looking at the contig headers in the fasta files. If a header contains the field "circular=yes", it means that the contig is circular, otherwise it is linear. You can check this info with the following command:

#Show all headers
zcat ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaMDBG/contigs.fasta.gz | grep ">"

#Show header with the circular flag = "yes"
zcat ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaMDBG/contigs.fasta.gz | grep ">" | grep "circular=yes"

Metaflye uses an extra metadata file for contig information

#Show contig metadata
cat ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaflye/assembly_info.txt

Let's create folders for the circular contigs:

mkdir -p ~/data/mydatalocal/HiFi/circularContigs/
mkdir -p ~/data/mydatalocal/HiFi/circularContigs/metaflye/
mkdir -p ~/data/mydatalocal/HiFi/circularContigs/metaMDBG/

We use a custom script to extract all circular contigs:

#Extract metaflye circular contigs (HiFi)
python3 ~/repos/Ebame/scripts/extractCircularContigs.py ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaflye/assembly.fasta.gz ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaflye/assembly_info.txt metaflye ~/data/mydatalocal/LongReads/circularContigs/metaflye/

#Extract metaMDBG circular contigs (HiFi)
python3 ~/repos/Ebame/scripts/extractCircularContigs.py ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaMDBG/contigs.fasta.gz ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaMDBG/contigs.fasta.gz metaMDBG ~/data/mydatalocal/LongReads/circularContigs/metaMDBG/

List the extracted circular contigs:

ls -lh ~/data/mydatalocal/LongReads/circularContigs/metaflye/
ls -lh ~/data/mydatalocal/LongReads/circularContigs/metaMDBG/

Assembly reconciliation

We are now going to merge the results of metaMDBG and metaflye. The idea is to compute the similarity (ANI) between the circular contigs, and to choose only one representative if two contigs are duplicated (ANI > 0.95 by default).

For this task, we are going to use the software dRep. dRep has a lot of options, let's craft the command together, first display de-replicate options:

dRep dereplicate -h

First choose a output folder
dRep takes a list of genomes as input (option -g), try to provide all the fasta file with regex
- Metaflye circular contigs are here: ~/data/mydatalocal/LongReads/circularContigs/metaflye/
- MetaMDBG circular contigs are here: ~/data/mydatalocal/LongReads/circularContigs/metaMDBG/
Disable quality check with option --ignoreGenomeQuality (we'll do it after dereplication)
Disable plotting with --skip_plots
Try to select the "skani" method for comparing the circular contigs quickly

Solution

dRep dereplicate ~/data/mydatalocal/LongReads/drep_circular/ -p 4 -g ~/data/mydatalocal/LongReads/circularContigs/metaflye/*.fa ~/data/mydatalocal/LongReads/circularContigs/metaMDBG/*.fa --S_algorithm skani --ignoreGenomeQuality --skip_plots

Read dRep output information and try to list the folder containing dereplicated contigs.

Solution

ls -lh ~/data/mydatalocal/LongReads/drep_circular/dereplicated_genomes/

dRep provides a lot of useful information, for instance, we can look at the similarity between the pair of circular contigs:

column -s, -t < ~/data/mydatalocal/LongReads/drep_circular/data_tables/Ndb.csv

Are there any circular contigs which are only found by one assembler?

column -s, -t < ~/data/mydatalocal/LongReads/drep_circular/data_tables/Cdb.csv

Assess quality of circular contigs

Clearly some of these are not genomes, let's run checkm on the dereplicated contigs: (by the way, sorry, there is a checkm2 version now which is way faster and user-friendly)

checkm lineage_wf  ~/data/mydatalocal/LongReads/drep_circular/dereplicated_genomes/ ~/data/mydatalocal/LongReads/checkm_output/ -t 4 --pplacer_threads 4  -r -x .fa --tab_table -f ~/data/mydatalocal/LongReads/checkm_results.tsv

CheckM is a bit slow, so let's check the prerun results

cat ~/data/mydatalocal/LongReads/preruns/checkm_results.tsv

Print columns corresponding to completeness and contamination:

awk -F"\t" '{ print $1, "\t", $12, "\t", $13 }' ~/data/mydatalocal/LongReads/preruns/checkm_results.tsv

Plasmids and virus?

Some of the smaller circular contigs are likely to be plasmids or virus. Let's use genomad, a machine learning approach to verify this.

As usual, let's check the software usage information:

genomad -h
genomad end-to-end -h

Genomad takes as input a single fasta file. It will then process the contigs one by one and determine how likely they are to be plasmids or virus. Let's concatenate the small dereplicated circular contigs in a single fasta file.

Solution

#Concatenate all circular contigs in a single fasta file
cat ~/data/mydatalocal/LongReads/drep_circular/dereplicated_genomes/*.fa > ~/data/mydatalocal/LongReads/allCircularContigs.fasta

#Concatenante only small circular contigs
find ~/data/mydatalocal/LongReads/drep_circular/dereplicated_genomes/*.fa -size -500k | xargs cat > ~/data/mydatalocal/LongReads/allSmallCircularContigs.fasta

The genomad database is located here:

~/data/public/teachdata/ebame/viral-metagenomics/genomad_db/

Let's run try to run genomad now.

Solution

genomad end-to-end ~/data/mydatalocal/LongReads/allSmallCircularContigs.fasta ~/data/mydatalocal/LongReads/genomad/ ~/data/public/teachdata/ebame/viral-metagenomics/genomad_db/ --conservative --threads 4

Read genomad logs and try to print plasmids and virus summaries: (pre-runs are located here: ~/data/mydatalocal/LongReads/preruns/genomad/)

Solution

cat ~/data/mydatalocal/LongReads/preruns/genomad/allSmallCircularContigs_summary/allSmallCircularContigs_plasmid_summary.tsv
cat ~/data/mydatalocal/LongReads/preruns/genomad/allSmallCircularContigs_summary/allSmallCircularContigs_virus_summary.tsv

By the way, there is a software called checkv to assess the quality of viral contigs.

Other contigs

But wait what if I want to look at one of the non circular contigs?

Retry to use checkm on a contigs you chose and saved with Bandage.

--> find a long contigs from Bandage. Then click on "Output" menu -> "Save selected node sequence to FASTA"

Or from the command line, use the following command line replacing <NODE>:

Bandage reduce ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaflye/assembly_graph.gfa ~/data/mydatalocal/LongReads/<Node>.gfa  --scope aroundnodes --nodes <NODE> --distance 0

--> use a bash online to extract, name and sequence from that gfa graph:

awk '/^S/{print ">"$2"\n"$3}' ~/data/mydatalocal/LongReads/<Node>.gfa > ~/data/mydatalocal/LongReads/<Node>.fasta

--> use checkm

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LongRead2024.md

LongRead2024.md

Long-reads assembly tutorial

Dataset

Assembly

Run metaflye

Fix1

Fix2

Fix3

Run metaMDBG

MAGs?

Extract circular contigs

Assembly reconciliation

Assess quality of circular contigs

Plasmids and virus?

Other contigs

But wait what if I want to look at one of the non circular contigs?

Files

LongRead2024.md

Latest commit

History

LongRead2024.md

File metadata and controls

Long-reads assembly tutorial

Dataset

Assembly

Run metaflye

Fix1

Fix2

Fix3

Run metaMDBG

MAGs?

Extract circular contigs

Assembly reconciliation

Assess quality of circular contigs

Plasmids and virus?

Other contigs

But wait what if I want to look at one of the non circular contigs?