Be sure to launch your VM with -X option, so we can visualize assembly graphs with Bandage.
ssh -X -A ubuntu@xxx.xxx.xxx.xxx
If this is not the case yet remember to activate the correct conda env:
conda activate LongReads
We are going to work in the following directory:
mkdir -p ~/data/mydatalocal/LongReads
cd ~/data/mydatalocal/LongReads
Copy the pre-runs folder in your working directory:
cp -r ~/repos/Ebame/tmp/preruns ~/data/mydatalocal/LongReads
There are 3 Zymo mock samples, available at:
ls -lh ~/data/public/teachdata/ebame/metagenomics-assembly/
As previously we can use the command seqkit stats to assess these samples. From the stats, try to guess which one is the Hifi, ONT_R9 and ONT_R10.
Solution
seqkit stats --all ~/data/public/teachdata/ebame/metagenomics-assembly/SRR13128014_subreads.fastq.gz
Pre-runs for seqkit are located here: ~/data/mydatalocal/LongReads/preruns/datasets/
Mostly 3 software have been developed for assembling long-read metagenomic datasets.
In this tutorial, we are going to run metaflye and metaMDBG separatly, then use a method to merge their results.
As usual, try to craft your own command line to run the software.
Let's run metaflye first.
Use the following commands to see usage information for metaflye:
flye -h
Solution
ONT:
flye --nano-hq ~/data/public/teachdata/ebame/metagenomics-assembly/SRR17913199_1.fastq.gz --out-dir ~/data/mydatalocal/LongReads/metaflye_asm_ont --threads 4 --meta
Hifi:
flye --pacbio-hifi ~/data/public/teachdata/ebame/metagenomics-assembly/SRR13128014_subreads.fastq.gz --out-dir ~/data/mydatalocal/LongReads/metaflye_asm_hifi --threads 4 --meta
Assembly takes a lot of time, so instead lets comment on the pre-run version.
#Metaflye results on ONT
ls ~/data/mydatalocal/LongReads/preruns/assembly/SRR17913199_ONT_Q20/metaflye/
#Metaflye results on HiFi
ls ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaflye/
--> look at assembly statistics
--> look at actual size of longer contigs
Use Bandage to compare the Zymo ONT and HiFi assemblies.
#Decompress gfa file first, then run Bandage
gzip -d ~/data/mydatalocal/LongReads/preruns/assembly/SRR17913199_ONT_Q20/metaflye/assembly_graph.gfa.gz
gzip -d ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaflye/assembly_graph.gfa.gz
#ONT graph
Bandage load ~/data/mydatalocal/LongReads/preruns/assembly/SRR17913199_ONT_Q20/metaflye/assembly_graph.gfa
#Hifi graph
Bandage load ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaflye/assembly_graph.gfa
If you have issues with Bandage
Did you use -X or -Y when connecting to the VM? If not, please disconect and retype ssh with that flag:
ssh -X ubuntu@xxx.xxx.xxx.xxx
If you have Bandage on your laptop, use the scp command to download the gfa file on your laptop:
scp ubuntu@xxx.xxx.xxx.xxx:~/data/mydatalocal/HiFi/prerun_asm/asm.p_ctg.gfa .
This will copy the file to the directory you executed that command from. Also to be clear this command should not be run on the vm. This is a command for your laptop to request that file from the distant server. So it should be run on a terminal before you connect to the vm.
Try and follow explanation on how to forward display from this google doc: https://docs.google.com/document/d/1VPnL-5mXXQimkXQNiQagPhgzRn8j1JBHCLV42r8-Wqc/edit#
This Zymo mock community contains 5 ecoli strains, let's try to find them using the blast feature.
- Click "Create/view Blast search" button
- Click "Build Blast database"
- Click "Load from fasta file" -> Select a reference genome in ~/data/mydatalocal/LongReads/references/
- Click "Run blast search"
- Recommanded: try to tune the blast filter (alignment length and identity)
Now, let's try to run metaMDBG on the zymo mock communities.
metaMDBG asm -h
Solution
Hifi:
metaMDBG asm --in-hifi ~/data/public/teachdata/ebame/metagenomics-assembly/SRR13128014_subreads.fastq.gz --out-dir ~/data/mydatalocal/LongReads/metaMDBG_asm_hifi --threads 4
ONT:
metaMDBG asm --in-ont ~/data/public/teachdata/ebame/metagenomics-assembly/SRR17913199_1.fastq.gz --out-dir ~/data/mydatalocal/LongReads/metaMDBG_asm_ont --threads 4
Let's wait for metaMDBG to finish a few multi-k iterations.
With the command "metaMDBG gfa", we can generate the assembly graph corresponding to each multi-k iteration. Lower-k graph will have more connectivity, while higher-k graph will have more repeats solved but also more fragmentation. It is interesting to work on lower-k graph if you have external source of data that could solve long repeat (for instance, HiC, ultra long reads, binning metrics).
Let' try to generate an assembly graph with a low k value. The following command shows the available values for k and their corresponding size in bps.
metaMDBG gfa --assembly-dir ~/data/mydatalocal/LongReads/metaMDBG_asm_ont --k 0
Choose a value for k and wait for metaMDBG to generate the graph.
Solution
metaMDBG gfa ~/data/mydatalocal/LongReads/metaMDBG_asm_ont/ --k 10
Visualize the assembly graph with Bandage
Solution
Bandage load ~/data/mydatalocal/LongReads/metaMDBG_asm_ont/assemblyGraph_k10_1813bps.gfa
Now let's check the final metaMDBG assembly results:
#Show metaMDBG output files
ls -lh ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaMDBG/
Let's run Bandage and check how metaMDBG handles the ecoli strains:
#Decompress gfa file first, then run Bandage
gzip -d ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaMDBG/assemblyGraph_k105_20813bps.gfa.gz
Bandage load ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaMDBG/assemblyGraph_k105_20813bps.gfa
Let's focus here only on circular contigs.
For metaMDBG, we can check if a contig is circular by looking at the contig headers in the fasta files. If a header contains the field "circular=yes", it means that the contig is circular, otherwise it is linear. You can check this info with the following command:
#Show all headers
zcat ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaMDBG/contigs.fasta.gz | grep ">"
#Show header with the circular flag = "yes"
zcat ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaMDBG/contigs.fasta.gz | grep ">" | grep "circular=yes"
Metaflye uses an extra metadata file for contig information
#Show contig metadata
cat ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaflye/assembly_info.txt
Let's create folders for the circular contigs:
mkdir -p ~/data/mydatalocal/HiFi/circularContigs/
mkdir -p ~/data/mydatalocal/HiFi/circularContigs/metaflye/
mkdir -p ~/data/mydatalocal/HiFi/circularContigs/metaMDBG/
We use a custom script to extract all circular contigs:
#Extract metaflye circular contigs (HiFi)
python3 ~/repos/Ebame/scripts/extractCircularContigs.py ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaflye/assembly.fasta.gz ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaflye/assembly_info.txt metaflye ~/data/mydatalocal/LongReads/circularContigs/metaflye/
#Extract metaMDBG circular contigs (HiFi)
python3 ~/repos/Ebame/scripts/extractCircularContigs.py ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaMDBG/contigs.fasta.gz ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaMDBG/contigs.fasta.gz metaMDBG ~/data/mydatalocal/LongReads/circularContigs/metaMDBG/
List the extracted circular contigs:
ls -lh ~/data/mydatalocal/LongReads/circularContigs/metaflye/
ls -lh ~/data/mydatalocal/LongReads/circularContigs/metaMDBG/
We are now going to merge the results of metaMDBG and metaflye. The idea is to compute the similarity (ANI) between the circular contigs, and to choose only one representative if two contigs are duplicated (ANI > 0.95 by default).
For this task, we are going to use the software dRep. dRep has a lot of options, let's craft the command together, first display de-replicate options:
dRep dereplicate -h
- First choose a output folder
- dRep takes a list of genomes as input (option -g), try to provide all the fasta file with regex
- Metaflye circular contigs are here: ~/data/mydatalocal/LongReads/circularContigs/metaflye/
- MetaMDBG circular contigs are here: ~/data/mydatalocal/LongReads/circularContigs/metaMDBG/
- Disable quality check with option --ignoreGenomeQuality (we'll do it after dereplication)
- Disable plotting with --skip_plots
- Try to select the "skani" method for comparing the circular contigs quickly
Solution
dRep dereplicate ~/data/mydatalocal/LongReads/drep_circular/ -p 4 -g ~/data/mydatalocal/LongReads/circularContigs/metaflye/*.fa ~/data/mydatalocal/LongReads/circularContigs/metaMDBG/*.fa --S_algorithm skani --ignoreGenomeQuality --skip_plots
Read dRep output information and try to list the folder containing dereplicated contigs.
Solution
ls -lh ~/data/mydatalocal/LongReads/drep_circular/dereplicated_genomes/
dRep provides a lot of useful information, for instance, we can look at the similarity between the pair of circular contigs:
column -s, -t < ~/data/mydatalocal/LongReads/drep_circular/data_tables/Ndb.csv
Are there any circular contigs which are only found by one assembler?
column -s, -t < ~/data/mydatalocal/LongReads/drep_circular/data_tables/Cdb.csv
Clearly some of these are not genomes, let's run checkm on the dereplicated contigs: (by the way, sorry, there is a checkm2 version now which is way faster and user-friendly)
checkm lineage_wf ~/data/mydatalocal/LongReads/drep_circular/dereplicated_genomes/ ~/data/mydatalocal/LongReads/checkm_output/ -t 4 --pplacer_threads 4 -r -x .fa --tab_table -f ~/data/mydatalocal/LongReads/checkm_results.tsv
CheckM is a bit slow, so let's check the prerun results
cat ~/data/mydatalocal/LongReads/preruns/checkm_results.tsv
Print columns corresponding to completeness and contamination:
awk -F"\t" '{ print $1, "\t", $12, "\t", $13 }' ~/data/mydatalocal/LongReads/preruns/checkm_results.tsv
Some of the smaller circular contigs are likely to be plasmids or virus. Let's use genomad, a machine learning approach to verify this.
As usual, let's check the software usage information:
genomad -h
genomad end-to-end -h
Genomad takes as input a single fasta file. It will then process the contigs one by one and determine how likely they are to be plasmids or virus. Let's concatenate the small dereplicated circular contigs in a single fasta file.
Solution
#Concatenate all circular contigs in a single fasta file
cat ~/data/mydatalocal/LongReads/drep_circular/dereplicated_genomes/*.fa > ~/data/mydatalocal/LongReads/allCircularContigs.fasta
#Concatenante only small circular contigs
find ~/data/mydatalocal/LongReads/drep_circular/dereplicated_genomes/*.fa -size -500k | xargs cat > ~/data/mydatalocal/LongReads/allSmallCircularContigs.fasta
The genomad database is located here:
~/data/public/teachdata/ebame/viral-metagenomics/genomad_db/
Let's run try to run genomad now.
Solution
genomad end-to-end ~/data/mydatalocal/LongReads/allSmallCircularContigs.fasta ~/data/mydatalocal/LongReads/genomad/ ~/data/public/teachdata/ebame/viral-metagenomics/genomad_db/ --conservative --threads 4
Read genomad logs and try to print plasmids and virus summaries: (pre-runs are located here: ~/data/mydatalocal/LongReads/preruns/genomad/)
Solution
cat ~/data/mydatalocal/LongReads/preruns/genomad/allSmallCircularContigs_summary/allSmallCircularContigs_plasmid_summary.tsv
cat ~/data/mydatalocal/LongReads/preruns/genomad/allSmallCircularContigs_summary/allSmallCircularContigs_virus_summary.tsv
By the way, there is a software called checkv to assess the quality of viral contigs.
Retry to use checkm on a contigs you chose and saved with Bandage.
--> find a long contigs from Bandage. Then click on "Output" menu -> "Save selected node sequence to FASTA"
Or from the command line, use the following command line replacing <NODE>:
Bandage reduce ~/data/mydatalocal/LongReads/preruns/assembly/SRR13128014_hifi/metaflye/assembly_graph.gfa ~/data/mydatalocal/LongReads/<Node>.gfa --scope aroundnodes --nodes <NODE> --distance 0
--> use a bash online to extract, name and sequence from that gfa graph:
awk '/^S/{print ">"$2"\n"$3}' ~/data/mydatalocal/LongReads/<Node>.gfa > ~/data/mydatalocal/LongReads/<Node>.fasta
--> use checkm