This repository contains a Snakemake pipeline for the assembly, scaffolding, analysis, annotation and quality assessment of HiFi-based assemblies. Although developed for a project in lettuce, the pipeline is designed to work with any organism. The pipeline will work with both HiFi and ONT data, although only the former is required.
- Downloading the pipeline
- Installing dependencies
- Required databases
- Configuring the pipeline
- Running the pipeline
- Output
- Explaining the pipeline
- Citation
- FAQ
The pipeline can be obtained via:
git clone https://github.com/dirkjanvw/MoGAAAP.git
cd MoGAAAP/
Should you notice that bugs have been fixed on GitHub or a new feature implemented in the pipeline, updating the pipeline is as simple as running the following in the MoGAAAP/
directory:
git pull
The pipeline will work on any Linux system where conda
/mamba
and singularity
/apptainer
are installed.
If not installed already, conda
/mamba
can be installed by following these instructions:
# install miniforge
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh
# follow instructions and let it run `conda init`
# set default channels
source ~/.bashrc
conda config --set auto_activate_base false
source ~/.bashrc
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
This installation process requires root access, but typically a server admin can install it for you if it is not already installed. If not installed already, Singularity/Apptainer can be installed by following the instructions on their website: apptainer.org. Make sure to install either Singularity version 3.7 or higher or Apptainer version 1.0 or higher.
NB: For running the pipeline, no root access or special permissions are required.
However, the pipeline needs some SIF files that are not included in the repository.
These need to be built using the provided DEF files, which requires sudo
permissions.
Building these SIF files can be done locally on a personal laptop (or on a server if you happen to have sudo
permissions).
Navigate to the workflow/singularity
directory and run the following command for each DEF file (file ending in .def
):
sudo singularity build ${NAME}.sif ${NAME}.def
NB: For Singularity/Apptainer to work properly, some environment variables need to be set.
The following ones are required to be set in your .profile
, .bashrc
or .bash_profile
:
SINGULARITY_BIND
/APPTAINER_BIND
: To bind the paths inside the container to the paths on your system; make sure all relevant paths are included (working directory, database directory, etc.).SINGULARITY_NV
/APPTAINER_NV
: To use the GPU inside the container; only required if you have a GPU.
It is also recommended to set the following environment variable:
SINGULARITY_CACHEDIR
/APPTAINER_CACHEDIR
: To store the cache of the container outside of your home directory.
Snakemake can be installed using conda
/mamba
:
mamba create -c conda-forge -c bioconda -n snakemake snakemake=8
Then activate the environment before running the pipeline:
conda activate snakemake
Next to Snakemake, conda
/mamba
and singularity
/apptainer
, this pipeline depends on the existence of a number of databases.
Database | Download instructions |
---|---|
Helixer model | Helixer GitHub |
GXDB database | Follow "Download the database" instructions on FCS GitHub wiki (I only tested the Cloud instructions) |
Kraken2 nt database | Download nt from this list |
OMA database | Download LUCA.h5 from this list |
All configuration of the pipeline is done in the config/config.yaml
file, and samples are registered in the config/samples.tsv
file.
All fields to fill in are well-documented in the provided config/config.yaml
file and should be self-explanatory.
The config/samples.tsv
has the following columns to fill in (one row per sample):
Column name | Description |
---|---|
accessionId |
The accession ID of the sample. This name has to be unique. |
hifi |
The path to the HiFi reads in FASTQ or FASTA format. Multiple libraries can be provided by separating them with a semicolon. |
ont |
OPTIONAL. The path to the ONT reads in FASTQ or FASTA format. Multiple libraries can be provided by separating them with a semicolon. |
illumina_1 |
OPTIONAL. The path to the forward Illumina reads in FASTQ format. |
illumina_2 |
OPTIONAL. The path to the reverse Illumina reads in FASTQ format. |
hic_1 |
OPTIONAL. The path to the forward Hi-C reads in FASTQ format. |
hic_2 |
OPTIONAL. The path to the reverse Hi-C reads in FASTQ format. |
haplotypes |
The expected number of haplotypes in the assembly. Use 1 for (near) homozygous accessions and 2 for heterozygous accessions. NB: currently only 1 or 2 is supported. |
speciesName |
A name for the species that is used by Helixer to name novel genes. |
taxId |
The NCBI taxonomy ID of the species. |
referenceId |
A unique identifier for the reference genome for which genome (FASTA), annotation (GFF3) and chromosome names are provided in the config/config.yaml file. |
Both config/config.yaml
and config/samples.tsv
files validated against a built-in schema that throws an error if the files are not correctly filled in.
Several modules are available in this pipeline (will be referred to later as ${MODULE}
):
assemble
: This module will only assembly the reads into contigs.scaffold
: This module will scaffold the contigs usingntJoin
against a provided reference genome.analyse
: This module will analyse the assembly for provided genes, sequences and contamination.annotate
: This module will generate a quick-and-dirty annotation of the assembly usingliftoff
andhelixer
.qa
: This module will perform quality assessment of the scaffolded assembly and the quick-and-dirty annotation.all
: This module will run all the above modules.
It is advisable to run the pipeline module by module for a new set of assemblies and critically look at the results of each module before continuing.
All modules except for annotate
have visual output that can be inspected in an HTML report file (see at Reporting).
For more information about these modules, see Explaining the pipeline.
Several important snakemake
parameters are important when running this pipeline, but most have already been set by default (in workflow/profiles/default/config.yaml
).
Parameter | Optionality | Description |
---|---|---|
-n |
Optional | Do a dry-run of the pipeline. |
-p |
Set by default to True | Print the shell commands that are being executed. |
-c /--cores |
Set by default to all | Number of CPUs to use. |
--use-conda |
Set by default to True | Use conda /mamba to manage dependencies. |
--conda-prefix |
Optional | Path where the conda environments will be stored. |
--use-singularity |
Set by default to True | Use singularity /apptainer to manage containers. |
--singularity-prefix |
Optional | Path where the singularity images will be stored. |
--resources |
Set by default | Information about system resources; see below at Resources. |
The following resources (apart from CPUs) might be heavily used by the pipeline:
gbmem
: The amount of memory in GB that RAM-heavy jobs in the pipeline can use. It is recommended to keep this on the lower side as only some jobs of the pipeline use this (to keep RAM for other jobs), but at least 500 GB is required. Default: 1000 (GB).helixer
: The number of Helixer jobs that can run at the same time. It is recommended to always keep this at 1 (small server), 2 (large server) or the number of GPUs divided by 2 (GPU server). Default: 1.pantools
: The number of PanTools jobs that can run at the same time. It is recommended to always keep this at 1, to prevent file collisions. Default: 1.
As first step, it is always good to do a dry-run to check if everything is set up correctly:
snakemake ${MODULE} -n
If everything is alright, the pipeline can be run:
snakemake ${MODULE}
The pipeline can generate an HTML report.html
file with the most important results:
snakemake ${MODULE} --report report.html
Next to the report generated by Snakemake, the most important outputs of the pipeline are the genome assembly and annotation.
These can be found in the directory final_output
.
(All temporary files are stored in the results
directory.)
The final_output
directory contains the following files:
File name | Description |
---|---|
${accessionId}.contigs.fa |
The contigs produced by the assemble module. |
${accessionId}.full.fa |
The scaffolded assembly produced by the scaffold module. |
${accessionId}.nuclear.fa |
The scaffolded assembly produced by the scaffold module, but only nuclear contigs as obtained from the analyse module. |
${accessionId}.${organelle}.fa |
The scaffolded assembly produced by the scaffold module, but only organellar contigs as obtained from the analyse module. |
${accessionId}.full.gff |
The quick-and-dirty annotation produced by the annotate module; belongs to ${accessionId}.full.fa . |
${accessionId}.full.coding.gff |
The quick-and-dirty annotation produced by the annotate module, but only coding genes; belongs to ${accessionId}.full.fa . |
Assembling a genome from raw data to a final usable resource is a process that is hard to automate. We believe that this process always necessesitates human curation. However, large parts can easily be automated, which is why we created this pipeline. This pipeline performs the assembly, scaffolding and renaming of genomic data as well as an initial quick-and-dirty structural annotation. Importantly, both genome assembly and annotation are subjected to quality assessment, providing a direct starting point for the curation of the assembly. Furthermore, each part of the process (module) can be run separately after which its output can be inspected before continuing to the next step.
graph TD;
assemble-->scaffold;
scaffold-->analyse;
scaffold-->annotate;
scaffold-->qa;
annotate-->qa;
In the assemble module, HiFi reads are assembled using hifiasm
.
If ONT reads are given, these are used in the assembly process using the --ul
parameter of hifiasm
.
Also, if Hi-C reads are given, these are used in the assembly process.
Since the output of hifiasm
is a GFA file, we next convert the (consensus) primary contigs GFA to a FASTA file.
In case the user has indicated that the accession is heterozygous, the two haplotype assemblies as outputted by hifiasm
are converted to FASTA files instead.
Finally, we produce an alignment of the (contig) assembly against the provided reference genome using nucmer
.
To prevent spurious alignments, we slightly increased the -l
and -g
parameter of nucmer
.
As alternative to hifiasm
, we also implemented verkko
as it is known to work well with HiFi (and ONT and Hi-C) data.
For heterozygous accessions, hapdup
is used to try separating the haplotypes based on HiFi alignment to the Verkko assembly.
Please bear in mind that the pipeline was developed with hifiasm
in mind, so although these other assemblers will technically work, the pipeline may not be optimally set up for them.
In some preliminary tests, we found that verkko
doesn't work well with heterozygous accessions, resulting in a partly phased assembly.
Since the next step after assembly is the scaffolding process, there has to be collinearity between the assembly and the reference genome.
This can be checked in the dotplot created from the nucmer
alignment.
If there is no sign of collinearity between the two, reference-guided scaffolding will be impossible.
The only solution in that case would be to choose another (more closely related) reference genome.
Scaffolding is performed using ntJoin
, which uses a minimizer-based reference-guided scaffolding method.
If by visual inspection collinearity between the assembly and reference genome was found, the scaffolding module generally runs without issues.
Should any error occur, please read the corresponding log file of the step that produced the error.
In most cases, the error may be resolved by choosing different values for the ntjoin_k
and ntjoin_w
in the config file.
In our experience, increasing the value for ntjoin_w
resolves most issues when no correct scaffolding is produced.
After scaffolding, the sequences in the scaffolded assembly are renamed to reflect their actual chromosome names according to the reference genome.
Finally, nucmer
is run again to produce an alignment plot for visual inspection of the scaffolding process.
If Hi-C reads were provided, the Hi-C contact map is also produced for visual inspection of the ntJoin
scaffolding process.
As the assembly as outputted by this module is used as starting point for the analyse, annotate and qa modules, it is crucial it matches the expectations in terms of size and chromosome number.
Please carefully look at the nucmer
alignment plot to check that the assembly looks as expected before continuing to a next module.
The purpose of the analyse module is to create a genome-wide overview of the newly created assembly.
It does this by running several user-defined queries (including organelle search) against the assembly using BLAST as well as a search for the telomere repeat sequence (please see this note in the FAQ for more information on telomere identification).
The results of these queries are visualised in an HTML report file.
Next to this, the user-defined queries are used to create a template circos
configuration and plot.
Typically, user-defined queries are used to either identify unwanted sequences (known contaminants or organelles) or to identify sequences that are expected to be present in the assembly (e.g. resistance genes). The resulting tables in the report may be used to filter out unwanted sequences from the assembly, and to check the quality of the assembly based on the expected sequences (such as making sure the telomere repeat sequence is present at the tails of the chromosomes).
Circos plots are notoriously hard to automate and that is no different for this pipeline.
Although a circos
plot should always be produced, it typically doesn't look quite right yet.
Feel free to copy the config files produced by this pipeline and adjust to your own plotting needs.
Proper structural genome annotation would take too long and is not a problem that is solved for automation yet.
Therefore, we implemented a "quick-and-dirty" annotation in this pipeline by combing the results of liftoff
and helixer
.
helixer
will run on the GPU if it's available, otherwise it will run on CPU (which is known to be a lot slower).
In case of overlap in features between liftoff
and helixer
, we take the liftoff
annotation.
Although this module generally runs for the longest time, no visual output is produced; only GFF3 file (one unfiltered and one with only coding genes) belonging to the scaffolded assembly. This GFF3 file, together with the FASTA file from the scaffold module are the only inputs for the final module: QA.
This final quality assessment module is the most important for human curation of the genome.
The quality assessment steps in this module can be roughly divided into two categories: individual and grouped.
Individual quality assessment steps include k-mer completeness (merqury
), k-mer contamination (kraken2
), NCBI contamination (fcs-gx
), adapter contamination (fcs-adaptor
) and read mapping (bwa-mem2
).
Grouped quality assessment steps include BUSCO completeness (busco
), OMA completeness (omark
), k-mer distances (kmer-db
), mash distances (mash
), minimizer collinearity (ntsynt
), k-mer phylogeny (SANS
), k-mer pangenome growth (pangrowth
), gene pangenome growth (pantools
) and general statistics.
These groups are meant to give a comparative overview of the assembly and annotation.
Any groups can be defined in the configuration file and a genome may occur in multiple groups.
The report (see Reporting) produced by this module is the most useful output of the pipeline for human curation. It contains visual output for each of the quality assessment steps performed in this module including a description on how to interpret the results. It also calculates various statistics that are reported on in the report. Importantly, the qa module does not do any filtering of the assembly or annotation, only reporting. Next steps could include (but are not limited to) removal of contaminants, discovery of sample swaps, subsetting the input data, etc.
If you use MoGAAAP in your work, please cite this work as:
in prep.
A: This issue typically arises when the assembly and reference genome are not collinear because of an evolutionary distance that is too large.
In this case, the pipeline is not able to accurately discern which reference chromosome corresponds to which assembly scaffold.
This lack of collinearity should also be visible in the MUMmerplots created by the assemble
module.
The only solution in this case would be to choose another (more closely related) reference genome and re-run the assemble
module to check if this new reference genome is collinear with the assembly before continuing with the scaffold
module.
A: This issue can have multiple causes, but the most common one is that the ntJoin
parameters are not set correctly.
From our own experience, increasing the value for ntjoin_w
resolves most issues when no correct scaffolding is produced.
Also, it is important to keep in mind that this pipeline is not meant to create a perfect assembly, but to provide a starting point for human curation.
So feel free to adjust the pipeline or the assembly to your own needs!
A: This is intended Snakemake behaviour: if any job fails, the pipeline will only finish current running jobs and then exit. As for the reason of stopping, please check the log file of the job that failed. The name of the log file will be printed in the terminal output of the pipeline. If the error is not clear, please open an issue on this GitHub page.
A: Make sure that all SIF containers are built (see Singularity/Apptainer), as all dependencies are included in either a conda
environment or a singularity
container.
If the issue is with a conda
environment, please report it as an issue on this GitHub page.
A: This is likely due to missing environment variables for Singularity/Apptainer. See Singularity/Apptainer for more information on which environment variables need to be set.
A: This is a known issue of the Snakemake report HTML. The current workaround is to run:
sed -E 's/([^l]) h-screen/\1/g' report.html > report_fixed.html
A: While seqtk
is more accurate in the boundaries of the telomere search, it cannot identify telomeres that are not at the ends of the chromosomes.
Therefore, we recommend to also run BLASTN with a fasta file containing 100x the telomere repeat sequence for identification of telomeres that are not at the ends of the chromosomes.
If the above information does not answer your question or solve your issue, feel free to open an issue on this GitHub page.