EGAP (Entheome Genome Assembly Pipeline) is a versatile bioinformatics pipeline for producing high-quality hybrid genome assemblies from Oxford Nanopore Technologies (ONT) and Illumina sequencing data. It also supports de novo and reference-based assemblies from Illumina data alone. The pipeline covers read quality control, trimming, genome assembly, polishing, and scaffolding. Although optimized for fungal genomes, EGAP can be customized for other organisms.
- Overview
- Installation
- Pipeline Flow
- Command-Line Usage
- CSV Generation
- Example Data & Instructions
- Future Improvements
- References
The setup script (`EGAP_setup.sh`) ensures that Python 3.8 and the required libraries are installed. The pipeline depends on a variety of bioinformatics tools, including but not limited to:
- Trimmomatic
- BBMap
- FastQC
- NanoPlot
- Filtlong
- Ratatosk
- MaSuRCA
- Racon
- Burrows-Wheeler Aligner
- SamTools
- BamTools
- Pilon
- purge_dupes
- RagTag
- TGS-GapCloser
- ABYSS-Sealer
- QUAST
- CompleAsm
- Merqury
You can install the prerequisites using the setup script:
bash /path/to/EGAP_setup.sh
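After the setup script finishes, it can be useful to confirm that the tools are reachable on your `PATH`. The following is a minimal sketch using Python's standard `shutil.which`. The executable names are assumptions for a handful of the tools listed above and are not taken from `EGAP_setup.sh`; some tools (for example Trimmomatic and Pilon) are often distributed as Java JARs and would need a different check.

```python
import shutil

# Assumed executable names for a few of the tools listed above; adjust them
# to match how each tool is installed on your system.
ASSUMED_EXECUTABLES = ["fastqc", "NanoPlot", "filtlong", "racon", "bwa", "samtools"]


def find_missing_tools(executables):
    """Return the names in `executables` that cannot be found on PATH."""
    return [name for name in executables if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools(ASSUMED_EXECUTABLES)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All checked tools were found on PATH.")
```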
- `--input_csv`, `-csv` (str): Path to a CSV containing multiple sample data. (default = None)
- `--raw_ont_dir`, `-odir` (str): Path to a directory containing all Raw ONT Reads. (if `-csv` = None; else REQUIRED)
- `--raw_ont_reads`, `-i0` (str): Path to the combined Raw ONT FASTQ reads. (if `-csv` = None; else REQUIRED)
- `--raw_illu_dir`, `-idir` (str): Path to a directory containing all Raw Illumina Reads. (if `-csv` = None; else REQUIRED)
- `--raw_illu_reads_1`, `-i1` (str): Path to the Raw Forward Illumina Reads. (if `-csv` = None; else REQUIRED)
- `--raw_illu_reads_2`, `-i2` (str): Path to the Raw Reverse Illumina Reads. (if `-csv` = None; else REQUIRED)
- `--species_id`, `-ID` (str): Species ID formatted as `<2-letters of Genus>_<full species name>`. (if `-csv` = None; else REQUIRED)
- `--organism_kingdom`, `-Kg` (str): Kingdom the current organism data belongs to. (default: Funga)
- `--organism_karyote`, `-Ka` (str): Karyote type of the organism. (default: Eukaryote)
- `--compleasm_1`, `-c1` (str): Name of the first organism compleasm/BUSCO database to compare to. (default: basidiomycota)
- `--compleasm_2`, `-c2` (str): Name of the second organism compleasm/BUSCO database to compare to. (default: agaricales)
- `--est_size`, `-es` (str): Estimated size of the genome in Mbp (million base pairs). (default: 60m)
- `--ref_seq`, `-rf` (str): Path to the reference genome for assembly. (default: None)
- `--percent_resources`, `-R` (float): Fraction of the available resources to use for processing, e.g. 0.8 for 80%. (default: 1.00)
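For readers who want to see how the options above fit together, here is a minimal `argparse` sketch that mirrors the documented flags, types, and defaults. It is an illustration written from the list above, not EGAP's actual parser; the function name `build_parser` is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options documented above (illustrative only)."""
    p = argparse.ArgumentParser(description="EGAP argument sketch")
    p.add_argument("--input_csv", "-csv", type=str, default=None)
    p.add_argument("--raw_ont_dir", "-odir", type=str, default=None)
    p.add_argument("--raw_ont_reads", "-i0", type=str, default=None)
    p.add_argument("--raw_illu_dir", "-idir", type=str, default=None)
    p.add_argument("--raw_illu_reads_1", "-i1", type=str, default=None)
    p.add_argument("--raw_illu_reads_2", "-i2", type=str, default=None)
    p.add_argument("--species_id", "-ID", type=str, default=None)
    p.add_argument("--organism_kingdom", "-Kg", type=str, default="Funga")
    p.add_argument("--organism_karyote", "-Ka", type=str, default="Eukaryote")
    p.add_argument("--compleasm_1", "-c1", type=str, default="basidiomycota")
    p.add_argument("--compleasm_2", "-c2", type=str, default="agaricales")
    p.add_argument("--est_size", "-es", type=str, default="60m")
    p.add_argument("--ref_seq", "-rf", type=str, default=None)
    p.add_argument("--percent_resources", "-R", type=float, default=1.00)
    return p


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

Long and short forms are interchangeable in this sketch, so `-i0 reads.fq.gz` and `--raw_ont_reads reads.fq.gz` set the same value.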
python /path/to/EGAP.py --raw_ont_reads /path/to/ont_reads.fq.gz \
--raw_illu_dir /path/to/illumina_reads/ \
--species_id AB_speciesname \
--organism_kingdom Funga \
--organism_karyote Eukaryote \
--compleasm_1 basidiomycota \
--compleasm_2 agaricales \
--est_size 60m \
--percent_resources 0.8
Alternatively, using a CSV file for multiple samples:
python /path/to/EGAP.py --input_csv /path/to/samples.csv
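The `--percent_resources 0.8` value in the first command is a fraction of the machine's resources. One plausible way such a fraction could be translated into a thread count is sketched below. This is an assumption for illustration and does not describe EGAP's internal logic; `threads_from_fraction` is a hypothetical helper.

```python
import os


def threads_from_fraction(fraction, total_cpus=None):
    """Scale the CPU count by `fraction`, never returning fewer than 1 thread."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the interval (0, 1]")
    if total_cpus is None:
        total_cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, int(total_cpus * fraction))


if __name__ == "__main__":
    print("Threads at 80% of this machine:", threads_from_fraction(0.8))
```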
To run EGAP on multiple samples, provide a CSV file containing the necessary information for each sample. The CSV file should have the following header and columns:
ONT_RAW_DIR | ONT_RAW_READS | ILLUMINA_RAW_DIR | ILLUMINA_RAW_F_READS | ILLUMINA_RAW_R_READS | SPECIES_ID | ORGANISM_KINGDOM | ORGANISM_KARYOTE | COMPLEASM_1 | COMPLEASM_2 | EST_SIZE | REF_SEQ |
---|---|---|---|---|---|---|---|---|---|---|---|
None | /path/to/ONT/sample1.fq.gz | None | /path/to/Illumina/sample1_R1.fq.gz | /path/to/Illumina/sample1_R2.fq.gz | AB_sample1 | Funga | Eukaryote | basidiomycota | agaricales | 60m | /path/to/ref_genome1.fasta |
/path/to/ONT | None | /path/to/Illumina | None | None | AB_sample2 | Funga | Eukaryote | basidiomycota | agaricales | 55m | /path/to/ref_genome2.fasta |
- ONT_RAW_DIR: Path to the directory containing all Raw ONT Reads. Use `None` if specifying individual read files.
- ONT_RAW_READS: Path to the combined Raw ONT FASTQ reads (e.g., `/path/to/ONT/sample1.fq.gz`).
- ILLUMINA_RAW_DIR: Path to the directory containing all Raw Illumina Reads. Use `None` if specifying individual read files.
- ILLUMINA_RAW_F_READS: Path to the Raw Forward Illumina Reads (e.g., `/path/to/Illumina/sample1_R1.fq.gz`).
- ILLUMINA_RAW_R_READS: Path to the Raw Reverse Illumina Reads (e.g., `/path/to/Illumina/sample1_R2.fq.gz`).
- SPECIES_ID: Species ID formatted as `<2-letters of Genus>_<full species name>` (e.g., `AB_sample1`).
- ORGANISM_KINGDOM: Kingdom the current organism data belongs to (default: `Funga`).
- ORGANISM_KARYOTE: Karyote type of the organism (default: Eukaryote).
- COMPLEASM_1: Name of the first organism compleasm/BUSCO database to compare to (default: basidiomycota).
- COMPLEASM_2: Name of the second organism compleasm/BUSCO database to compare to (default: agaricales).
- EST_SIZE: Estimated size of the genome in Mbp (million base pairs) (e.g., `60m`).
- REF_SEQ: Path to the reference genome for assembly. Use `None` if not applicable.
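Two of these columns have a specific textual format: SPECIES_ID and EST_SIZE. The sketch below shows one way to check them before a run. The regular expression and the `k`/`g` suffixes are assumptions inferred from the examples in this document (only the `m` form appears here), and both helper names are hypothetical.

```python
import re

# Assumed pattern: two letters for the genus, an underscore, then the species name.
SPECIES_ID_PATTERN = re.compile(r"^[A-Za-z]{2}_[A-Za-z0-9]+$")

# Assumed multipliers for the size suffix; only "m" is documented above.
SIZE_MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}


def is_valid_species_id(species_id):
    """Return True if `species_id` looks like 'Ps_cubensis' or 'AB_sample1'."""
    return SPECIES_ID_PATTERN.match(species_id) is not None


def est_size_to_bp(est_size):
    """Convert a string such as '60m' into a number of base pairs."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([kmg])", est_size.strip().lower())
    if match is None:
        raise ValueError(f"Unrecognized size: {est_size!r}")
    number, suffix = match.groups()
    return int(float(number) * SIZE_MULTIPLIERS[suffix])
```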
ONT_RAW_DIR,ONT_RAW_READS,ILLUMINA_RAW_DIR,ILLUMINA_RAW_F_READS,ILLUMINA_RAW_R_READS,SPECIES_ID,ORGANISM_KINGDOM,ORGANISM_KARYOTE,COMPLEASM_1,COMPLEASM_2,EST_SIZE,REF_SEQ
None,/mnt/d/EGAP/EGAP_Processing/Ps_zapotecorum/ONT/SRR########.fastq.gz,None,/mnt/d/EGAP/EGAP_Processing/Ps_zapotecorum/Illumina/SRR########_1.fq.gz,/mnt/d/EGAP/EGAP_Processing/Ps_zapotecorum/Illumina/SRR########_2.fq.gz,Ps_zapotecorum,Funga,Eukaryote,basidiomycota,agaricales,60m,None
None,/mnt/d/EGAP/EGAP_Processing/Ps_gandalfiana/ONT/SRR########.fastq.gz,/mnt/d/EGAP/EGAP_Processing/Ps_gandalfiana/Illumina/B1_3,None,None,Ps_gandalfiana,Funga,Eukaryote,basidiomycota,agaricales,60m,/mnt/d/EGAP/EGAP_Processing/Ps_gandalfiana/GCF_#########_#.fna
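A file in this layout can also be generated programmatically, which avoids stray spaces and misspelled headers. The following is a minimal sketch using Python's standard `csv` module. The column order is copied from the header above; `write_samples_csv` is a hypothetical helper, and the sample paths are placeholders.

```python
import csv
import io

# Column order copied from the header documented above.
EGAP_COLUMNS = [
    "ONT_RAW_DIR", "ONT_RAW_READS", "ILLUMINA_RAW_DIR",
    "ILLUMINA_RAW_F_READS", "ILLUMINA_RAW_R_READS", "SPECIES_ID",
    "ORGANISM_KINGDOM", "ORGANISM_KARYOTE", "COMPLEASM_1", "COMPLEASM_2",
    "EST_SIZE", "REF_SEQ",
]


def write_samples_csv(samples, handle):
    """Write sample dicts to `handle`; columns missing from a dict become 'None'."""
    writer = csv.DictWriter(handle, fieldnames=EGAP_COLUMNS, restval="None",
                            lineterminator="\n")
    writer.writeheader()
    for sample in samples:
        writer.writerow(sample)


if __name__ == "__main__":
    # Placeholder sample; only the columns that are not "None" are listed.
    sample = {
        "ONT_RAW_READS": "/path/to/ONT/sample1.fq.gz",
        "ILLUMINA_RAW_F_READS": "/path/to/Illumina/sample1_R1.fq.gz",
        "ILLUMINA_RAW_R_READS": "/path/to/Illumina/sample1_R2.fq.gz",
        "SPECIES_ID": "AB_sample1",
        "ORGANISM_KINGDOM": "Funga",
        "ORGANISM_KARYOTE": "Eukaryote",
        "COMPLEASM_1": "basidiomycota",
        "COMPLEASM_2": "agaricales",
        "EST_SIZE": "60m",
    }
    buffer = io.StringIO()
    write_samples_csv([sample], buffer)
    print(buffer.getvalue(), end="")
```

To write to disk instead, pass a file opened with `open("samples.csv", "w", newline="")` as `handle`.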
- If you provide a value for `ILLUMINA_RAW_DIR`, set `ILLUMINA_RAW_F_READS` and `ILLUMINA_RAW_R_READS` to `None`. EGAP will automatically detect and process all paired-end reads within the specified directory. The same applies to `ONT_RAW_READS` if you provide `ONT_RAW_DIR`.
- Ensure that all file paths are correct and accessible.
- The CSV file should not contain any extra spaces or special characters in the headers.
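The directory-versus-file rule in the notes above can be checked before launching a long run. The sketch below is a hypothetical pre-flight check, not part of EGAP. The last rule (forward and reverse reads given together) is an assumption that follows from the data being paired-end.

```python
def check_row(row):
    """Return a list of problems found in one CSV row (a dict of column -> str)."""
    problems = []

    def is_set(column):
        # Treat a missing value, an empty string, and the literal "None" as unset.
        value = (row.get(column) or "").strip()
        return value not in ("", "None")

    if is_set("ILLUMINA_RAW_DIR") and (
        is_set("ILLUMINA_RAW_F_READS") or is_set("ILLUMINA_RAW_R_READS")
    ):
        problems.append("ILLUMINA_RAW_DIR is set, so the F/R read columns should be None")
    if is_set("ONT_RAW_DIR") and is_set("ONT_RAW_READS"):
        problems.append("ONT_RAW_DIR is set, so ONT_RAW_READS should be None")
    # Assumption: paired-end reads are always supplied as a forward/reverse pair.
    if is_set("ILLUMINA_RAW_F_READS") != is_set("ILLUMINA_RAW_R_READS"):
        problems.append("forward and reverse Illumina reads must be given together")
    return problems
```

Rows can be fed to `check_row` directly from `csv.DictReader`, which yields one dict per sample keyed by the header names.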
First, create the main processing folder with the required sub-folders; rename "EGAP_Processing" or add your own organism-specific folder as needed:
mkdir -p EGAP_Processing/ONT EGAP_Processing/Illumina && \
cd EGAP_Processing
Ps. cubensis var. Golden Teacher assembled with reference to Ps. cubensis var. PE Reference Sequence.
Download the Reference Sequence into the main processing folder:
datasets download genome accession GCF_017499595.1 --include genome,seq-report && \
unzip ncbi_dataset
Download the Illumina data into the Illumina folder (split into multiple files):
cd Illumina && \
prefetch SRR13870478 && \
fastq-dump --gzip --split-files SRR13870478 && \
rm -rf SRR13870478 && \
cd ..
Adjust the paths to correctly match the downloaded files.
python /mnt/d/EGAP/EGAP.py --raw_illu_reads_1 /mnt/d/EGAP/EGAP_Processing/Illumina/SRR13870478_1.fastq.gz \
--raw_illu_reads_2 /mnt/d/EGAP/EGAP_Processing/Illumina/SRR13870478_2.fastq.gz \
--species_id Ps_cubensis \
--organism_kingdom Funga \
--organism_karyote Eukaryote \
--compleasm_1 basidiomycota \
--compleasm_2 agaricales \
--est_size 60m \
--ref_seq /mnt/d/EGAP/EGAP_Processing/ncbi_dataset/data/GCF_017499595.1/GCF_017499595.1_MGC_Penvy_1_genomic.fna
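When scripting several runs, the command above can be assembled from Python rather than typed by hand. The following is a minimal sketch; `build_egap_command` is a hypothetical helper that only joins the long option names documented earlier and does not validate them.

```python
import sys


def build_egap_command(egap_script, options):
    """Return an argument list for subprocess.run from a dict of long option names."""
    command = [sys.executable, egap_script]
    for name, value in options.items():
        if value is None:
            continue  # skip options that were not supplied
        command.extend([f"--{name}", str(value)])
    return command


if __name__ == "__main__":
    # Placeholder paths; adjust them to match your downloaded files.
    command = build_egap_command(
        "/path/to/EGAP.py",
        {
            "raw_illu_reads_1": "/path/to/Illumina/SRR13870478_1.fastq.gz",
            "raw_illu_reads_2": "/path/to/Illumina/SRR13870478_2.fastq.gz",
            "species_id": "Ps_cubensis",
            "est_size": "60m",
            "ref_seq": None,
        },
    )
    print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to start the pipeline.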
Ps. caeruleorhiza
Download the ONT data into the ONT folder:
cd ONT && \
prefetch SRR27945394 && \
fastq-dump --gzip SRR27945394 && \
rm -rf SRR27945394 && \
cd ..
Download the Illumina data into the Illumina folder (split into multiple files):
cd Illumina && \
prefetch SRR27945395 && \
fastq-dump --gzip --split-files SRR27945395 && \
rm -rf SRR27945395 && \
cd ..
Adjust the paths to correctly match the downloaded files.
python /mnt/d/EGAP/EGAP.py --raw_ont_reads /mnt/d/EGAP/EGAP_Processing/ONT/SRR27945394.fastq.gz \
--raw_illu_reads_1 /mnt/d/EGAP/EGAP_Processing/Illumina/SRR27945395_1.fastq.gz \
--raw_illu_reads_2 /mnt/d/EGAP/EGAP_Processing/Illumina/SRR27945395_2.fastq.gz \
--species_id Ps_caeruleorhiza \
--organism_kingdom Funga \
--organism_karyote Eukaryote \
--compleasm_1 basidiomycota \
--compleasm_2 agaricales \
--est_size 60m
- Docker Integration: Provide a Dockerfile as an alternative installation option.
- Automated Quality Assessment Reports: Generate comprehensive quality reports post-assembly for easier analysis.
- Improved Data Management: Remove excess files once the pipeline completes.
- Enhanced Support for Diverse Genomes: Optimize pipeline parameters for non-fungal genomes to improve versatility.
- Improved Error Handling: Develop more robust error detection and user-friendly feedback mechanisms.
- Integration with Additional Sequencing Platforms: Expand support beyond ONT and Illumina to include platforms like PacBio.
This pipeline was adapted from the following two pipelines:
Bollinger IM, Singer H, Jacobs J, Tyler M, Scott K, Pauli CS, Miller DR,
Barlow C, Rockefeller A, Slot JC, Angel-Mosti V. High-quality draft genomes
of ecologically and geographically diverse Psilocybe species. Microbiol Resour
Announc 0:e00250-24. https://doi.org/10.1128/mra.00250-24
Muñoz-Barrera A, Rubio-Rodríguez LA, Jáspez D, Corrales A, Marcelino-Rodriguez I,
Lorenzo-Salazar JM, González-Montelongo R, Flores C. Benchmarking of bioinformatics
tools for the hybrid de novo assembly of human whole-genome sequencing data.
bioRxiv 2024.05.28.595812; doi: https://doi.org/10.1101/2024.05.28.595812
The example data are published in:
Bollinger IM, Singer H, Jacobs J, Tyler M, Scott K, Pauli CS, Miller DR,
Barlow C, Rockefeller A, Slot JC, Angel-Mosti V. High-quality draft genomes
of ecologically and geographically diverse Psilocybe species. Microbiol Resour
Announc 0:e00250-24. https://doi.org/10.1128/mra.00250-24
McKernan K, Kane L, Helbert Y, Zhang L, Houde N, McLaughlin S. A whole genome
atlas of 81 Psilocybe genomes as a resource for psilocybin production. F1000Research
2021, 10:961; doi: https://doi.org/10.12688/f1000research.55301.2
If you would like to contribute to the EGAP Pipeline, please submit a pull request or open an issue on GitHub. For major changes, please discuss them with us first via an issue.
This project is licensed under the MIT License.