Eukfinder is a modular pipeline for classifying WGS metagenomic data and recovering potential eukaryotic sequences. It supports both Illumina short reads (Eukfinder_short) and assemblies or long-read data (Eukfinder_long).
Key Features:
- Automated classification of potential eukaryotic sequences.
- Flexible design for short-read, long-read, or assembly data.
- Optional binning workflow for refining nuclear and mitochondrial genomes.
- Customizable databases for different environments (e.g., gut, ocean, soil).
Eukfinder has two different modes of operation based on the input files:
- (a) Illumina short-read workflow (Eukfinder_short): Short reads are first classified into five taxonomic categories (Archaeal, Bacterial, Viral, Eukaryotic, and Unknown) using Centrifuge (DB1) and PLAST (DB2). Reads classified as 'Eukaryotic' or 'Unknown' are assembled into contigs using metaSPAdes. These contigs are then reclassified with Centrifuge and PLAST. Contigs assigned as 'Eukaryotic' or 'Unknown' are combined and treated as potential eukaryotic sequences, which can be passed on to downstream binning and genome recovery.
- (b) Assembled-contig or long-read workflow (Eukfinder_long): For metagenome-assembled contigs or long reads generated on Nanopore or PacBio platforms, the workflow performs a single round of classification to select 'Eukaryotic' and 'Unknown' sequences. The selected sequences are combined and treated as potential eukaryotic sequences, ready for binning and downstream analysis.
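The selection logic shared by both modes can be sketched in a few lines of Python. This is only an illustration of the filtering steps described above, not Eukfinder's actual code: the function names (`select_candidates`, `short_read_workflow`, `long_read_workflow`) and the `assemble` / `classify_contigs` callables are hypothetical stand-ins for the metaSPAdes and Centrifuge/PLAST steps, and the sketch assumes each sequence has already been given a single final category (how the Centrifuge and PLAST results are merged is not shown here).

```python
from typing import Callable, Dict, List

# The five categories named in the workflow description.
CATEGORIES = ("Archaeal", "Bacterial", "Viral", "Eukaryotic", "Unknown")
# Categories carried forward as potential eukaryotic sequences.
KEEP = {"Eukaryotic", "Unknown"}


def select_candidates(labels: Dict[str, str]) -> List[str]:
    """Return the IDs labelled Eukaryotic or Unknown, in input order."""
    unexpected = set(labels.values()) - set(CATEGORIES)
    if unexpected:
        raise ValueError(f"unexpected categories: {sorted(unexpected)}")
    return [seq_id for seq_id, category in labels.items() if category in KEEP]


def long_read_workflow(sequence_labels: Dict[str, str]) -> List[str]:
    """Mode (b): a single round of classification, then selection."""
    return select_candidates(sequence_labels)


def short_read_workflow(
    read_labels: Dict[str, str],
    assemble: Callable[[List[str]], List[str]],
    classify_contigs: Callable[[List[str]], Dict[str, str]],
) -> List[str]:
    """Mode (a): select reads, assemble them, reclassify, select contigs.

    `assemble` and `classify_contigs` are hypothetical callables standing
    in for the assembly and reclassification steps.
    """
    kept_reads = select_candidates(read_labels)     # first round
    contigs = assemble(kept_reads)                  # assembly step
    contig_labels = classify_contigs(contigs)       # second round
    return select_candidates(contig_labels)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any external tools.
    toy_reads = {"r1": "Bacterial", "r2": "Eukaryotic", "r3": "Unknown"}
    toy_assemble = lambda reads: ["contig_" + "_".join(reads)]
    toy_classify = lambda contigs: {c: "Eukaryotic" for c in contigs}
    print(short_read_workflow(toy_reads, toy_assemble, toy_classify))
```

The point of the sketch is that the short-read mode applies the same 'Eukaryotic' or 'Unknown' filter twice, once to reads and once to the assembled contigs, whereas the long-read mode applies it once.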
Schematic representation of Eukfinder pipeline:
The Eukfinder documentation is available on the wiki site.
All feedback is appreciated! Please open an issue on this repository if you would like to ask a question or make a comment.
Zhao, D., Salas-Leiva, D.E., Williams, S.K., Dunn, K.A. and Roger, A.J., 2023. Eukfinder: a pipeline to retrieve microbial eukaryote genomes from metagenomic sequencing data. bioRxiv, pp.2023-12.
- Dandan Zhao (d.zhao@dal.ca)
- Dayana Salas-Leiva (ds2000@cam.ac.uk)