# Kun-peng <img src="./docs/KunPeng.png" align="right" width="140"/>

[![](https://img.shields.io/badge/doi-waiting-yellow.svg)]() [![](https://img.shields.io/badge/release%20version-0.5.6-green.svg)](https://github.com/eric9n/kraken2-rust/releases)

We developed Kun-peng, an accurate and highly scalable low-memory tool for classifying metagenomic sequences.

Inspired by Kraken2's k-mer-based approach, Kun-peng incorporates an advanced sliding window algorithm during sample classification and, crucially, employs an ordered-chunks method when building the reference database. This approach allows the database to be constructed as sub-databases of any desired chunk size, reducing running memory usage by orders of magnitude. These improvements enable Kun-peng to run on personal computers and HPC platforms alike. In practice, even for very large indices, Kun-peng makes the taxonomic classification task executable on essentially all computing platforms, without the need for traditionally expensive and rare high-memory nodes.

Importantly, the flexible structure of the reference index also allows the construction and use of supermassive indices that were previously infeasible due to computational constraints. Supermassive indices, incorporating the growing genomic data from prokaryotes and eukaryotes as well as metagenomic assemblies, are crucial for investigating diverse and complex environmental metagenomes, such as in exposome research.

The name "Kun-peng" is a massive mythical creature capable of transforming from a giant fish in the water (Kun) to a giant bird in the sky (Peng) from Chinese mythology, reflecting the flexible nature and capacity of the software to efficiently navigate the vast and complex landscapes of metagenomic data.

![Workflow of Kun-peng](./docs/Picture1.png)

## Get Started

Follow these steps to install Kun-peng and run the examples.

### Method 1: Download Pre-built Binaries (Recommended)

If you prefer not to build from source, you can download the pre-built binaries for your platform from the GitHub [releases page](https://github.com/eric9n/kraken2-rust/releases).
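
For example, to fetch the v0.5.6 build for CentOS 7 (the URL below follows GitHub's usual release-asset pattern and is an assumption; pick the archive matching your platform on the releases page):

``` sh
# Hypothetical direct-download command; verify the asset name on the releases page
wget https://github.com/eric9n/kraken2-rust/releases/download/v0.5.6/kraken2-rust-v0.5.6-centos7.tar.gz
```

Then extract the archive and add it to your PATH: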

``` sh
mkdir kun_peng_v0.5.6
tar -xvf kraken2-rust-v0.5.6-centos7.tar.gz -C kun_peng_v0.5.6
# Add the extracted directory to your PATH
# (this example assumes the archive was extracted under ~/biosoft; adjust to your location)
echo 'export PATH=$PATH:~/biosoft/kun_peng_v0.5.6' >> ~/.bashrc
source ~/.bashrc
```
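
To confirm the binary is on your PATH, print the version (the `-V, --version` flag appears in the tool's help output later in this README):

``` sh
kun_peng --version
```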

#### Run the `kun_peng` example

We will use the very small virus database from the GitHub repository as an example:

1. download database

``` sh
git clone https://github.com/eric9n/kraken2-rust.git
cd kraken2-rust
```

2. build database

``` sh
kun_peng build --download-dir data/ --db test_database
```

```
merge fna start...
merge fna took: 29.998258ms
estimate start...
estimate count: 14080, required capacity: 31818.0, Estimated hash table requirement: 124.29KB
convert fna file "test_database/library.fna"
process chunk file 1/1: duration: 29.326627ms
build k2 db took: 30.847894ms
```

3. classify

``` sh
# temp_chunk is used to store intermediate files
mkdir temp_chunk
# test_out is used to store output files
mkdir test_out
kun_peng classify --db test_database data/COVID_19.fa --chunk-dir temp_chunk --output-dir test_out
```

```
hash_config HashConfig { value_mask: 31, value_bits: 5, capacity: 31818, size: 13051, hash_capacity: 1073741824 }
splitr start...
splitr took: 18.212452ms
annotate start...
chunk_file "temp_chunk/sample_1.k2"
load table took: 548.911µs
annotate took: 12.006329ms
resolve start...
resolve took: 39.571515ms
Classify took: 92.519365ms
```
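
You can then inspect the classification results; the file naming here assumes the `output_1.*` pattern described in the Output section below:

``` sh
head test_out/output_1.txt
cat test_out/output_1.kreport2
```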

### Method 2: Clone the Repository and Build the Project

#### Prerequisites

1. **Rust**: This project requires the Rust programming environment if you plan to build from source.

#### Build the Projects

First, clone this repository to your local machine:

``` sh
git clone https://github.com/eric9n/kraken2-rust.git
cd kraken2-rust
```

Ensure that both projects are built. You can do this by running the following command from the root of the workspace:

``` sh
cargo build --release
```

This will build the `kr2r` and `ncbi` projects in release mode.
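
The compiled binaries are placed in `target/release/`, which is where the commands in the rest of this README expect them:

``` sh
ls target/release/kun_peng target/release/ncbi
```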

#### Run the `kun_peng` example

Next, run the example script that demonstrates how to use the `kun_peng` binary. Execute the following command from the root of the workspace:

``` sh
cargo run --release --example build_and_classify --package kr2r
```

This will run the `build_and_classify.rs` example located in the `kr2r` project's examples directory.

Example Output
You should see output similar to the following:

``` txt
Executing command: /path/to/workspace/target/release/kun_peng build --download-dir data/ --db test_database
kun_peng build output: [build output here]
kun_peng build error: [any build errors here]
kun_peng direct output: [direct output here]
kun_peng direct error: [any direct errors here]
```

This output confirms that the `kun_peng` commands were executed successfully and the files were processed as expected.

#### Run the `ncbi` Example

Run the example script in the `ncbi` project to download the necessary files. Execute the following command from the root of the workspace:

``` sh
cargo run --release --example run_download --package ncbi
```

This will run the `run_download.rs` example located in the `ncbi` project's examples directory. The script will:

1. Ensure the necessary directories exist.
2. Download the required files using the `ncbi` binary with the following commands:

- `./target/release/ncbi -d downloads gen -g archaea`
- `./target/release/ncbi -d downloads tax`

Example Output
You should see output similar to the following:

``` txt
Executing command: /path/to/workspace/target/release/ncbi -d /path/to/workspace/downloads gen -g archaea
NCBI binary output: [download output here]
Executing command: /path/to/workspace/target/release/ncbi -d /path/to/workspace/downloads tax
NCBI binary output: [download output here]
```

## ncbi tool

The `ncbi` binary is used to download resources from the NCBI website. Here is its help manual:

``` sh
./target/release/ncbi -h
ncbi download resource

Options:
-V, --version Print version
```

## kun_peng tool

``` sh
Usage: kun_peng <COMMAND>

Commands:
Options:
-V, --version Print version
```

### build database

Build the `kun_peng` database, as with Kraken2, by specifying the directory of the data files downloaded from NCBI, as well as the database directory.
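
For example, after fetching archaea genomes and the taxonomy with the `ncbi` tool above, a build might look like this (a sketch; `ncbi_database` is a hypothetical output directory name):

``` sh
./target/release/kun_peng build --download-dir downloads/ --db ncbi_database
```

The full set of options: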

``` sh
./target/release/kun_peng build -h
build database

Options:
Print version
```
### classify

The classification process is divided into three modes:

1. Direct Processing Mode:

- Description: In this mode, all database files are loaded simultaneously, which requires a significant amount of memory. Before running this mode, check the total size of the `hash_*.k2d` files in the database directory using the provided script, and ensure that your available memory meets or exceeds this size.

``` sh
bash cal_memory.sh $database_dir
```
- Characteristics:
  - High memory requirements
  - Fast performance

Command Help
``` sh
./target/release/kun_peng direct -h
Directly load all hash tables for classification annotation

Options:
Print version
```
2. Chunk Processing Mode:

- Description: This mode processes the sample data in chunks, loading only a small portion of the database files at a time. This reduces the memory requirement to a minimum of 4 GB plus the size of one pair of sample files.
- Characteristics:
  - Low memory consumption
  - Slower performance compared to Direct Processing Mode

Command Help
``` sh
./target/release/kun_peng classify -h
Integrates 'splitr', 'annotate', and 'resolve' into a unified workflow for sequence classification. classify a set of sequences

Options:
Print version
```
3. Step-by-Step Processing Mode:

- Description: This mode breaks down the chunk processing mode into individual steps, providing greater flexibility in managing the entire classification process (see the sketch after this list).
- Characteristics:
  - Flexible processing steps
  - Similar memory consumption to Chunk Processing Mode
  - Performance varies based on execution steps
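
A minimal sketch of this mode, assuming each subcommand accepts the same `--db`, `--chunk-dir`, and `--output-dir` flags shown for `classify` above (the exact options are assumptions; run `kun_peng <COMMAND> -h` to confirm):

``` sh
# Hypothetical step-by-step run; flag names are assumed from `classify -h`
./target/release/kun_peng splitr --db test_database --chunk-dir temp_chunk data/COVID_19.fa
./target/release/kun_peng annotate --db test_database --chunk-dir temp_chunk
./target/release/kun_peng resolve --db test_database --chunk-dir temp_chunk --output-dir test_out
```
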
### Output
- `test_out/output_1.txt`:

Standard Kraken Output Format:
1. “C”/“U”: a one-letter code indicating that the sequence was either classified or unclassified.
2. The sequence ID, obtained from the FASTA/FASTQ header.
3. The taxonomy ID Kraken 2 used to label the sequence; this is 0 if the sequence is unclassified.
4. The length of the sequence in bp. In the case of paired read data, this will be a string containing the lengths of the two sequences in bp, separated by a pipe character, e.g. “98\|94”.
5. A space-delimited list indicating the LCA mapping of each k-mer in the sequence(s). For example, “562:13 561:4 A:31 0:1 562:3” would indicate that:
- the first 13 k-mers mapped to taxonomy ID #562
- the next 4 k-mers mapped to taxonomy ID #561
- the next 31 k-mers contained an ambiguous nucleotide
- the next k-mer was not in the database
- the last 3 k-mers mapped to taxonomy ID #562
Note that paired read data will contain a “`|:|`” token in this list to indicate the end of one read and the beginning of another.
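
For illustration, a hypothetical classified single-end read with the example mappings above could look like the following line (the real file is tab-separated; with the default k-mer length of 35, the 52 listed k-mers imply an 86 bp read):

``` txt
C   read_1   562   86   562:13 561:4 A:31 0:1 562:3
```
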
- `test_out/output_1.kreport2`:
```
100.00 1 0 R 1 root
100.00 1 0 D 10239 Viruses
100.00 1 0 D1 2559587 Riboviria
100.00 1 0 O 76804 Nidovirales
100.00 1 0 O1 2499399 Cornidovirineae
100.00 1 0 F 11118 Coronaviridae
100.00 1 0 F1 2501931 Orthocoronavirinae
100.00 1 0 G 694002 Betacoronavirus
100.00 1 0 G1 2509511 Sarbecovirus
100.00 1 0 S 694009 Severe acute respiratory syndrome-related coronavirus
100.00 1 1 S1 2697049 Severe acute respiratory syndrome coronavirus 2
```
Sample Report Output Formats:
1. Percentage of fragments covered by the clade rooted at this taxon
2. Number of fragments covered by the clade rooted at this taxon
3. Number of fragments assigned directly to this taxon
4. A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. Taxa that are not at any of these 10 ranks have a rank code that is formed by using the rank code of the closest ancestor rank with a number indicating the distance from that rank. E.g., “G2” is a rank code indicating a taxon is between genus and species and the grandparent taxon is at the genus rank.
5. NCBI taxonomic ID number
6. Indented scientific name
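
For example, a quick way to pull species-level rows from the report, assuming tab-separated columns with the rank code in column 4 as described above:

``` sh
awk -F'\t' '$4 == "S"' test_out/output_1.kreport2
```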