This study presents a replication attempt in C++ of the study by Phoophakdee Benjarath, and Mohammed J. Zaki., on Genome-scale disk-based suffix tree indexing. Our goal was to replicate the four-stage of researchers from scratch and evaluate their original findings. However, we encountered challenges due to the limited guidance for its implementation. As a result, instead of a one-to-one replication of the study, we decided to implement our interpretation of the algorithm, this solution while functional and able to replicate the algorithm’s main stages, had its own limitations and was unable to be utilized for genome-scale sequences. This experience showcases the importance of thoroughness in reporting research to ensure the reproducibility and reliability of findings.
An in-depth technical view can be accessed at ./Project Report/Technical_Report.pdf
, we provide some main figures below for ease of access:
Genetic sequence Suffix Tree example:
Algorithm stages:
Suffix Tree and Node class variables:
Algorithm 1, Variable length prefix creation:
Algorithm 2, Sequence partitioning:
Algorithm 3, Suffix tree construction:
Algorithm 4, Querying partition tree for prefix:
Algorithm 5, Tree merging:
- *.h files are the header files for the corresponding .cpp files to be included in the main.cpp file.
- file_utils contains functions for creating, deleting folders to store the output files of each stage and for partitioning the input sequence into smaller files.
- node contains the class constructor and respective methods for the nodes of the suffix tree.
- suffix_tree contains the class constructor and respective methods for the suffix tree.
- prefix_count contains the methods required for the variable lenght prefix stage.
- suffix_links contains the methods required for the suffix links stage.
- run_pipeline contains the four stages of the algorithm in distinct functions and the main function that calls them in the right order.
- TRELLIS.exe is the executable file of the algorithm
- ./Project Report contains the technical report of the project.
- ./input contains the input sequence files for the algorithm.
- ./temp_prfx contains the identified variable length prefixes.
- ./temp_partition contains the partitioned input sequence files.
- ./temp_trees contains the prefixed suffix sub-trees of the partitioned input for each prefix.
- ./temp_trees/merged_trees contains the merged suffix trees of the partitioned input for each prefix.
- ./temp_trees/final_trees contains the merged trees with the addition of suffix links.
- Make sure an
./input
directory exists in the same directory as the executable fileTRELLIS.exe
. - The Sars Cov-2 genome sequence should be in the
./input
directory with the nameNC_045512V2.fa
. - Run the executable file
TRELLIS.exe
.
If you wish to compile the algorithm the command utilized is:
g++ -fdiagnostics-color=always -g -ggdb -pedantic-errors -Wall -Wextra -Wsign-conversion -Werror -std=c++20 *.cpp -o /home/cinnamonk/projects/Trelis_repo/TRELLIS.exe