Our cUps algorithm is for classifying the malaria var genes into ups groups using DBLα sequences. It takes as input a reference database of DBLα sequences with known ups groups and a set of DBLα sequences to be classified, and outputs the probabilities of membership to all three ups groups per sequence.
-
Fasta format DBLα sequences to be classified (refer to query sequences).
-
Reference data which consist of (1) reference DBLα sequences, each sequence is annotated with DBLα subclass and ups group; (2) profile HMM for each reference category (combination of DBLα subclass and ups group). Please see reference_data folder for details.
Our algorithm produces four csv files based on various algorithmic stages, and places them in the directory specified by output.
-
Log-likelihood tables (middle output). For each query sequence, its likelihood for being drawn from the profile HMM under each category is computed.
-
Classification result table (final output). Each row starts with the sequence ID and shows the probabilities of membership to three ups groups. An example table is as following.
A | B | C | |
---|---|---|---|
seq1 | 0.98 | 0.01 | 0.01 |
seq2 | 0.01 | 0.04 | 0.95 |
... | ... | ... | ... |
- R
- Python >=3.5
git clone https://github.com/qianfeng2/cUps
cd cUps
pip install -r requirements.txt
mkdir your_output_dir
python scripts/generate_llk.py query_data/example.fasta your_output_dir
Rscript scripts/classify_upsABC.R your_output_dir
The python script is to generate the log-likelihood of query sequences for every profile HMM of reference categories. This script outputs three tables (PAD.csv, PBD.csv and PCD.csv) representating the results under upsA, upsB and upsC group. For every table, each row refers to a query sequence, each column refers to a DBLα subclass, the entry is the log-likelihood value.
The R script is to calculate the posterior probabilities for each query sequence. The reference data is replaceable. The final result is shown at classification_result.csv.
As a toy example, results folder provides all the middle and final output files for the example.fasta stored in query_data folder.
Since our algorithm processes each sequence independenly, it would be time-demanding if there are many sequences in the data. Here we recommend to split the big dataset into subsets and run our algorithm for each subset. With the help of HPC, you can even run all subsets at the same time.
Below I show steps for how to split a data with 1000 sequences (example_bigdata.fasta in query_data folder) and run the subsets simultaneously.
cd query_data
mkdir example_bigdata_split_files
cd ../
python scripts/split_input.py query_data/example_bigdata.fasta query_data/example_bigdata_split_files
cd results
mkdir example_bigdata
cd example_bigdata
for run in $(seq 1 1 4);do mkdir run_$run;done
cd ../../
I use job array in HPC to run the subsets in parallel.
python scripts/generate_llk.py query_data/example_bigdata_split_files/input_run${SLURM_ARRAY_TASK_ID}.fasta results/example_bigdata/run_${SLURM_ARRAY_TASK_ID}
Rscript scripts/classify_upsABC.R results/example_bigdata/run_${SLURM_ARRAY_TASK_ID}
The results for each subset are shown in results/example_bigdata folder.
-
For the format of sequence identifier, although the blank space is not allowed inside the identifier, punctuations (e.g., "|", ";", "_", "-") generally work well with our scripts.
-
Our algorithm only accepts the protein sequences as input. If your data is DNA, please translate it.
-
Always run your data inside the cUps directory, as our algorithm relies on the information (priors, profile HMMs) from reference_data directory.
This algorithm is developed by Qian Feng, Heejung Shim and Yao-ban Chan at the University of Melbourne. For any problems, please report an issue in Github or send an email to Qian Feng.
-
Thomas S Rask, Daniel A Hansen, Thor G Theander,Anders Gorm Pedersen, and Thomas Lavstsen. Plasmodium falciparum erythrocyte membrane protein 1 diversity in seven genomes – divide and conquer. PLoS Comput. Biol., 6(9):e1000933, 2010
-
Anders Krogh, Michael Brown, I Saira Mian, Kimmen Sjölander, and David Haussler. Hidden Markov models in computational biology: Applications to protein modeling. J. Mol. Biol., 235(5):1501–1531, 1994
-
Richard Durbin, Sean R Eddy, Anders Krogh, and Graeme Mitchison. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge university press, 1998