Protein Secondary Structure Prediction with Deep Learning

This repository implements a deep learning architecture for predicting protein secondary structure. The dataset, originally derived from the Protein Data Bank (PDB), contains amino acid sequence and structure information for roughly 6100 proteins.

Currently, we use a bidirectional LSTM RNN architecture to solve the Q8 classification problem, in which each amino acid in a protein sequence is assigned one of eight labels:

alpha helix
beta strand
loop or irregular
beta turn
bend
3₁₀-helix
beta bridge
pi helix
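To make the per-residue classification concrete, here is a minimal NumPy sketch of a bidirectional LSTM forward pass that produces one probability vector over the eight classes for every residue. It is not the code in src/; the function names, the gate ordering (input, forget, output, candidate), the layer sizes, and the random weights are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pass(x, W, U, b, reverse=False):
    """Run a single LSTM direction over x with shape (seq_len, n_in).

    W: (n_in, 4 * n_hidden), U: (n_hidden, 4 * n_hidden), b: (4 * n_hidden,).
    Returns the hidden state at every position, shape (seq_len, n_hidden).
    """
    n_hidden = U.shape[0]
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    outputs = np.zeros((x.shape[0], n_hidden))
    steps = range(x.shape[0] - 1, -1, -1) if reverse else range(x.shape[0])
    for t in steps:
        z = x[t] @ W + h @ U + b
        i, f, o, g = np.split(z, 4)  # assumed gate order: input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs[t] = h
    return outputs

def bilstm_q8_probs(x, params):
    """Return per-residue class probabilities, shape (seq_len, 8)."""
    fwd = lstm_pass(x, *params["fwd"])
    bwd = lstm_pass(x, *params["bwd"], reverse=True)
    # Each residue sees context from both sides of the sequence.
    features = np.concatenate([fwd, bwd], axis=1)
    logits = features @ params["W_out"] + params["b_out"]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def init_params(n_in, n_hidden, n_classes=8, seed=0):
    """Hypothetical random initialisation, for demonstration only."""
    rng = np.random.default_rng(seed)

    def direction():
        return (rng.normal(0.0, 0.1, (n_in, 4 * n_hidden)),
                rng.normal(0.0, 0.1, (n_hidden, 4 * n_hidden)),
                np.zeros(4 * n_hidden))

    return {"fwd": direction(), "bwd": direction(),
            "W_out": rng.normal(0.0, 0.1, (2 * n_hidden, n_classes)),
            "b_out": np.zeros(n_classes)}

# Example: a made-up sequence of 50 residues with 22 input features each.
params = init_params(n_in=22, n_hidden=16)
x = np.random.default_rng(1).normal(size=(50, 22))
predicted_labels = bilstm_q8_probs(x, params).argmax(axis=1)  # one label index per residue
```

With untrained weights the predictions are meaningless; the sketch only shows how the forward and backward passes are combined before the per-residue softmax.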

The network achieves roughly 56% Q8 accuracy on the test set. (Note: these scripts are not optimized for running on GPUs.)
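Q8 accuracy is the fraction of residues whose predicted label matches the true label. The sketch below shows one way to compute it, assuming sequences are padded to a common length and that a boolean mask marks the real (non-padding) residues; the function name and the mask argument are illustrative and not taken from the driver script.

```python
import numpy as np

def q8_accuracy(true_labels, pred_labels, mask):
    """Fraction of unmasked residues whose predicted class equals the true class.

    true_labels, pred_labels: integer arrays of shape (n_proteins, max_len).
    mask: same shape, truthy where a real residue is present.
    """
    mask = np.asarray(mask, dtype=bool)
    correct = (np.asarray(true_labels) == np.asarray(pred_labels)) & mask
    n_residues = mask.sum()
    if n_residues == 0:
        raise ValueError("mask selects no residues")
    return correct.sum() / n_residues

# Two toy proteins padded to length 4 (lengths 3 and 2).
true = np.array([[0, 1, 2, 0], [3, 3, 0, 0]])
pred = np.array([[0, 1, 1, 0], [3, 2, 0, 0]])
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])
accuracy = q8_accuracy(true, pred, mask)  # 3 of 5 real residues are correct
```

Without the mask, padding positions would count as correct predictions and inflate the score.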

Python Requirements

The required Python modules are listed in requirements.txt. You can install them with:

pip install -r requirements.txt

Running the Scripts

To run the current version of the algorithm, use the following command:

python src/driver.py data/cullpdb+profile_6133.npy

where the last argument is the path to the dataset. We use the publicly available dataset from Zhou, J. & Troyanskaya, O. (2014). Currently, this data is accessible here: http://www.princeton.edu/~jzthree/datasets/ICML2014/
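For orientation, the following sketch shows how the flat .npy array might be split into inputs, labels, and a padding mask. The layout used here (sequences padded to 700 residues, 57 features per residue, amino acid one-hot in columns 0-21, eight structure labels plus a "NoSeq" padding marker in columns 22-30) is an assumption based on the dataset's published description; check it against the documentation at the URL above before relying on it. The function name and constants are illustrative, and this is not necessarily how src/driver.py reads the file.

```python
import numpy as np

MAX_LEN = 700      # assumed padded sequence length
N_FEATURES = 57    # assumed number of features per residue

def split_features(flat):
    """Split a (n_proteins, MAX_LEN * N_FEATURES) array into residues, labels, mask."""
    data = flat.reshape(-1, MAX_LEN, N_FEATURES)
    residues = data[:, :, 0:22]           # assumed: amino acid one-hot encoding
    labels_onehot = data[:, :, 22:31]     # assumed: 8 structure classes + NoSeq
    mask = labels_onehot[:, :, 8] == 0    # True where a real residue is present
    labels = labels_onehot[:, :, :8].argmax(axis=2)
    return residues, labels, mask

# With the real file one would call:
#   flat = np.load("data/cullpdb+profile_6133.npy")
# Here a tiny synthetic array stands in for it.
demo = np.zeros((2, MAX_LEN, N_FEATURES))
demo[:, :, 30] = 1.0       # mark every position as padding (NoSeq)
demo[0, :3, 30] = 0.0      # protein 0 has three real residues
demo[0, :3, 22 + 5] = 1.0  # give each of them class index 5
residues, labels, mask = split_features(demo.reshape(2, -1))
```

The mask returned here is what excludes padding positions from accuracy and loss calculations.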

Add the -c flag to see the 8x8 confusion matrix of the labels on the validation data after each epoch of training. For more information about using the driver script, run the driver with the help flag:

python src/driver.py -h
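As a reference for reading the confusion matrix shown by the -c flag, here is a minimal sketch of how an 8x8 confusion matrix can be built from per-residue labels. The convention used (rows are true classes, columns are predicted classes) and the function name are assumptions; the driver script may order or normalize its matrix differently.

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, mask, n_classes=8):
    """Count (true, predicted) label pairs over unmasked residues.

    Entry [i, j] is the number of residues with true class i predicted as class j.
    """
    mask = np.asarray(mask, dtype=bool)
    t = np.asarray(true_labels)[mask]
    p = np.asarray(pred_labels)[mask]
    matrix = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(matrix, (t, p), 1)  # unbuffered add, so repeated pairs accumulate
    return matrix

# Two toy proteins padded to length 4 (lengths 3 and 2).
true = np.array([[0, 1, 2, 0], [3, 3, 0, 0]])
pred = np.array([[0, 1, 1, 0], [3, 2, 0, 0]])
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])
cm = confusion_matrix(true, pred, mask)
```

The diagonal holds the correctly classified residues, so the trace divided by the total count gives the Q8 accuracy.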
