Protein Secondary Structure Prediction with Deep Learning

This repository contains a deep learning model that predicts secondary structure in proteins. The dataset, originally from the Protein Data Bank (PDB), contains amino acid sequence and structure information for roughly 6100 proteins.

Currently, we use a bidirectional LSTM (a recurrent neural network) to solve the Q8 classification problem, in which each amino acid in a protein sequence is assigned one of eight labels:

  • alpha helix
  • beta strand
  • loop or irregular
  • beta turn
  • bend
  • 3₁₀-helix
  • beta bridge
  • pi helix

The network reaches roughly 56% Q8 accuracy in testing. (Note: these scripts are not optimized for running on GPUs.)
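For readers unfamiliar with the metric, Q8 accuracy is the fraction of residues whose predicted class matches the true class. The following is a minimal sketch, not the repository's own evaluation code: the function name `q8_accuracy`, the integer label order, and the use of `-1` as a padding marker are all assumptions made for illustration.

```python
# Hypothetical label order; the repository's own encoding may differ.
Q8_LABELS = [
    "alpha helix", "beta strand", "loop or irregular", "beta turn",
    "bend", "3-10 helix", "beta bridge", "pi helix",
]


def q8_accuracy(true_labels, predicted_labels, pad_label=None):
    """Return the fraction of residues whose predicted class is correct.

    Positions whose true label equals pad_label are skipped, so any
    padding used to equalize sequence lengths does not affect the score.
    """
    correct = 0
    total = 0
    for true, pred in zip(true_labels, predicted_labels):
        if true == pad_label:
            continue
        total += 1
        if true == pred:
            correct += 1
    # Avoid dividing by zero when there is nothing to score.
    return correct / total if total else 0.0


# Toy example: six real residues followed by two padded positions (-1).
true = [0, 0, 1, 2, 2, 5, -1, -1]
pred = [0, 1, 1, 2, 0, 5, 0, 0]
print(q8_accuracy(true, pred, pad_label=-1))
```

In the toy example, four of the six non-padded residues are predicted correctly, and the two padded positions are ignored.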

Python Requirements

The required Python modules are in requirements.txt. You can install them with

pip install -r requirements.txt

Running the Scripts

To run the current version of the algorithm, use the following command:

python src/driver.py data/cullpdb+profile_6133.npy 

where the last argument is the location of the dataset. We use the publicly available dataset from Zhou, J. & Troyanskaya, O. (2014). The data is currently available at http://www.princeton.edu/~jzthree/datasets/ICML2014/
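As a rough guide to what the file contains, the sketch below reshapes the flat array into per-residue feature vectors. It is not taken from `src/driver.py`. The layout (sequences padded to 700 residues, 57 features per residue, and the column ranges shown) is an assumption based on the description distributed with the dataset and should be checked against it. The function name `split_dataset` is hypothetical.

```python
import numpy as np

SEQ_LEN = 700      # assumed padded sequence length
N_FEATURES = 57    # assumed number of features per residue

# Assumed column ranges within each residue's feature vector.
RESIDUE_COLS = slice(0, 22)    # one-hot amino acid (last column: "no sequence")
LABEL_COLS = slice(22, 31)     # one-hot Q8 label (last column: "no sequence")
PROFILE_COLS = slice(35, 57)   # sequence profile


def split_dataset(flat):
    """Reshape a (n_proteins, SEQ_LEN * N_FEATURES) array and split it."""
    data = flat.reshape(-1, SEQ_LEN, N_FEATURES)
    residues = data[:, :, RESIDUE_COLS]
    labels = data[:, :, LABEL_COLS]
    profiles = data[:, :, PROFILE_COLS]
    # A position counts as a real residue when the "no sequence"
    # label column is not set.
    mask = labels[:, :, -1] == 0
    return residues, labels[:, :, :8], profiles, mask


# Synthetic stand-in so the sketch runs without the real file. With the
# real data this would be np.load("data/cullpdb+profile_6133.npy").
flat = np.zeros((3, SEQ_LEN * N_FEATURES))
residues, labels, profiles, mask = split_dataset(flat)
print(residues.shape, labels.shape, profiles.shape, mask.shape)
```

The returned mask can be used to exclude padded positions when computing accuracy.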

Add the -c flag to print the 8x8 confusion matrix of the labels on the validation data after each epoch of training. For more information about using the driver script, run it with the help flag:

python src/driver.py -h
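To help read the confusion matrix printed by the -c flag, here is a minimal sketch of how such a matrix can be built. It is an illustration only: the function name `confusion_matrix` is hypothetical, and the convention that rows are true labels and columns are predicted labels is an assumption that may not match the driver's output.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes=8):
    """Count how often each true class (row) is predicted as each class (column)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true, pred in zip(true_labels, predicted_labels):
        matrix[true][pred] += 1
    return matrix


# Toy example with integer class indices in the range 0..7.
true = [0, 0, 1, 2, 7]
pred = [0, 1, 1, 2, 0]
matrix = confusion_matrix(true, pred)
for row in matrix:
    print(" ".join(f"{count:3d}" for count in row))
```

Entries on the diagonal are correct predictions. Large off-diagonal entries show which pairs of structure classes the model confuses most often.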