This project was developed for the Machine Learning course at the University of Florence.
Predicting the properties of a protein's 3D structure is one of the most popular and widely studied problems in bioinformatics and machine learning.
Nowadays the most common approach is to use a Recurrent Neural Network, such as a bidirectional LSTM-based model (see the linked articles).
Using a Convolutional Neural Network is a good compromise between performance and execution time.
More details are available in the PDF presentation.
Even though an RNN may capture longer-range dependencies between a protein's amino acids, using a CNN sharply reduces the execution time. The idea is therefore to use a CNN alone or to combine it with an RNN (inspired by [1]) in order to capture both local and global dependencies between the features.
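As an illustration of the combined architecture, here is a minimal Keras sketch, not the model from the notebook. The sequence length, feature count, number of classes, filter sizes and layer widths are all hypothetical values chosen for the example. The convolutions look at a local window around each residue, while the bidirectional LSTM carries information along the whole chain.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# All sizes below are illustrative assumptions, not the notebook's settings.
SEQ_LEN = 100     # hypothetical padded protein length
N_FEATURES = 20   # hypothetical: one-hot over the 20 standard amino acids
N_CLASSES = 8     # hypothetical: the 8 DSSP secondary-structure states


def build_cnn_lstm(seq_len=SEQ_LEN, n_features=N_FEATURES, n_classes=N_CLASSES):
    inputs = keras.Input(shape=(seq_len, n_features))
    # Convolutions capture local context around each residue.
    x = layers.Conv1D(64, kernel_size=7, padding="same", activation="relu")(inputs)
    x = layers.Conv1D(64, kernel_size=7, padding="same", activation="relu")(x)
    # The bidirectional LSTM propagates information in both directions along the chain.
    x = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(x)
    # One softmax over the classes for every residue position.
    outputs = layers.TimeDistributed(layers.Dense(n_classes, activation="softmax"))(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model


model = build_cnn_lstm()
dummy = np.zeros((2, SEQ_LEN, N_FEATURES), dtype="float32")
predictions = model.predict(dummy, verbose=0)
print(predictions.shape)  # (2, 100, 8)
```

Dropping the `Bidirectional` layer gives a CNN-only variant of the same sketch, which is the faster of the two options discussed above.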
The project, built with Keras, consists of a single ipynb file: you can open it with Google Colab or in your local notebook. Either way, I highly recommend using a GPU. There are also several data preprocessing classes and scripts:
- generate_sample.py: Generates random samples from the DSSP data extracted from the PDB files (CullPDB folder).
  Note: you have to put the PDB files in data/cullpdb/pdbs/
  For example, you can download the CullPDB pdbs archive.
- prepare_dataset.py: Script to generate the train, validation and test sets
- proteinStructureParser.py: Parser of dataset metadata
- PDBParser.py: PDB Parser class, used for conversion of PDB to DSSP to features
- Amino.py, Protein.py: classes that may be useful for data analysis
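To give an idea of what "conversion to features" can look like, here is a minimal NumPy sketch of one-hot encoding an amino acid sequence and its DSSP labels. It is not the code from PDBParser.py: the `one_hot` helper, the alphabets, the padding length and the treatment of unknown symbols are assumptions made for the example.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DSSP_STATES = "HBEGITS-"              # 8 DSSP states; '-' stands for the blank/coil state (assumption)


def one_hot(sequence, alphabet, max_len):
    """Encode a string as a (max_len, len(alphabet)) matrix, zero-padded at the end."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    encoded = np.zeros((max_len, len(alphabet)), dtype=np.float32)
    for pos, symbol in enumerate(sequence[:max_len]):
        if symbol in index:  # unknown symbols are left as all-zero rows
            encoded[pos, index[symbol]] = 1.0
    return encoded


x = one_hot("MKTAY", AMINO_ACIDS, max_len=8)  # hypothetical 5-residue fragment
y = one_hot("-HHE-", DSSP_STATES, max_len=8)  # hypothetical DSSP string for it
print(x.shape, y.shape)  # (8, 20) (8, 8)
```

Padding every protein to the same length makes it possible to stack them into a single array for training.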
You will need:
- Python 3.x
- Keras
- NumPy
- Knowledge of CNNs and time for model training
The CNN reached 85% accuracy, while the CNN-LSTM model obtained more than 89%, using CullPDB as the dataset and training the model for 120 epochs, although the second model still appears to have room for improvement.
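For readers who want to reproduce a comparable number, here is a minimal NumPy sketch of per-residue accuracy that ignores padded positions. It assumes that accuracy is computed per residue and that padded positions have all-zero label rows; the notebook may compute its metric differently, and `masked_accuracy` is a hypothetical helper.

```python
import numpy as np


def masked_accuracy(y_true, y_pred):
    """Per-residue accuracy over arrays of shape (n_proteins, seq_len, n_classes).

    Positions whose y_true row is all zeros are treated as padding and skipped.
    """
    mask = y_true.sum(axis=-1) > 0
    correct = y_true.argmax(axis=-1) == y_pred.argmax(axis=-1)
    return float((correct & mask).sum() / mask.sum())


# One protein with three real residues and one padded position (toy data).
y_true = np.array([[[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]], dtype=np.float32)
y_pred = np.array([[[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.6, 0.3, 0.1], [0.9, 0.05, 0.05]]])
accuracy = masked_accuracy(y_true, y_pred)
print(round(accuracy, 3))
```

Without the mask, padded positions would count toward the score and could inflate or deflate it depending on what the model predicts there.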
- S. K. Sønderby, O. Winther. Protein secondary structure prediction with long short term memory networks. arXiv preprint arXiv:1412.7828, 2014.
- A. R. Johansen, S. K. Sønderby, O. Winther. Protein and secondary structure prediction with convolutions and vertical-bi-directional RNNs. DTU, 2016.