Machine Learning Project for building a Convolutional Neural Network for the Proteins Solvent Accessibility Prediction

fabridigua/ML-1D-Convolutional-NN-for-Proteins-Solvent-Accessibility-Prediction

ML 1D Convolutional NN for Proteins Solvent Accessibility Prediction

This project was made for the Machine Learning course at the University of Florence.

The Problem

Predicting the properties of a protein's 3D structure is one of the most popular and widely studied problems in bioinformatics and Machine Learning.
Nowadays the most common approach is a Recurrent Neural Network, such as a Bidirectional LSTM based model (see the linked articles).
A Convolutional Neural Network is a good compromise between performance and execution time.
More details are in the PDF presentation.

The Idea

Even though an RNN may capture more distant dependencies between a protein's amino acids, a CNN sharply reduces the execution time. So the idea is to use a CNN alone or to combine it with an RNN (inspired by [1]), in order to capture both local and global dependencies between the features.
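
To make "local dependencies" concrete, here is a minimal NumPy sketch of what a single 1D convolutional layer does to a protein sequence. It is not the code from the notebook: the one-hot alphabet, the kernel width of 5, the 8 filters and the helper names `one_hot` and `conv1d_same` are all assumptions chosen for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (assumed alphabet)

def one_hot(sequence):
    """Encode a protein sequence as a (length, 20) one-hot matrix."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence):
        encoded[pos, index[aa]] = 1.0
    return encoded

def conv1d_same(x, kernels):
    """1D convolution along the sequence axis with zero padding.

    x:       (length, channels)
    kernels: (width, channels, filters), width must be odd
    returns: (length, filters), one output vector per residue
    """
    width = kernels.shape[0]
    half = width // 2
    padded = np.pad(x, ((half, half), (0, 0)))  # zero-pad both sequence ends
    out = np.empty((x.shape[0], kernels.shape[2]))
    for pos in range(x.shape[0]):
        window = padded[pos:pos + width]  # residues pos-half .. pos+half
        out[pos] = np.tensordot(window, kernels, axes=([0, 1], [0, 1]))
    return out

rng = np.random.default_rng(0)
x = one_hot("MKTAYIAKQR")                 # hypothetical 10-residue sequence
kernels = rng.normal(size=(5, 20, 8))     # random weights, untrained
features = np.maximum(conv1d_same(x, kernels), 0.0)  # ReLU activation
print(features.shape)  # (10, 8)
```

Each output row depends only on a window of 5 neighbouring residues, and all positions can be computed independently of each other. An LSTM instead has to walk through the sequence step by step, which is why it is slower but can carry information across longer distances.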

The Project

The project, built with Keras, consists of a single ipynb file: you can open it with Google Colab or in your local notebook. Either way, I highly suggest using a GPU. There are also several data preprocessing classes and scripts:

  1. generate_sample.py: Generates random samples from the DSSP data extracted from the PDB files (CullPDB folder).
    Note: you have to put the PDB files in data/cullpdb/pdbs/
    For example, you can download the CullPDB pdbs archive.
  2. prepare_dataset.py: Script to generate the train, validation and test sets
  3. proteinStructureParser.py: Parser of the dataset metadata
  4. PDBParser.py: PDB parser class, used to convert PDB to DSSP to features
  5. Amino.py, Protein.py: classes that may be useful for data analysis
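
As a rough idea of what generating the train, validation and test sets involves, here is a small sketch. It is not the content of prepare_dataset.py: the function name `split_dataset`, the 80/10/10 proportions, the fixed seed and the protein identifiers are assumptions for illustration only.

```python
import random

def split_dataset(protein_ids, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle protein identifiers and split them into train/val/test lists.

    The split is done per protein, not per residue, so that residues of the
    same protein never end up in two different sets.
    """
    ids = list(protein_ids)
    random.Random(seed).shuffle(ids)  # seeded for a reproducible split
    n_test = round(len(ids) * test_fraction)
    n_val = round(len(ids) * val_fraction)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test

ids = [f"protein_{i:03d}" for i in range(100)]  # hypothetical identifiers
train, val, test = split_dataset(ids)
print(len(train), len(val), len(test))  # 80 10 10
```

Splitting by protein matters here because neighbouring residues are strongly correlated, so a per-residue split would leak information from the training set into the test set.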

The Dependencies

You will need:

  • Python 3.x
  • Keras
  • Numpy
  • Some CNN knowledge and time for the model training

The Results

The CNN reached 85% accuracy, while the CNN-LSTM model reached more than 89%, using CullPDB as the dataset and training the model for 120 epochs. The second model still appears to have room for improvement.
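
For reference, per-residue accuracy on padded batches is usually computed with a mask, so that padding positions are not counted. The sketch below shows that idea under stated assumptions: it uses two classes (buried/exposed) and a 0/1 padding mask, and `masked_accuracy` is a hypothetical helper. It is not necessarily how the figures above were measured.

```python
import numpy as np

def masked_accuracy(y_true, y_pred, mask):
    """Fraction of correctly predicted residues, ignoring padded positions.

    y_true, y_pred: (batch, length) integer class labels
    mask:           (batch, length), 1 for real residues and 0 for padding
    """
    mask = mask.astype(bool)
    correct = (y_true == y_pred) & mask
    return correct.sum() / mask.sum()

# Two toy proteins padded to length 5; the second has only 3 real residues.
y_true = np.array([[1, 0, 1, 1, 0], [0, 1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 1, 0], [0, 1, 1, 0, 0]])
mask = np.array([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])
print(masked_accuracy(y_true, y_pred, mask))  # 0.875
```

Without the mask the same predictions would score 0.9, because the two padding positions would count as correct answers.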

The linked articles

  1. S. K. Sønderby, O. Winther. Protein secondary structure prediction with long short term memory networks. arXiv preprint arXiv:1412.7828, 2014.
  2. A. R. Johansen, S. K. Sønderby, O. Winther. Protein and secondary structure prediction with convolutions and vertical-bi-directional rnns. DTU, 2016.
