This Jupyter Notebook demonstrates the use of machine learning and deep learning for drug discovery against a specific protein target. The pipeline has three stages: building a Random Forest (RF) regression model to predict pIC50 values for the target's chemical compounds, training an LSTM to generate novel Simplified Molecular-Input Line Entry System (SMILES) strings, and using the trained RF model to predict pIC50 values for the generated strings.
Inspired by Chanin Nantasenamat's Computational Drug Discovery
- Dataset
- Bioactivity data from the ChEMBL database for Human Acetylcholinesterase (hAChE)
- 9,091 data points and 46 associated properties
- Preprocessing data:
- remove duplicates and missing values
- normalize IC50 values by taking the negative base-10 logarithm (pIC50)
- calculate fingerprint descriptors for each molecule from its molecule ID and SMILES string
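
The cleaning and pIC50 steps listed above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the record keys (`molecule_id`, `smiles`, `ic50_nm`), the choice to deduplicate on the SMILES string, and the assumption that IC50 values are reported in nM are all hypothetical.

```python
import math

def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L).

    Since 1 nM = 1e-9 M, this equals 9 - log10(IC50 in nM).
    """
    return 9.0 - math.log10(ic50_nm)

def clean_records(records):
    """Drop incomplete and duplicate records, then attach a pIC50 value.

    Assumption: duplicates are identified by SMILES string and the first
    occurrence is kept. The field names are hypothetical.
    """
    seen = set()
    cleaned = []
    for rec in records:
        smiles = rec.get("smiles")
        ic50 = rec.get("ic50_nm")
        # Skip rows with a missing SMILES or a missing / non-positive IC50.
        if not smiles or ic50 is None or ic50 <= 0:
            continue
        # Skip rows whose SMILES string was already seen.
        if smiles in seen:
            continue
        seen.add(smiles)
        cleaned.append({**rec, "pIC50": ic50_nm_to_pic50(ic50)})
    return cleaned

# Made-up example rows, for illustration only.
raw = [
    {"molecule_id": "mol-1", "smiles": "CCO", "ic50_nm": 1000.0},
    {"molecule_id": "mol-2", "smiles": "CCO", "ic50_nm": 750.0},       # duplicate SMILES
    {"molecule_id": "mol-3", "smiles": None, "ic50_nm": 10.0},         # missing SMILES
    {"molecule_id": "mol-4", "smiles": "c1ccccc1", "ic50_nm": None},   # missing IC50
    {"molecule_id": "mol-5", "smiles": "CCN", "ic50_nm": 10.0},
]
cleaned = clean_records(raw)
```

Working on the log scale compresses IC50 values that span several orders of magnitude into a narrow range, which is usually easier for a regression model to fit.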
- Methods
- Training Random Forest model
- Apply nested cross-validation to find optimal parameters (5-fold outer loop and 3-fold inner loop)
- Summary of nested cross-validation:

| k-fold | Best parameters | Best score |
| --- | --- | --- |
| k=1 | {'max_depth': 100, 'n_estimators': 1000} | 0.58 |
| k=2 | {'max_depth': 90, 'n_estimators': 1000} | 0.55 |
| k=3 | {'max_depth': 100, 'n_estimators': 1000} | 0.46 |
| k=4 | {'max_depth': 100, 'n_estimators': 1000} | 0.52 |
| k=5 | {'max_depth': 100, 'n_estimators': 1200} | 0.52 |

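
A nested cross-validation with a 5-fold outer loop and a 3-fold inner loop can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the notebook's implementation: the data are synthetic binary "fingerprints", the parameter grid is deliberately tiny so the example runs quickly, and R² is assumed as the scoring metric because the summary does not name one.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

def nested_cv(X, y, param_grid, outer_splits=5, inner_splits=3, seed=0):
    """Run nested CV: the inner loop tunes parameters, the outer loop scores them."""
    outer = KFold(n_splits=outer_splits, shuffle=True, random_state=seed)
    inner = KFold(n_splits=inner_splits, shuffle=True, random_state=seed)
    results = []
    for fold, (train_idx, test_idx) in enumerate(outer.split(X), start=1):
        # Inner loop: grid search using only the outer-training portion.
        search = GridSearchCV(
            RandomForestRegressor(random_state=seed),
            param_grid,
            cv=inner,
            scoring="r2",  # assumed metric
        )
        search.fit(X[train_idx], y[train_idx])
        # Outer loop: score the refit best model on the held-out fold.
        outer_score = search.score(X[test_idx], y[test_idx])
        results.append({
            "fold": fold,
            "best_params": search.best_params_,
            "inner_score": search.best_score_,
            "outer_score": outer_score,
        })
    return results

# Synthetic stand-in for fingerprint features and pIC50 targets.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 20)).astype(float)
y = X[:, :3].sum(axis=1) + rng.normal(0.0, 0.1, size=60)

# Hypothetical small grid; the summary above reports far larger forests.
grid = {"max_depth": [3, None], "n_estimators": [10, 20]}
results = nested_cv(X, y, grid)
for row in results:
    print(row["fold"], row["best_params"], round(row["outer_score"], 2))
```

Keeping the parameter search inside each outer fold means the held-out fold never influences which parameters are chosen, so the outer scores are a less optimistic estimate of generalization than a single grid search would give.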
- Training SMILES generator with 1,725 SMILES strings representing active molecules
- Training Random Forest model
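
For the SMILES generator, an LSTM is typically trained to predict the next token of a string given the tokens before it. The sketch below shows only the data preparation for that setup, using character-level tokens. It is an assumption about how such training pairs could be built, not the notebook's code: the start and end markers are hypothetical, and character-level splitting is a simplification that breaks two-letter atoms such as `Cl` and `Br` into separate tokens.

```python
START = "^"   # hypothetical start-of-sequence marker (not a SMILES character)
END = "\n"    # hypothetical end-of-sequence marker

def build_vocab(smiles_list):
    """Map every character seen in the corpus, plus the markers, to an integer id."""
    chars = sorted(set("".join(smiles_list)) | {START, END})
    return {ch: i for i, ch in enumerate(chars)}

def encode(smiles, vocab):
    """Wrap a SMILES string in start/end markers and convert it to ids."""
    return [vocab[ch] for ch in START + smiles + END]

def next_char_pairs(smiles, vocab):
    """Return (inputs, targets) where targets is inputs shifted left by one step."""
    ids = encode(smiles, vocab)
    return ids[:-1], ids[1:]

def decode(ids, vocab):
    """Convert a list of ids back to a string (inverse of the vocab mapping)."""
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[i] for i in ids)

# Made-up corpus, for illustration only.
corpus = ["CCO", "c1ccccc1", "CC(=O)O"]
vocab = build_vocab(corpus)
inputs, targets = next_char_pairs("CCO", vocab)
```

At each position the model sees `inputs[t]` and is trained to output `targets[t]`. At generation time, sampling starts from the start marker and stops when the end marker is produced.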