DeepMol

Description

DeepMol is a Python-based machine and deep learning framework for drug discovery. It offers a variety of functionalities that enable a smoother approach to many drug discovery and chemoinformatics problems. It uses Tensorflow, Keras, Scikit-learn and DeepChem to build custom ML and DL models or make use of pre-built ones. It uses the RDKit framework to perform operations on molecular data.

Requirements

Installation

Pip

Install DeepMol via pip:

If you intend to install all the deepmol modules' dependencies:

pip install deepmol[all]

Extra modules:

pip install deepmol[preprocessing]
pip install deepmol[machine_learning]
pip install deepmol[deep_learning]

Also, you should install mol2vec and its dependencies:

pip install git+https://github.com/samoturk/mol2vec#egg=mol2vec

Manually

Alternatively, clone the repository and install the dependencies manually:

Clone the repository:

git clone https://github.com/BioSystemsUM/DeepMol.git

Install dependencies:

python setup.py install

Getting Started

DeepMol is built in a modular way allowing the use of its methods for multiple tasks. It offers a complete workflow to perform ML and DL tasks using molecules represented as SMILES. It has modules that perform standard tasks such as the loading and standardization of the data, computing molecular features like molecular fingerprints, performing feature selection and data splitting. It also provides methods to deal with unbalanced datasets, do unsupervised exploration of the data and compute feature importance as shap values.

The DeepMol framework is still under development, and it is currently at a pre-release version. New models and features will be added in the future.

Load a dataset from a CSV

For now, it is only possible to load data directly from CSV files. Modules to load data from different file types and sources will be implemented in the future. These include JSON, SDF and FASTA files and directly from our databases.

To load data from a CSV it's only required to provide the math and molecules field name. Optionally, it is also possible to provide a field with some ids, the labels fields, features fields, features to keep (useful for instance to select only the features kept after feature selection) and the number of samples to load (by default loads the entire dataset).

from deepmol.loaders.loaders import CSVLoader

# load a dataset from a CSV (required fields: dataset_path and smiles_field)
loader = CSVLoader(dataset_path='../../data/train_dataset.csv',
                   smiles_field='mols',
                   id_field='ids',
                   labels_fields=['y'],
                   features_fields=['feat_1', 'feat_2', 'feat_3', 'feat_4'],
                   shard_size=1000,
                   mode='auto')

dataset = loader.create_dataset()

# print shape of the dataset (molecules, X, y)
dataset.get_shape()

((1000,), None, (1000,))

Compound Standardization

It is possible to standardize the loaded molecules using three option. Using a basic standardizer that only does sanitization (Kekulize, check valencies, set aromaticity, conjugation and hybridization). A more complex standardizer can be customized by choosing or not to perform specific tasks such as sanitization, remove isotope information, neutralize charges, remove stereochemistry and remove smaller fragments. Another possibility is to use the ChEMBL Standardizer.

# Option 1: Basic Standardizer
standardizer = BasicStandardizer().standardize(dataset)

# Option 2: Custom Standardizer
heavy_standardisation = {
    'REMOVE_ISOTOPE': True,
    'NEUTRALISE_CHARGE': True,
    'REMOVE_STEREO': True,
    'KEEP_BIGGEST': True,
    'ADD_HYDROGEN': True,
    'KEKULIZE': False,
    'NEUTRALISE_CHARGE_LATE': True}
standardizer2 = CustomStandardizer(heavy_standardisation).standardize(dataset)

# Option 3: ChEMBL Standardizer
standardizer3 = ChEMBLStandardizer().standardize(dataset)

Compound Featurization

It is possible to compute multiple types of molecular fingerprints like Morgan Fingerprints, MACCS Keys, Layered Fingerprints, RDK Fingerprints and AtomPair Fingerprints. Featurizers from DeepChem and molecular embeddings like the Mol2Vec can also be computed. More complex molecular embeddings like the Seq2Seq and transformer-based are in development and will be added soon.

from deepmol.compound_featurization import MorganFingerprint

# Compute morgan fingerprints for molecules in the previous loaded dataset
MorganFingerprint(radius=2, size=1024).featurize(dataset)
# view the computed features (dataset.X)
dataset.X

#print shape of the dataset to see difference in the X shape
dataset.get_shape()

((1000,), (1000, 1024), (1000,))

Feature Selection

Regarding feature selection it is possible to do Low Variance Feature Selection, KBest, Percentile, Recursive Feature Elimination and selecting features based on importance weights.

from deepmol.feature_selection import LowVarianceFS

# Feature Selection to remove features with low variance across molecules
LowVarianceFS(0.15).select_features(dataset)

# print shape of the dataset to see difference in the X shape (fewer features)
dataset.get_shape()

((1000,), (1000, 35), (1000,))

Unsupervised Exploration

It is possible to do unsupervised exploration of the datasets using PCA, tSNE, KMeans and UMAP.

from deepmol.unsupervised import UMAP

ump = UMAP()
umap_df = ump.run_unsupervised(dataset)
ump.plot(umap_df.X, path='umap_output.png')

Data Split

Data can be split randomly or using stratified splitters. K-fold split, train-test split and train-validation-test split can be used.

from deepmol.splitters.splitters import SingletaskStratifiedSplitter

# Data Split
splitter = SingletaskStratifiedSplitter()
train_dataset, valid_dataset, test_dataset = splitter.train_valid_test_split(dataset=dataset, frac_train=0.7,
                                                                             frac_valid=0.15, frac_test=0.15)
train_dataset.get_shape()

((1628,), (1628, 1024), (1628,))

valid_dataset.get_shape()

((348,), (348, 1024), (348,))

test_dataset.get_shape()

((350,), (350, 1024), (350,))

Build, train and evaluate a model

It is possible use pre-built models from Scikit-Learn and DeepChem or build new ones using keras layers. Wrappers for Scikit-Learn, Keras and DeepChem were implemented allowing evaluation of the models under a common workspace.

Scikit-Learn model example

Models can be imported from scikit-learn and wrapped using the SKlearnModel module.

Check this jupyter notebook for a complete example!

from sklearn.ensemble import RandomForestClassifier
from deepmol.models.sklearn_models import SklearnModel

# Scikit-Learn Random Forest
rf = RandomForestClassifier()
# wrapper around scikit learn models
model = SklearnModel(model=rf)
# model training
model.fit(train_dataset)

from deepmol.metrics.metrics import Metric
from deepmol.metrics.metrics_functions import roc_auc_score

# cross validate model on the full dataset
model.cross_validate(dataset, Metric(roc_auc_score), folds=3)

from sklearn.metrics import precision_score, accuracy_score, confusion_matrix, classification_report

#evaluate the model using different metrics
metrics = [Metric(roc_auc_score), Metric(precision_score), Metric(accuracy_score), Metric(confusion_matrix), 
           Metric(classification_report)]

# evaluate the model on training data
print('Training Dataset: ')
train_score = model.evaluate(train_dataset, metrics)

# evaluate the model on training data
print('Validation Dataset: ')
valid_score = model.evaluate(valid_dataset, metrics)

# evaluate the model on training data
print('Test Dataset: ')
test_score = model.evaluate(test_dataset, metrics)

Keras model example

Example of how to build and wrap a keras model using the KerasModel module.

Check this jupyter notebook for a complete example!

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from deepmol.metrics.metrics import Metric

input_dim = train_dataset.X.shape[1]


def create_model(optimizer='adam', dropout=0.5, input_dim=input_dim):
  # create model
  model = Sequential()
  model.add(Dense(12, input_dim=input_dim, activation='relu'))
  model.add(Dropout(dropout))
  model.add(Dense(8, activation='relu'))
  model.add(Dense(1, activation='sigmoid'))
  # Compile model
  model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
  return model


from deepmol.models.keras_models import KerasModel

model = KerasModel(create_model, epochs=5, verbose=1, optimizer='adam')

# train model
model.fit(train_dataset)

# make prediction on the test dataset with the model
model.predict(test_dataset)

# evaluate model using multiple metrics
metrics = [Metric(roc_auc_score),
           Metric(precision_score),
           Metric(accuracy_score),
           Metric(confusion_matrix),
           Metric(classification_report)]

print('Training set score:', model.evaluate(train_dataset, metrics))
print('Test set score:', model.evaluate(test_dataset, metrics))

DeepChem model example

Using DeepChem models:

Check this jupyter notebook for a complete example!

from deepmol.compound_featurization import WeaveFeat
from deepchem.models import MPNNModel
from deepmol.models.deepchem_models import DeepChemModel
from deepmol.metrics.metrics import Metric
from deepmol.splitters.splitters import SingletaskStratifiedSplitter

ds = WeaveFeat().featurize(dataset)
splitter = SingletaskStratifiedSplitter()
train_dataset, valid_dataset, test_dataset = splitter.train_valid_test_split(dataset=ds, frac_train=0.6, frac_valid=0.2,
                                                                             frac_test=0.2)
mpnn = MPNNModel(n_tasks=1, n_pair_feat=14, n_atom_feat=75, n_hidden=75, T=1, M=1, mode='classification')
model_mpnn = DeepChemModel(mpnn)
# Model training
model_mpnn.fit(train_dataset)
valid_preds = model_mpnn.predict(valid_dataset)
test_preds = model_mpnn.predict(test_dataset)
# Evaluation
metrics = [Metric(roc_auc_score), Metric(precision_score), Metric(accuracy_score)]
print('Training Dataset: ')
train_score = model_mpnn.evaluate(train_dataset, metrics)
print('Valid Dataset: ')
valid_score = model_mpnn.evaluate(valid_dataset, metrics)
print('Test Dataset: ')
test_score = model_mpnn.evaluate(test_dataset, metrics)

Hyperparameter Optimization

Grid and randomized hyperparameter optimization is provided using cross-validation or a held-out validation set.

from deepmol.parameter_optimization.hyperparameter_optimization import HyperparameterOptimizerValidation,

HyperparameterOptimizerCV

# Hyperparameter Optimization (using the above created keras model)
optimizer = HyperparameterOptimizerValidation(create_model)

params_dict = {'optimizer': ['adam', 'rmsprop'],
               'dropout': [0.2, 0.4, 0.5]}

best_model, best_hyperparams, all_results = optimizer.hyperparameter_search(params_dict, train_dataset,
                                                                            valid_dataset, Metric(roc_auc_score))

print(best_hyperparams)
print(best_model)

# Evaluate model
best_model.evaluate(test_dataset, metrics)

Feature Importance (Shap Values)

Explain the output of a machine learning model can be done using SHAP (SHapley Additive exPlanations) package. The features that most influenced (positively or negatively) a certain prediction can be calculated and visualized in different ways:

from deepmol.feature_importance import ShapValues

shap_calc = ShapValues(test_dataset, model)
shap_calc.computePermutationShap()

shap_calc.plotSampleExplanation(index=1, plot_type='waterfall')

shap_calc.plotFeatureExplanation(index=115)

Draw relevant features

It is possible to plot the ON bits (or some of them) in a molecule for MACCS Keys, Morgan and RDK Fingeprints. IT is also possible to draw those bits on the respective molecule. This can be allied with the Shap Values calculation to highlight the zone of the molecule that most contributed to a certain prediction, for instance, the substructure in the molecule that most contributed to its classification as an active or inactive molecule against a receptor.

from deepmol.utils.utils import draw_MACCS_Pattern

patt_number = 54
mol_number = 1

prediction = model.predict(test_dataset)[mol_number]
actual_value = test_dataset.y[mol_number]
print('Prediction: ', prediction)
print('Actual Value: ', actual_value)
smi = test_dataset.mols[mol_number]

draw_MACCS_Pattern(smi, patt_number)

Unbalanced Datasets

Multiple methods to deal with unbalanced datasets can be used to do oversampling, under-sampling or a mixture of both (Random, SMOTE, SMOTEENN, SMOTETomek and ClusterCentroids).

from deepmol.imbalanced_learn.imbalanced_learn import SMOTEENN

train_dataset = SMOTEENN().sample(train_dataset)

About Us

DeepMol is managed by a team of contributors from the BioSystems group at the Centre of Biological Engineering, University of Minho.

This research was financed by Portuguese Funds through FCT – Fundação para a Ciência e a Tecnologia.

Citing DeepMol

Manuscript under preparation.

Publications using DeepMol

Baptista D., Correia J., Pereira B., Rocha M. (2022) "A Comparison of Different Compound Representations for Drug Sensitivity Prediction". In: Rocha M., Fdez-Riverola F., Mohamad M.S., Casado-Vara R. (eds) Practical Applications of Computational Biology & Bioinformatics, 15th International Conference (PACBB 2021). PACBB 2021. Lecture Notes in Networks and Systems, vol 325. Springer, Cham. https://doi.org/10.1007/978-3-030-86258-9_15

Baptista, Delora, Correia, João, Pereira, Bruno and Rocha, Miguel. "Evaluating molecular representations in machine learning models for drug response prediction and interpretability" Journal of Integrative Bioinformatics, vol. 19, no. 3, 2022, pp. 20220006. https://doi.org/10.1515/jib-2022-0006

J. Capela, J. Correia, V. Pereira and M. Rocha, "Development of Deep Learning approaches to predict relationships between chemical structures and sweetness," 2022 International Joint Conference on Neural Networks (IJCNN), 2022, pp. 1-8, doi: 10.1109/IJCNN55064.2022.9891992. https://ieeexplore.ieee.org/abstract/document/9891992

Licensing

DeepMol is under BSD-2-Clause License.

Name		Name	Last commit message	Last commit date
Latest commit History 713 Commits
.github/workflows		.github/workflows
Docker		Docker
docs		docs
examples		examples
src/deepmol		src/deepmol
tests		tests
.gitignore		.gitignore
.readthedocs.yaml		.readthedocs.yaml
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
extra-requirements.txt		extra-requirements.txt
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
requirements_dev.txt		requirements_dev.txt
setup.cfg		setup.cfg
setup.py		setup.py
tox.ini		tox.ini

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DeepMol

Description

Table of contents:

Requirements

Installation

Pip

Manually

Getting Started

Load a dataset from a CSV

Compound Standardization

Compound Featurization

Feature Selection

Unsupervised Exploration

Data Split

Build, train and evaluate a model

Scikit-Learn model example

Keras model example

DeepChem model example

Hyperparameter Optimization

Feature Importance (Shap Values)

Draw relevant features

Unbalanced Datasets

About Us

Citing DeepMol

Publications using DeepMol

Licensing

About

Releases

Packages

Languages

License

Ganeshdhanawade/DeepMol

Folders and files

Latest commit

History

Repository files navigation

DeepMol

Description

Table of contents:

Requirements

Installation

Pip

Manually

Getting Started

Load a dataset from a CSV

Compound Standardization

Compound Featurization

Feature Selection

Unsupervised Exploration

Data Split

Build, train and evaluate a model

Scikit-Learn model example

Keras model example

DeepChem model example

Hyperparameter Optimization

Feature Importance (Shap Values)

Draw relevant features

Unbalanced Datasets

About Us

Citing DeepMol

Publications using DeepMol

Licensing

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages