Reaction outcome prediction using SELFIES
This repository contains our work for project 2 of the EPFL machine learning course CS433. We carried this project out as part of the ML4Science initiative, which allows students to join a research lab and work on a practical ML problem. The problem was kindly provided by Philippe Schwaller, who also advised us over the course of the project.
In this work, we retrained a transformer-based model for chemical reaction prediction using string-based molecular representations. We compare the established SMILES representation with the recently developed SELFIES representation, which performs well in generative models.
In transformer-based chemical reaction prediction, the prediction problem is treated as an NLP translation task: the input molecules (educts) are treated as the language to be translated, and the reaction products are the desired translations.
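Concretely, a reaction string such as `educts>>product` is split at `>>`, and each side is tokenized into space-separated tokens that the transformer consumes. The sketch below uses the SMILES tokenization regex widely used for molecular transformers; the repository's own tokenizer lives in `src/selfiespredict/helpers`, so treat this as an illustrative approximation:

```python
import re

# Regex commonly used to split SMILES into chemically meaningful tokens
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles: str) -> str:
    """Split a SMILES string into space-separated tokens."""
    return " ".join(SMILES_REGEX.findall(smiles))

# A nucleophilic substitution: educts (source language) >> product (target)
reaction = "CC(=O)Cl.OCC>>CC(=O)OCC"
src, tgt = reaction.split(">>")
print(tokenize(src))  # C C ( = O ) Cl . O C C
print(tokenize(tgt))  # C C ( = O ) O C C
```

The tokenized source and target files are then fed to the translation model exactly like sentence pairs in machine translation.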
The repository is structured into subdirectories:

- `/raw_data`: raw data used for training the models
- `/pretrained_models`: all pretrained models mentioned in the report
- `/results`: reactant translation results on the validation set
- `/run`: config files for ONMT models
- `/src/selfiespredict`: the functions we use

An example notebook for generating data, training a model and evaluating predictions is included at the root as `Tokenize_Train_Evaluate.ipynb`. The report and figures are included in `/report`.
Not all data used in the report is included, due to size limitations. Tokenized SMILES data is included in `/tokenized_data`. All other data can be generated with the functions in `/src/selfiespredict/data`.
Functions are split into data (`/src/selfiespredict/data`), evaluation (`/src/selfiespredict/evaluation`) and helpers (`/src/selfiespredict/helpers`). The `data_load` file in `/src/selfiespredict/data` contains the `data_loader` class for downloading raw data (`import_data`) and generating the tokenized data we used for our models (`gen_txt` / `gen_SMILE_tokenized_SELFIES` / `gen_SELFIEandSMILES`). It also includes functions for converting the string representations into each other and for tokenization. The `errormetrics` file in `/src/selfiespredict/helpers` yields the top-1 or top-5 accuracy used for evaluating the models. The `Helper_Functions` file in `/src/selfiespredict/helpers` contains the SMILES tokenizer.
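The top-k accuracy computed by `errormetrics` can be sketched as follows (a simplified stand-in that assumes the model's predictions come as a ranked n-best list per reaction; canonicalisation details are omitted):

```python
def top_k_accuracy(n_best_predictions, targets, k=1):
    """Fraction of reactions whose true product appears among the first k candidates.

    n_best_predictions: list of ranked candidate lists, one list per reaction
    targets: list of ground-truth product strings
    """
    hits = sum(
        target in candidates[:k]
        for candidates, target in zip(n_best_predictions, targets)
    )
    return hits / len(targets)

# Toy example: 3 reactions, 2 candidates each
preds = [["CCO", "CCN"], ["CC", "CCO"], ["OCC", "CO"]]
truth = ["CCO", "CCO", "CO"]
print(top_k_accuracy(preds, truth, k=1))  # 1 of 3 correct at top-1
print(top_k_accuracy(preds, truth, k=2))  # all 3 correct within top-2
```

Top-1 scores only the single best translation, while top-5 credits the model if the true product appears anywhere in the 5-best beam output.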
The reaction prediction problem introduced in the report (fig. 1) can be solved with the pretrained SMILES model included in the repository:
```
onmt_translate -verbose -model ./pretrained_models/SMILES_250K_pretrained.pt \
    --src ./pretrained_models/testreaction.txt \
    --output predicted_reaction_outcome.txt \
    --n_best 1 --beam_size 5 --max_length 300 --batch_size 1
```
The predicted outcome confirms what a trained chemist can derive: the educts have undergone a nucleophilic substitution reaction. The by-product of the reaction, hydrochloric acid (H-Cl), is not predicted by the model, as it is only trained to predict the main product.
We recommend first creating a virtual environment:
```
conda create --name selfies_project
conda activate selfies_project
```
The code can be installed by first cloning the repository and then running pip locally:
```
git clone <link>
cd <./cloned_repository>
pip install . --user
```
On Windows, the rdkit wheel might not work and git/setuptools might not be installed; in that case we recommend the rdkit installation proposed by the rdkit developers. If git is not installed on Windows, an unclear "missing setuptools" error is thrown:
```
conda activate selfies_project
# make sure to uninstall the not-working PyPI wheel
pip uninstall rdkit-pypi
conda install -c rdkit rdkit
```
And then install into the environment:
```
pip install . --user
```
It might be necessary to run the setup file separately. Due to the limited time of the project, we were not able to identify why this is necessary on Google Colab:
```
python setup.py install
```
To run the tests, run the following command in the selfies directory. This may take a while:
```
python -m unittest
```
The model is based on the Carbohydrate Transformer and the OpenNMT-py translation tool. To properly cite the two, please include:
```
@article{pesciullesi2020transfer,
  title={Transfer learning enables the molecular transformer to predict regio- and stereoselective reactions on carbohydrates},
  author={Pesciullesi, Giorgio and Schwaller, Philippe and Laino, Teodoro and Reymond, Jean-Louis},
  journal={Nature Communications},
  volume={11},
  number={1},
  pages={1--8},
  year={2020},
  publisher={Nature Publishing Group}
}

@inproceedings{opennmt,
  author = {Guillaume Klein and Yoon Kim and Yuntian Deng and Jean Senellart and Alexander M. Rush},
  title = {Open{NMT}: Open-Source Toolkit for Neural Machine Translation},
  booktitle = {Proc. ACL},
  year = {2017},
  url = {https://doi.org/10.18653/v1/P17-4012},
  doi = {10.18653/v1/P17-4012}
}
```