Reaction outcome prediction using SELFIES
This repository contains our work for project 2 of the EPFL machine learning course CS433. We carried this project out as part of the ML4Science initiative, which allows students to join a research lab and work on a practical ML problem. The problem was kindly provided by Philippe Schwaller, who also advised us over the course of the project.
In this work, we retrained a transformer-based model for chemical reaction prediction using string-based molecular representations. We compare the established SMILES representation with the recently developed SELFIES representation, which performs well in generative models.
In transformer-based chemical reaction prediction, the prediction problem is treated as an NLP translation task: the input molecules (educts) are treated as the language to be translated, and the reaction products are the desired translations.
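Concretely, a reaction string such as `educts>>product` is split at `>>`, and each side is tokenized into space-separated tokens that the transformer consumes. The sketch below uses the SMILES tokenization regex widely used for molecular transformers; the repository's own tokenizer lives in `src/selfiespredict/helpers`, so treat this as an illustrative approximation:

```python
import re

# Regex commonly used to split SMILES into chemically meaningful tokens
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles: str) -> str:
    """Split a SMILES string into space-separated tokens."""
    return " ".join(SMILES_REGEX.findall(smiles))

# A nucleophilic substitution: educts (source language) >> product (target)
reaction = "CC(=O)Cl.OCC>>CC(=O)OCC"
src, tgt = reaction.split(">>")
print(tokenize(src))  # C C ( = O ) Cl . O C C
print(tokenize(tgt))  # C C ( = O ) O C C
```

The tokenized source and target files are then fed to the translation model exactly like sentence pairs in machine translation.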
The repository is structured into subdirectories:

- `/raw_data`: raw data used for training the models
- `/pretrained_models`: all pretrained models mentioned in the report
- `/results`: reactant translation results on the validation set
- `/run`: config files for ONMT models
- `/src/selfiespredict`: the functions we use

An example notebook for generating data, training a model and evaluating predictions is included at the root as `Tokenize_Train_Evaluate.ipynb`. The report and figures are included in `/report`.
Not all data used in the report is included, due to size limitations. Tokenized SMILES data is included in `/tokenized_data`. All other data can be generated with the functions in `/src/selfiespredict/data`.
Functions are split into data (`/src/selfiespredict/data`), evaluation (`/src/selfiespredict/evaluation`) and helpers (`/src/selfiespredict/helpers`). The `data_load` file in `/src/selfiespredict/data` contains the `data_loader` class for downloading raw data (`import_data`) and generating the tokenized data we used for our models (`gen_txt` / `gen_SMILE_tokenized_SELFIES` / `gen_SELFIEandSMILES`). It also includes functions for converting the string representations into each other and for tokenization. The `errormetrics` file in `/src/selfiespredict/helpers` yields the top-1 or top-5 accuracy used for evaluating the models. The `Helper_Functions` file in `/src/selfiespredict/helpers` contains the SMILES tokenizer.
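The top-k accuracy computed by `errormetrics` can be sketched as follows (a simplified stand-in that assumes the model's predictions come as a ranked n-best list per reaction; canonicalisation details are omitted):

```python
def top_k_accuracy(n_best_predictions, targets, k=1):
    """Fraction of reactions whose true product appears among the first k candidates.

    n_best_predictions: list of ranked candidate lists, one list per reaction
    targets: list of ground-truth product strings
    """
    hits = sum(
        target in candidates[:k]
        for candidates, target in zip(n_best_predictions, targets)
    )
    return hits / len(targets)

# Toy example: 3 reactions, 2 candidates each
preds = [["CCO", "CCN"], ["CC", "CCO"], ["OCC", "CO"]]
truth = ["CCO", "CCO", "CO"]
print(top_k_accuracy(preds, truth, k=1))  # 1 of 3 correct at top-1
print(top_k_accuracy(preds, truth, k=2))  # all 3 correct within top-2
```

Top-1 scores only the single best translation, while top-5 credits the model if the true product appears anywhere in the 5-best beam output.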
The reaction prediction problem introduced in the report (fig. 1) can be solved with the pretrained SMILES model included in the repository:
```
onmt_translate -verbose -model ./pretrained_models/SMILES_250K_pretrained.pt \
    --src ./pretrained_models/testreaction.txt \
    --output predicted_reaction_outcome.txt \
    --n_best 1 --beam_size 5 --max_length 300 --batch_size 1
```
The predicted outcome confirms what a trained chemist can derive: the educts have undergone a nucleophilic substitution reaction. The by-product of the reaction, hydrochloric acid (H-Cl), is not predicted by the model, as it is only trained to predict the main product.
We recommend first creating a virtual environment:
```
conda create --name selfies_project
conda activate selfies_project
```
The code can be installed by first cloning the repository and then running pip locally:
```
git clone <link>
cd <./cloned_repository>
pip install . --user
```
On Windows, the rdkit wheel might not work and git/setuptools might not be installed; in that case we recommend the rdkit installation proposed by the rdkit developers. If git is not installed on Windows, an unclear "missing setuptools" error is thrown:
```
conda activate selfies_project
# make sure to uninstall the not-working PyPI wheel
pip uninstall rdkit-pypi
conda install -c rdkit rdkit
```
And then install into the environment:
```
pip install . --user
```
It might be necessary to run the setup file separately. Due to the limited time of the project, we were not able to identify why this is necessary on Google Colab:
```
python setup.py install
```
To run the tests, run the following command in the selfies directory. This may take a while:
```
python -m unittest
```
The model is based on the Carbohydrate Transformer and the OpenNMT-py translation tool. To properly cite the two, please include:
```
@article{pesciullesi2020transfer,
  title={Transfer learning enables the molecular transformer to predict regio- and stereoselective reactions on carbohydrates},
  author={Pesciullesi, Giorgio and Schwaller, Philippe and Laino, Teodoro and Reymond, Jean-Louis},
  journal={Nature Communications},
  volume={11},
  number={1},
  pages={1--8},
  year={2020},
  publisher={Nature Publishing Group}
}

@inproceedings{opennmt,
  author = {Guillaume Klein and Yoon Kim and Yuntian Deng and Jean Senellart and Alexander M. Rush},
  title = {Open{NMT}: Open-Source Toolkit for Neural Machine Translation},
  booktitle = {Proc. ACL},
  year = {2017},
  url = {https://doi.org/10.18653/v1/P17-4012},
  doi = {10.18653/v1/P17-4012}
}
```