SelfiesPredict

Reaction outcome prediction using SELFIES

This repository contains our work for Project 2 of the EPFL machine learning course CS433. We carried out this project as part of the ML4Science initiative, which allows students to join a research lab and work on a practical ML problem. The problem was kindly provided by Philippe Schwaller, who also advised us over the course of the project.

In this work, we have retrained a transformer-based model for chemical reaction prediction using string-based molecular representations. We compare the established SMILES representation with the recently developed SELFIES representation, which performs well in generative models.
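
For readers unfamiliar with the two representations, the snippet below shows how a molecule can be converted between SMILES and SELFIES with the selfies Python package. The molecule is purely illustrative and not taken from our data:

    import selfies as sf  # pip install selfies

    smiles = "CC(=O)OCC"               # ethyl acetate as a SMILES string
    selfies_str = sf.encoder(smiles)   # e.g. "[C][C][=Branch1][C][=O][O][C][C]"
    roundtrip = sf.decoder(selfies_str)

    print(selfies_str)
    print(roundtrip)  # decodes back to a valid (not necessarily character-identical) SMILES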

In transformer-based chemical reaction prediction, the prediction problem is treated as an NLP translation task: the educts (reactants) are the source language to be translated, and the reaction products are the target translations.
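
As a rough sketch of this framing (the reaction below is illustrative, not taken from our data sets), a reaction string of the form "educts>>product" is split into a source sequence and a target sequence before being fed to the translation model:

    # illustrative reaction: acetyl chloride + ethanol -> ethyl acetate
    rxn = "CC(=O)Cl.OCC>>CC(=O)OCC"

    educts, _, product = rxn.partition(">>")
    source = educts    # "sentence" in the input language
    target = product   # desired "translation"

    print(source, "->", target)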

Repository structure

The repository is structured into the following subdirectories:

  • /raw_data: raw data utilised for training the models
  • /pretrained_models: all pretrained models mentioned in the report
  • /results: reactant translation results on the validation set
  • /run: config files for the onmt models
  • /src/selfiespredict: the functions utilised throughout the project
  • /report: the report and figures

An example notebook for generating data, training a model and evaluating predictions is included at the root (Tokenize_Train_Evaluate.ipynb).

Not all data utilised in the report is included due to size limitations. Tokenized SMILES data is included in /tokenized_data. All other data may be generated by functions in /src/selfiespredict/data.

Functions

Functions are split into data (/src/selfiespredict/data), evaluation (/src/selfiespredict/evaluation) and helpers (/src/selfiespredict/helpers). The data_load file in /src/selfiespredict/data contains the data_loader class for downloading raw data (import_data) and generating the tokenized data we used for our models (gen_txt / gen_SMILE_tokenized_SELFIES / gen_SELFIEandSMILES). It also includes functions for converting the string representations into each other and for tokenization. The errormetrics file in /src/selfiespredict/helpers yields the top-1 or top-5 accuracy used for evaluating the models. The Helper_Functions file in /src/selfiespredict/helpers contains the SMILES tokenizer.
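
As an illustration of what such a tokenizer does, the sketch below uses the regular expression commonly used in the Molecular Transformer line of work; it is not necessarily the exact implementation in Helper_Functions. SMILES strings are split into chemically meaningful tokens separated by spaces before training:

    import re

    # regex commonly used to split SMILES strings into tokens
    SMILES_REGEX = re.compile(
        r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
    )

    def tokenize_smiles(smiles: str) -> str:
        """Split a SMILES string into space-separated tokens."""
        return " ".join(SMILES_REGEX.findall(smiles))

    print(tokenize_smiles("CC(=O)OCC"))  # "C C ( = O ) O C C"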

Example

[Figure: reaction_prediction.png, the example reaction prediction problem]

  • The reaction prediction problem that was introduced in the report (fig. 1) can be solved with the pretrained SMILES model that we have included in the repository:

    onmt_translate -verbose -model ./pretrained_models/SMILES_250K_pretrained.pt \
    -src ./pretrained_models/testreaction.txt \
    -output predicted_reaction_outcome.txt \
    -n_best 1 -beam_size 5 -max_length 300 -batch_size 1
    
  • The predicted outcome confirms what a trained chemist can derive: the educts have undergone a nucleophilic substitution reaction. The by-product of the reaction, hydrochloric acid (H-Cl), is not predicted by the model, as it is trained to predict only the main product.

[Figure: reaction_prediction_filled_out.png, the reaction with the predicted main product filled in]
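
  • As a sketch of how such a prediction can be checked (the strings and helper below are illustrative, not the errormetrics implementation itself), the predicted SMILES is typically canonicalized with RDKit and compared against the canonicalized ground-truth product:

    from rdkit import Chem  # rdkit, see the install notes below

    def canonicalize(smiles):
        """Return the canonical SMILES, or None if RDKit cannot parse the string."""
        mol = Chem.MolFromSmiles(smiles)
        return Chem.MolToSmiles(mol) if mol is not None else None

    predicted = "CC(=O)OCC"   # illustrative model output
    reference = "O=C(C)OCC"   # illustrative ground-truth product

    pred_can = canonicalize(predicted)
    ref_can = canonicalize(reference)

    # a top-1 hit: both strings parse and describe the same molecule
    hit = pred_can is not None and pred_can == ref_can
    print(hit)  # True: both strings are ethyl acetate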

Install notes

  • We recommend first creating a virtual environment:

    conda create --name selfies_project
    conda activate selfies_project
    
  • The code can be installed by first cloning the repository and then running pip locally:

    git clone <link>
    cd <./cloned_repository>
    pip install . --user
    
  • On Windows, the rdkit PyPI wheel might not work, and git/setuptools might not be installed; if git is missing, an unclear "missing setuptools" error is thrown. In that case we recommend the conda-based install proposed by the rdkit developers:

    conda activate selfies_project
    # make sure to uninstall the non-working PyPI wheel first
    pip uninstall rdkit-pypi
    conda install -c rdkit rdkit
    
  • And then install into the environment:

    pip install . --user
    
  • It may be necessary to run the setup file separately. Due to the limited time of the project, we were not able to identify why this is necessary on Google Colab:

    python setup.py install
    

Tests

  • To run the tests, run the following command in the repository directory. This may take a while:

    python -m unittest
    

Citations

  • The model is based on the Carbohydrate Transformer and the OpenNMT-py translation tool. To properly cite the two, please include:

    @article{pesciullesi2020transfer,
      title={Transfer learning enables the molecular transformer to predict regio-and stereoselective reactions on carbohydrates},
      author={Pesciullesi, Giorgio and Schwaller, Philippe and Laino, Teodoro and Reymond, Jean-Louis},
      journal={Nature Communications},
      volume={11},
      number={1},
      pages={1--8},
      year={2020},
      publisher={Nature Publishing Group}
    }
    
    @inproceedings{opennmt,
      author    = {Guillaume Klein and
                   Yoon Kim and
                   Yuntian Deng and
                   Jean Senellart and
                   Alexander M. Rush},
      title     = {Open{NMT}: Open-Source Toolkit for Neural Machine Translation},
      booktitle = {Proc. ACL},
      year      = {2017},
      url       = {https://doi.org/10.18653/v1/P17-4012},
      doi       = {10.18653/v1/P17-4012}
    }
    
