ThermoMPNN is a graph neural network (GNN) trained using transfer learning to predict changes in stability for protein point mutants.
For details on ThermoMPNN training and methodology, please see the accompanying paper.
For a user-friendly version of ThermoMPNN requiring no installation, use this Colab notebook.
To install ThermoMPNN, first clone this repository:

```bash
git clone https://github.com/Kuhlman-Lab/ThermoMPNN.git
```
Then use the environment.yaml file to install the necessary Python dependencies (I recommend using mamba for convenience):

```bash
mamba env create -f environment.yaml
```

This will create a conda environment called `thermoMPNN`.
NOTE: installing from the .yaml file may cause PyTorch to install incorrectly (CPU version instead of GPU version). If this happens, use the following alternate installation instructions:
```bash
# create and activate environment
mamba create -n thermoMPNN python=3.10
mamba activate thermoMPNN

# install pytorch related packages
mamba install pytorch torchvision torchaudio pytorch-cuda=11.8 pytorch-lightning -c nvidia -c pytorch -c conda-forge

# install all other packages
mamba install joblib omegaconf pandas numpy tqdm mmseqs2 wandb biopython -c bioconda -c conda-forge -c anaconda

# check for proper GPU recognition (should return True)
python -c 'import torch; print(torch.cuda.is_available())'
```
There are a few different ways to run inference with ThermoMPNN, all located in the `analysis` directory. The simplest way is to use the `custom_inference.py` script to pass a custom PDB to ThermoMPNN for site-saturation mutagenesis.
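As a rough sketch, an invocation could look like the following; the flag name below is an assumption, not the script's confirmed interface:

```bash
# hypothetical flags -- check the script's help output for its actual arguments
python analysis/custom_inference.py --pdb my_protein.pdb

# list the arguments the script actually accepts (assuming it uses argparse)
python analysis/custom_inference.py --help
```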
For larger batches of predictions, it is recommended to set up a `CustomDataset` object by inheriting from the `ddgBenchDataset` class in the `datasets.py` file, then add this dataset to the `SSM.py` script to get aggregated predictions for the whole dataset.
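A minimal sketch of such a subclass is shown below, assuming `ddgBenchDataset` follows the usual map-style dataset pattern; the constructor arguments and the fields returned by `__getitem__` are assumptions and should be matched against the existing datasets in `datasets.py`:

```python
# sketch of a custom dataset; column names and returned fields are placeholders
import pandas as pd
from datasets import ddgBenchDataset


class CustomDataset(ddgBenchDataset):
    """Hypothetical dataset backed by a CSV with one row per point mutation."""

    def __init__(self, csv_path, pdb_dir):
        self.df = pd.read_csv(csv_path)  # e.g., columns: pdb, mutation, ddG
        self.pdb_dir = pdb_dir           # directory holding the structures

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # return whatever fields the datasets in datasets.py return;
        # the column names used here are illustrative only
        row = self.df.iloc[idx]
        return row['pdb'], row['mutation'], row['ddG']
```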
The `thermompnn_benchmarking.py` script is set up to score different models on a `CustomDataset` object or on one of the datasets used in this study. An example inference SLURM script is provided at `examples/inference.sh`.
The main training script is `train_thermompnn.py`. To set up a training run, you must write a `config.yaml` file (example provided) to specify model hyperparameters. You also must provide a `local.yaml` file to tell ThermoMPNN where to find your data. These files serve as experiment logs as well.
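For orientation only, a `local.yaml` might look like the sketch below; the key names are assumptions, so use the example files shipped with the repository as the authoritative template:

```yaml
# hypothetical local.yaml -- key names are assumptions, not the confirmed schema
data_loc:
  megascale_csv: /path/to/megascale.csv   # training/validation data
  fireprot_csv: /path/to/fireprot.csv     # external benchmark data
  pdb_dir: /path/to/pdbs                  # structures referenced by the CSVs
```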
Training ThermoMPNN requires the use of a GPU. On a small dataset (<5,000 data points), training takes <30 s per epoch, while on a mega-scale dataset (>200,000 data points), it takes 8-12 min per epoch on a single V100 GPU. An example training SLURM script is provided at `examples/train.sh`.
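A training run would then be launched roughly as follows; the flag names here are assumptions, so defer to `examples/train.sh` for the actual command:

```bash
# hypothetical invocation -- check examples/train.sh for the real arguments
mamba activate thermoMPNN
python train_thermompnn.py --config config.yaml --local local.yaml
```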
For the purpose of replication and future benchmarking, the dataset splits used in this study are included as `.pkl` files under the `dataset_splits/` directory.
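Each split can be inspected with a standard pickle load; the filename below is a placeholder, and the layout of the stored object is an assumption to verify after loading:

```python
# inspect a dataset split file; filename and structure are placeholders
import pickle

with open('dataset_splits/splits.pkl', 'rb') as f:  # hypothetical filename
    splits = pickle.load(f)

print(type(splits))  # e.g., a dict mapping split names to lists of entry IDs
```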
ThermoMPNN model weights can be found in the `models/` directory. The following model weights are provided:
- `thermoMPNN_default.pt` (best ThermoMPNN model, trained on the Megascale training dataset)
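The checkpoint can be loaded and inspected with a plain `torch.load`; whether it holds a bare state dict or a full PyTorch Lightning checkpoint is an assumption to check after loading:

```python
# load the provided weights on CPU and peek at the stored keys
import torch

ckpt = torch.load('models/thermoMPNN_default.pt', map_location='cpu')
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt)[:10])  # first few keys of the checkpoint dict
```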
The datasets used in this study can be found in the following locations:
- Fireprot: https://doi.org/10.5281/zenodo.8169288
- Megascale: https://doi.org/10.5281/zenodo.7401274
- S669: https://doi.org/10.1093/bib/bbab555
- SSYM, P53, Myoglobin, etc.: https://protddg-bench.github.io