Write Getting started with MassSpecGym
roman-bushuiev committed Oct 28, 2024
1 parent 3d1f083 commit 1e9c293
Showing 1 changed file with 51 additions and 5 deletions.
56 changes: 51 additions & 5 deletions README.md
@@ -11,13 +11,21 @@
<img src="https://raw.githubusercontent.com/pluskal-lab/MassSpecGym/5d7d58af99947988f947eeb5bd5c6a472c2938b7/assets/MassSpecGym_abstract.svg" width="80%"/>
</p>

MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra. The provided challenges abstract the process of scientific discovery from biological and environmental samples into well-defined machine learning problems.
MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra:

- 💥 ***De novo* molecular generation** (input - MS/MS spectrum, output - molecular structure)
    - 🎆 **Bonus chemical formulae challenge** (input - MS/MS spectrum and chemical formula, output - molecular structure)
- 💥 **Molecular retrieval** (input - MS/MS spectrum, output - ranked list of candidate molecular structures)
    - 🎆 **Bonus chemical formulae challenge** (input - MS/MS spectrum and chemical formula, output - ranked list of candidate molecular structures)
- 💥 **Spectrum simulation** (input - molecular structure, output - MS/MS spectrum)

The provided challenges abstract the process of scientific discovery from biological and environmental samples into well-defined machine learning problems with pre-defined datasets, data splits, and evaluation metrics.

<!-- [![Dataset on Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/dataset-on-hf-md-dark.svg)](https://huggingface.co/datasets/roman-bushuiev/MassSpecGym) -->

📣 The paper will be available soon!

## Installation
## 📦 Installation

Installation is available via `pip`:

@@ -44,17 +52,51 @@ pip install massspecgym[notebooks, dev]
pip install -U torch==2.3.0 --index-url https://download.pytorch.org/whl/rocm6.0
``` -->

## MassSpecGym infrastructure
## 🍩 Getting started with MassSpecGym

<p align="center">
<img src="https://raw.githubusercontent.com/pluskal-lab/MassSpecGym/5d7d58af99947988f947eeb5bd5c6a472c2938b7/assets/MassSpecGym_infrastructure.svg" width="80%"/>
</p>

## Train and evaluate your model 🚀
MassSpecGym’s infrastructure consists of predefined components that serve as building blocks for the implementation and evaluation of new models.

First, the MassSpecGym dataset is available as a [Hugging Face dataset](https://huggingface.co/datasets/roman-bushuiev/MassSpecGym) and can be loaded directly into a pandas DataFrame:

```python
from massspecgym.utils import load_massspecgym
df = load_massspecgym()
```

Second, MassSpecGym provides [a set of transforms](https://github.com/pluskal-lab/MassSpecGym/blob/main/massspecgym/data/transforms.py) for spectra and molecules, which can be used to preprocess data for machine learning models. These transforms can be applied alongside the `MassSpecDataset` class (or its subclasses), resulting in a PyTorch `Dataset` object that implicitly applies the specified transforms to each data point. Note that `MassSpecDataset` also downloads the dataset from the Hugging Face repository as needed.

```python
from massspecgym.data import MassSpecDataset
# The transforms live in massspecgym/data/transforms.py (linked above)
from massspecgym.data.transforms import SpecTokenizer, MolFingerprinter

dataset = MassSpecDataset(
    spec_transform=SpecTokenizer(n_peaks=60),
    mol_transform=MolFingerprinter(),
)
```
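
To give an intuition for what a spectrum transform with a fixed number of peaks does, here is a minimal, library-independent sketch. It assumes that limiting a spectrum to `n_peaks` means keeping the most intense peaks and zero-padding shorter spectra. The function `top_n_peaks` is hypothetical and is not part of MassSpecGym; the actual `SpecTokenizer` may behave differently.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def top_n_peaks(peaks: List[Peak], n_peaks: int) -> List[Peak]:
    """Hypothetical helper: fixed-length representation of a spectrum.

    Assumption: keep the n_peaks most intense peaks, order them by m/z,
    and pad with (0.0, 0.0) if the spectrum has fewer peaks.
    """
    # Select the most intense peaks first
    kept = sorted(peaks, key=lambda p: p[1], reverse=True)[:n_peaks]
    # Restore the natural m/z ordering of the retained peaks
    kept.sort(key=lambda p: p[0])
    # Pad so that every spectrum has exactly n_peaks entries
    padding = [(0.0, 0.0)] * (n_peaks - len(kept))
    return kept + padding


spectrum = [(91.05, 0.4), (119.05, 1.0), (65.04, 0.1), (147.04, 0.7)]
tokens = top_n_peaks(spectrum, n_peaks=3)
short = top_n_peaks([(50.0, 1.0)], n_peaks=3)
print(tokens)
print(short)
```

A fixed length makes it possible to stack spectra with different numbers of peaks into a single batch tensor.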

Third, MassSpecGym provides a `MassSpecDataModule`, a PyTorch Lightning [LightningDataModule](https://lightning.ai/docs/pytorch/stable/data/datamodule.html) that automatically handles data splitting into training, validation, and testing folds.

```python
from massspecgym.data import MassSpecDataModule

data_module = MassSpecDataModule(
    dataset=dataset,
    batch_size=32
)
```
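
Conceptually, a data module partitions the data points into folds and serves them in batches. The sketch below illustrates this idea in plain Python, assuming that each data point carries a predefined fold label. The helpers `split_by_fold` and `batches`, as well as the label names, are hypothetical and are not the MassSpecGym API.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def split_by_fold(fold_labels: Sequence[str]) -> Dict[str, List[int]]:
    """Hypothetical helper: group data point indices by their fold label."""
    indices = defaultdict(list)
    for i, fold in enumerate(fold_labels):
        indices[fold].append(i)
    return dict(indices)


def batches(indices: List[int], batch_size: int) -> List[List[int]]:
    """Hypothetical helper: chunk indices into batches (the last one may be smaller)."""
    return [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]


# Assumed fold labels, one per data point
folds = ["train", "train", "val", "test", "train", "val", "train"]
splits = split_by_fold(folds)
train_batches = batches(splits["train"], batch_size=3)
print(splits)
print(train_batches)
```

Because the folds are fixed in advance rather than sampled at random, every model is trained and evaluated on the same data points.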

Finally, MassSpecGym defines evaluation metrics by implementing abstract subclasses of `LightningModule` for each of the MassSpecGym challenges: [`DeNovoMassSpecGymModel`](https://github.com/pluskal-lab/MassSpecGym/blob/df2ff567ed5ad60244b4106a180aaebc3c787b7e/massspecgym/models/de_novo/base.py#L14), [`RetrievalMassSpecGymModel`](https://github.com/pluskal-lab/MassSpecGym/blob/df2ff567ed5ad60244b4106a180aaebc3c787b7e/massspecgym/models/retrieval/base.py#L14), and [`SimulationMassSpecGymModel`](https://github.com/pluskal-lab/MassSpecGym/blob/df2ff567ed5ad60244b4106a180aaebc3c787b7e/massspecgym/models/simulation/base.py#L12). To implement a custom model, inherit from the appropriate abstract class and implement the `forward` and `step` methods. This procedure is described in the next section. If you are looking for more examples, please see the [`massspecgym/models`](https://github.com/pluskal-lab/MassSpecGym/tree/df2ff567ed5ad60244b4106a180aaebc3c787b7e/massspecgym/models) folder.

## 🚀 Train and evaluate your model

MassSpecGym allows you to implement, train, validate, and test your model with a few lines of code. Built on top of PyTorch Lightning, MassSpecGym abstracts data preparation and splitting while eliminating boilerplate code for training and evaluation loops. To train and evaluate your model, you only need to implement your custom architecture and prediction logic.

Below is an example of how to implement a simple model based on [DeepSets](https://arxiv.org/abs/1703.06114) for the molecule retrieval task. The model is trained to predict the fingerprint of a molecule from its spectrum and then retrieves the most similar molecules from a set of candidates based on fingerprint similarity. For more examples, please see `notebooks/demo.ipynb`.
Below is an example of how to implement a simple model based on [DeepSets](https://arxiv.org/abs/1703.06114) for the molecule retrieval task. The model is trained to predict the fingerprint of a molecule from its spectrum and then retrieves the most similar molecules from a set of candidates based on fingerprint similarity. For more examples, please see [`notebooks/demo.ipynb`](https://github.com/pluskal-lab/MassSpecGym/blob/df2ff567ed5ad60244b4106a180aaebc3c787b7e/notebooks/demo.ipynb).
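
Independently of the step-by-step implementation that follows, the retrieval principle described above (ranking candidates by similarity to a predicted fingerprint) can be illustrated in plain Python. This is a minimal sketch under stated assumptions: fingerprints are binary lists, Tanimoto similarity is used as the similarity measure, and the names `tanimoto` and `rank_candidates` are hypothetical. It is not the MassSpecGym implementation.

```python
from typing import Dict, List, Sequence, Tuple


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Tanimoto similarity of two equal-length binary fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def rank_candidates(
    predicted: Sequence[int], candidates: Dict[str, Sequence[int]]
) -> List[Tuple[str, float]]:
    """Hypothetical helper: sort candidates from most to least similar."""
    scored = [(name, tanimoto(predicted, fp)) for name, fp in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Assumed: a (binarized) fingerprint predicted from a spectrum
predicted_fp = [1, 1, 0, 1, 0, 0]
# Assumed: fingerprints of the candidate molecules
candidates = {
    "cand_b": [1, 0, 0, 1, 1, 0],
    "cand_c": [0, 0, 1, 0, 0, 1],
    "cand_a": [1, 1, 0, 1, 0, 0],
}
ranking = rank_candidates(predicted_fp, candidates)
print(ranking)
```

A retrieval model is then judged by how close to the top of this ranking the true molecule appears.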

1. Import necessary modules:

@@ -165,6 +207,10 @@ trainer.fit(model, datamodule=data_module)
trainer.test(model, datamodule=data_module)
```

## Submit your results to the leaderboard

TODO

## References

If you use MassSpecGym in your work, please cite the following paper: