MolGen-Transformer is a transformer-based generative AI model designed for molecular generation and latent space exploration, specifically targeting π-conjugated molecules. By leveraging a latent-space-centered approach and a SELFIES-based molecular representation, MolGen-Transformer ensures 100% molecular reconstruction accuracy, enabling robust and reliable generation of chemically meaningful molecules.
New! MolGen-Transformer is now trained on the OCELOT Plus dataset, an expanded dataset that provides more comprehensive coverage of π-conjugated molecules.
This repository provides the MolGen-Transformer model, sampling methods, and analysis tools for generative molecular design, facilitating AI-driven chemical discovery, structure optimization, and property-based molecular exploration.
This work is described in detail in our paper:
*MolGen-Transformer: A Molecule Language Model for the Generation and Latent Space Exploration of π-Conjugated Molecules*, available on ChemRxiv.
MolGen-Transformer addresses major challenges in generative molecular AI, including chemical space coverage, latent space interpretability, and generation reliability, by implementing the following capabilities:
- Diverse Molecular Generation: Randomly samples molecules from latent space to ensure diverse structural outputs.
- Controlled Molecular Generation: Allows similarity-controlled generation for tuning molecular diversity and resemblance.
- Molecular Interpolation: Identifies intermediate structures between two molecules, aiding in reaction pathway discovery.
- Local Molecular Generation: Enables the refinement and optimization of molecules by manipulating latent space vectors locally.
- Neighboring Search: Iteratively searches neighboring molecules to optimize a given molecular property using a multi-fidelity model.
- Molecular Evolution: Evolves molecules along a path in latent space, allowing progressive optimization from a starting molecule to a target structure.
- SMILES & SELFIES Conversion: Converts SMILES to latent space representations and vice versa.
- Trained on: ~198 million organic molecules
- Latent Space Encoding: SELFIES representation for guaranteed chemical validity
- Computation Mode: GPU acceleration supported for efficient molecular generation
- Output Storage: Customizable report save path for logs and results
- GPU Mode: Enables computations on a GPU to speed up processing.
- Report Save Path: Specifies the directory for saving outputs and logs.
- GenerateMethods Usage Guide
- Installation
- Quick Start
- Configuration Files
- Additional Information and Notes for Installation
```bash
git clone https://github.com/baskargroup/MolTransformer_repo.git
cd MolTransformer_repo
```
If you don’t already have conda, install Miniconda or Anaconda.
Note: For systems without conda (e.g., using micromamba), see the Additional Information and Notes.
Create and activate a new environment:
```bash
conda create -n moltransformer python=3.9
conda activate moltransformer
```
Install RDKit:
```bash
conda install -c conda-forge rdkit
```
Install PyTorch:
If you have an NVIDIA GPU:
```bash
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
```
(Adjust the cudatoolkit version based on your hardware.)
For CPU-only PyTorch:
```bash
conda install pytorch torchvision torchaudio cpuonly -c pytorch
```
Install Remaining Dependencies:
```bash
pip install -r requirements.txt
```
Install MolTransformer as an editable package (for development and the latest updates):
```bash
pip install -e .
```
Or install the stable release directly from PyPI:
```bash
pip install moltransformer
```
Verify your installation by running:
```python
import MolTransformer
print(MolTransformer.__file__)
```
If there are no errors, your setup is complete!
This example demonstrates how to randomly generate a specified number of molecular structures across the latent space. Note that the number of unique molecules may be fewer than requested because duplicates can occur; a deduplication sketch follows the example below.
```python
from MolTransformer import GenerateMethods

GM = GenerateMethods(save=True)  # set save=True to save results and logs
smiles_list, selfies_list = GM.global_molecular_generation(n_samples=100)
```
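Because duplicates can occur, you may want to canonicalize and deduplicate the output. A minimal post-processing sketch, assuming RDKit is available (it is installed above) and that `smiles_list` holds plain SMILES strings:

```python
from rdkit import Chem

# Canonicalize and deduplicate the generated SMILES (post-processing sketch;
# assumes `smiles_list` comes from global_molecular_generation above).
unique_smiles = set()
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    if mol is not None:  # skip anything RDKit cannot parse
        unique_smiles.add(Chem.MolToSmiles(mol))
print(f"{len(unique_smiles)} unique molecules out of {len(smiles_list)} samples")
```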
Generates new molecules around an initial molecule by exploring its local latent space neighborhood. By default, it saves generated results and provides an option to select top-k closest molecules.
Parameters:
- `initial_smile` (str, optional): The SMILES string of your reference molecule. If omitted, a random molecule is selected from the dataset.
- `report_save_path` (str, optional): Path to save generated results. Default: `output/GenerateMethods/`.
```python
from MolTransformer import GenerateMethods

GM = GenerateMethods(save=True)
generated_results = GM.local_molecular_generation(
    dataset='qm9', random_initial_smile=False, initial_smile='C1=CC=CC=C1', num_vector=30)
print("Generated SMILES:", generated_results['SMILES'])
print("Generated SELFIES:", generated_results['SELFIES'])
```
Performs minimal perturbations using a binary search in the latent space around a provided molecule, generating closely related molecular variants. Ideal for iterative exploration and optimization of molecular structures. The perturbation magnitude can be controlled with parameters.
Parameters:
- `initial_smile` (str): SMILES string of the reference molecule (required).
- `search_range` (float, optional): Maximum perturbation range (default: `10.0`).
- `resolution` (float, optional): Perturbation resolution (default: `0.1`).
- `report_save_path` (str, optional): Path to save generated results.
```python
from MolTransformer import GenerateMethods

GM = GenerateMethods(save=True, report_save_path='./output/NeighborSearch/')
initial_smile = 'C1=CC=CC=C1'  # benzene
generated_results, fail_cases = GM.neighboring_search(initial_smile=initial_smile, num_vector=20)
print('Generated SMILES:', generated_results['SMILES'])
print('Generated SELFIES:', generated_results['SELFIES'])
```
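The neighboring search can be chained into the iterative, property-driven optimization loop described earlier. Below is a greedy hill-climbing sketch, where `score` is a hypothetical user-supplied property predictor (e.g., a trained multi-fidelity model) and not part of the package:

```python
from MolTransformer import GenerateMethods

def score(smiles):
    # Hypothetical stand-in for a user-supplied property predictor
    # (e.g., a multi-fidelity model); higher is assumed to be better.
    return len(smiles)  # toy scoring function for illustration only

GM = GenerateMethods()
current = 'C1=CC=CC=C1'
for step in range(3):  # a few greedy improvement iterations
    results, _ = GM.neighboring_search(initial_smile=current, num_vector=20)
    candidates = [s for s in results['SMILES'] if s]  # drop empty strings, if any
    best = max(candidates, key=score, default=current)
    if score(best) <= score(current):
        break  # no neighbor improves the property; stop early
    current = best
print('Optimized molecule:', current)
```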
This example shows how to manually manipulate the latent space representation of a molecule to explore structural variations.
Example usage:
```python
import numpy as np  # needed if you perturb the latent space below

from MolTransformer import GenerateMethods

# Initialize generator
GM = GenerateMethods()

# Select or define an initial SMILES molecule
initial_smile = GM.random_smile(dataset='qm9')
print('Initial SMILES:', initial_smile)

# Convert SMILES to latent space
latent_space = GM.smiles_2_latent_space([initial_smile])
print('Latent Space Shape:', latent_space.shape)

# Manually modify the latent space here if desired, e.g.:
# latent_space += np.random.normal(0, 0.1, latent_space.shape)

# Convert the modified latent space back to SMILES/SELFIES
edit_results = GM.latent_space_2_strings(latent_space)
print('Edited SMILES:', edit_results['SMILES'][0])
print('Edited SELFIES:', edit_results['SELFIES'][0])
```
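Building on the same conversion helpers, a simple linear interpolation between two latent vectors can be sketched as below. This assumes the latent arrays support NumPy arithmetic, as the example above suggests; for the supported end-to-end path, use `molecular_evolution` in the next section:

```python
import numpy as np
from MolTransformer import GenerateMethods

GM = GenerateMethods()

# Encode two molecules and walk the straight line between their latent vectors
# (illustrative sketch; molecular_evolution below is the built-in route).
z_start = GM.smiles_2_latent_space(['c1ccccc1'])  # benzene
z_end = GM.smiles_2_latent_space(['C1CCCCC1'])    # cyclohexane
for alpha in np.linspace(0.0, 1.0, 5):
    z = (1.0 - alpha) * z_start + alpha * z_end
    decoded = GM.latent_space_2_strings(z)
    print(f"alpha={alpha:.2f} -> {decoded['SMILES'][0]}")
```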
The `molecular_evolution` function generates intermediate molecules along the latent space pathway connecting two specified molecules.

Parameters:
- `start_molecule` (str): SMILES string of the initial molecule.
- `end_molecule` (str): SMILES string of the target molecule.
- `number` (int): Number of intermediate molecules to generate.

If `save=True`, the function automatically saves the results, including the generated molecules, similarity scores, and visualizations.
Example Usage:
```python
from MolTransformer import GenerateMethods

# Initialize with output saving enabled
GM = GenerateMethods(report_save_path='/path/to/save/reports/', save=True)

# Define starting and target molecules (SMILES)
start_molecule = 'c1ccccc1'  # benzene
end_molecule = 'C1CCCCC1'    # cyclohexane

# Generate intermediate molecules
results_df = GM.molecular_evolution(start_molecule, end_molecule, number=100)

# Display generated molecules and similarities
print(results_df[['SMILES', 'distance_ratio', 'similarity_start', 'similarity_end']])
```
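If `results_df` is a pandas DataFrame, as the column access above suggests, the similarity profile along the path can be plotted directly. A small sketch using matplotlib, with column names taken from the example above:

```python
import matplotlib.pyplot as plt

# Plot similarity to the start and end molecules along the latent path.
plt.plot(results_df['distance_ratio'], results_df['similarity_start'], label='similarity to start')
plt.plot(results_df['distance_ratio'], results_df['similarity_end'], label='similarity to end')
plt.xlabel('distance ratio along latent path')
plt.ylabel('similarity')
plt.legend()
plt.savefig('evolution_similarity.png', dpi=150)
```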
To customize and effectively utilize the MolGen-Transformer examples provided, edit the configuration parameters in the following files:
- `gpu_mode` (bool, default: `false`): Controls whether MolGen-Transformer uses parallel GPU computation.
  - If `true`, the package runs in distributed parallel mode across multiple GPUs. Set to `true` only if parallel GPU execution is explicitly required.
  - If `false` (default), MolGen-Transformer automatically detects whether CUDA is available and runs on a single GPU, falling back to CPU if no GPU is available.
- `output_folder_name` (str): Directory name for saving generated outputs and reports.
Example:
```json
{
  "gpu_mode": false,
  "output_folder_name": "generated_molecules"
}
```
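The file can also be edited programmatically with the standard `json` module. A minimal sketch, assuming `config.json` sits in your working directory (adjust the path to wherever your checkout keeps it):

```python
import json

# Load the configuration, override a couple of fields, and write it back.
with open('config.json') as f:
    config = json.load(f)
config['gpu_mode'] = False
config['output_folder_name'] = 'generated_molecules'
with open('config.json', 'w') as f:
    json.dump(config, f, indent=2)
```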
- `model_mode` (str): Determines the model used for generation. Default is `"SS"` (self-supervised).
- `gpu_mode` (bool): Same functionality as described above (`config.json`).
- `batch_size` (int): Batch size for molecular generation and inference.
- `report_save_path` (str): Path for saving logs and generated reports.
- `model_save_folder` (str): Directory for storing trained model checkpoints.
- `data_path` (dict): File paths for user-provided datasets for training/testing. Required only when using custom data.
Example:
```json
{
  "model_mode": "SS",
  "gpu_mode": false,
  "batch_size": 64,
  "report_save_path": "./output/reports",
  "model_save_folder": "./output/models",
  "data_path": {
    "train": [["train.csv"]],
    "test": [["test.csv"]]
  }
}
```
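Before launching a run with custom data, it can help to confirm that every CSV referenced in `data_path` actually exists. A sanity-check sketch; `train_config.json` is a hypothetical filename, so substitute the actual config file from the repository:

```python
import json
import os

# Verify that all dataset files referenced in data_path are present.
with open('train_config.json') as f:  # hypothetical filename; use your actual config
    cfg = json.load(f)
for split, groups in cfg.get('data_path', {}).items():
    for group in groups:  # data_path values are nested lists, e.g. [["train.csv"]]
        for path in group:
            status = 'ok' if os.path.exists(path) else 'MISSING'
            print(f"{split}: {path} [{status}]")
```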
Adjust these configurations based on your computational resources and experimental needs.
Some HPC systems remove or disable conda, but provide micromamba as a lightweight conda-like tool. To install MolTransformer under micromamba, do:
- Load micromamba:
```bash
module purge
module load micromamba
eval "$(micromamba shell hook --shell=bash)"
```
- Create and activate a new environment:
```bash
micromamba create -p /path/to/my_moltransformer_env python=3.9 -c conda-forge
micromamba activate /path/to/my_moltransformer_env
```
If your cluster only offers Python 3.13, you may see a warning about the removed `imp` module. We've patched `hostlist` in our code to avoid this issue, but be aware that older versions of `hostlist` or `ansiblecmdb` might still reference `imp`.
- Install GPU PyTorch:
```bash
micromamba install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
```
- Install RDKit (if needed):
```bash
micromamba install -c conda-forge rdkit
```
- Option 1: Install from the repository:
```bash
pip install -r requirements.txt
pip install -e .
```
- Option 2: Install via PyPI:
```bash
pip install moltransformer
```
Verify the installation:
```bash
python -c "import MolTransformer; print(MolTransformer.__file__)"
```
Python 3.13 fully removes the old `imp` module. If you see `ModuleNotFoundError: No module named 'imp'`, it usually means a library (e.g., `ansiblecmdb`, `hostlist`) hasn't migrated to `importlib`. We've patched our references to `hostlist` so it no longer imports `imp`. If you still encounter problems, make sure you're on our latest codebase. If your cluster forcibly uses Python 3.13, you might need to manually patch or remove libraries that still depend on `imp`. Alternatively, if your HPC environment allows it, use Python 3.12 or earlier to avoid this issue entirely.
If this error occurs on your system, follow these steps:
- Open the offending file in a text editor:
```bash
nano /work/mech-ai/bella/my_moltransformer_env_py312/lib/python3.13/site-packages/ansiblecmdb/render.py
```
- Replace the line:
```python
import imp
```
with:
```python
import importlib
```