Molecular Property Prediction Pipeline

A machine learning pipeline that predicts molecular properties with Graph Neural Networks (GNNs), using molecular structures from XYZ files together with additional features. It includes automated hyperparameter optimization with Weights & Biases (wandb).

Overview

This pipeline processes molecular structure data from XYZ files along with additional features from a CSV file to predict specified molecular properties. It uses a Graph Neural Network approach to handle the varying sizes and structures of molecules, with automated hyperparameter optimization and experiment tracking.
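
To make the structure-to-graph step concrete, here is a minimal sketch of one common way to turn an XYZ file into a graph: parse the atoms, then connect every pair of atoms closer than a distance cutoff. The function names (`parse_xyz`, `build_edges`) and the cutoff values are assumptions made for illustration; the actual logic in src/features/featurizer.py may differ.

```python
import math
from typing import List, Tuple

Atom = Tuple[str, float, float, float]  # (element symbol, x, y, z)


def parse_xyz(text: str) -> List[Atom]:
    """Parse XYZ text: an atom count, a comment line, then 'El x y z' rows."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    atoms: List[Atom] = []
    for row in lines[2:2 + n_atoms]:
        symbol, x, y, z = row.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(atoms)}")
    return atoms


def build_edges(atoms: List[Atom], cutoff: float = 1.6) -> List[Tuple[int, int]]:
    """Connect atom pairs within `cutoff` angstroms; each bond is stored in both directions."""
    edges: List[Tuple[int, int]] = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            if math.dist(atoms[i][1:], atoms[j][1:]) <= cutoff:
                edges.append((i, j))
                edges.append((j, i))
    return edges


# Hypothetical example molecule (water); coordinates are in angstroms.
WATER = """3
water
O 0.000  0.000  0.117
H 0.000  0.757 -0.469
H 0.000 -0.757 -0.469
"""

atoms = parse_xyz(WATER)
edges = build_edges(atoms, cutoff=1.2)  # links O-H but not H-H
```

The node list and the bidirectional edge list are the two ingredients a graph library needs; a single fixed cutoff is the simplest choice, and element-specific cutoffs are a common refinement.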

Features

Processes molecular XYZ files for structural information
Integrates additional molecular features from CSV files
Converts molecular structures to graph representations
Automated hyperparameter optimization using wandb
Multiple GNN architectures (GCN, GAT, GraphConv)
Configurable target property prediction
Support for both CPU and GPU training
Parallel data processing and caching
Comprehensive logging and metric tracking
Progress visualization with wandb
Early stopping and learning rate scheduling
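
As an illustration of the early stopping feature listed above, the sketch below shows the usual patience-based rule: stop once the validation loss has failed to improve for a set number of epochs. The class name `EarlyStopping` and its parameters are assumptions for this example, not necessarily what the pipeline uses.

```python
class EarlyStopping:
    """Signal a stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 10, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta  # smallest decrease that counts as an improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Usage inside a training loop (the loss values here are made up):
stopper = EarlyStopping(patience=2)
for epoch, val_loss in enumerate([1.0, 0.8, 0.9, 0.85]):
    if stopper.step(val_loss):
        break  # two epochs in a row without improvement
```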

Project Structure

molecular_ml/
│
├── data/
│   ├── raw/
│   │   ├── xyz_files/    # Place your .xyz files here
│   │   └── features.csv  # Place your features file here
│   └── processed/        # Processed data and cache
│       └── cache/        # Zarr cache storage
│
├── src/
│   ├── data/            # Data loading and processing
│   │   ├── __init__.py
│   │   └── loader.py    # XYZ and CSV data loading
│   ├── features/        # Feature engineering
│   │   ├── __init__.py
│   │   └── featurizer.py # Graph creation and featurization
│   ├── models/          # ML models
│   │   ├── __init__.py
│   │   └── model.py     # GNN model implementations
│   └── utils/           # Utility functions
│       ├── __init__.py
│       └── logger.py    # Logging and W&B integration
│
├── configs/             # Configuration files
│   └── config.yaml     # Main configuration file
├── main.py             # Main execution script
├── requirements.txt    # Project dependencies
└── README.md          # This file

Installation

Clone the repository:

git clone [repository-url]
cd molecular_ml

Create and activate a virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate.bat

Install dependencies:

pip install -r requirements.txt

Set up Weights & Biases:

wandb login

Data Preparation

Place your XYZ files in data/raw/xyz_files/
Save your features CSV file as data/raw/features.csv

The features CSV file should contain:

An 'ID' column matching XYZ filenames (format: '000001.xyz')
Feature columns
The target property column
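
Before training, it can help to confirm that the CSV and the XYZ directory agree. The following is a small sketch using only the standard library; `check_ids` is a hypothetical helper, and it assumes the 'ID' values include the .xyz extension, as the format example above suggests.

```python
import csv
from pathlib import Path
from typing import List, Tuple, Union

PathLike = Union[str, Path]


def check_ids(
    features_file: PathLike, xyz_dir: PathLike, id_column: str = "ID"
) -> Tuple[List[str], List[str]]:
    """Return (CSV IDs with no XYZ file, XYZ files with no CSV row)."""
    with open(features_file, newline="") as handle:
        reader = csv.DictReader(handle)
        if id_column not in (reader.fieldnames or []):
            raise ValueError(f"column {id_column!r} not found in {features_file}")
        csv_ids = {row[id_column] for row in reader}
    xyz_names = {path.name for path in Path(xyz_dir).glob("*.xyz")}
    return sorted(csv_ids - xyz_names), sorted(xyz_names - csv_ids)
```

Calling `check_ids("data/raw/features.csv", "data/raw/xyz_files")` returns two lists; both should be empty when every row has a structure file and every structure file has a row.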

Configuration

The config.yaml file contains all settings including model architecture, training parameters, and hyperparameter sweep configurations.

Example configuration:

# Data paths
xyz_dir: "data/raw/xyz_files"
features_file: "data/raw/features.csv"
target_column: "Formation_E"
output_dir: "output"
cache_dir: "data/processed/cache"

# Wandb configuration
wandb:
  project: "molecular-property-prediction"
  entity: "your-username"
  tags: ["GNN", "molecular-properties"]
  sweep:
    method: "bayes"
    metric:
      name: "val_rmse"
      goal: "minimize"
    parameters:
      conv_type:
        values: ["GCN", "GAT", "GraphConv"]
      num_layers:
        values: [2, 3, 4, 5]
      hidden_channels:
        values: [32, 64, 128, 256]

# Model and training parameters...
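
The `sweep` section above already has the shape that `wandb.sweep` expects (method, metric, parameters), so it can be passed through almost unchanged. The sketch below shows one way this might be wired up; `build_sweep_config`, `run_sweep`, and the `CONFIG` dict are illustrative names, and the way main.py actually launches sweeps is not shown in this README.

```python
from typing import Any, Callable, Dict

# Plain-dict mirror of the `wandb` section in the example config above.
CONFIG: Dict[str, Any] = {
    "wandb": {
        "project": "molecular-property-prediction",
        "entity": "your-username",
        "sweep": {
            "method": "bayes",
            "metric": {"name": "val_rmse", "goal": "minimize"},
            "parameters": {
                "conv_type": {"values": ["GCN", "GAT", "GraphConv"]},
                "num_layers": {"values": [2, 3, 4, 5]},
                "hidden_channels": {"values": [32, 64, 128, 256]},
            },
        },
    }
}


def build_sweep_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the sweep definition and check that the required keys are present."""
    sweep = dict(config["wandb"]["sweep"])
    for key in ("method", "metric", "parameters"):
        if key not in sweep:
            raise KeyError(f"sweep config is missing {key!r}")
    return sweep


def run_sweep(config: Dict[str, Any], train_fn: Callable[[], None], count: int) -> None:
    """Register the sweep with W&B and run `count` trials (requires `wandb login`)."""
    import wandb  # imported lazily so build_sweep_config works without wandb installed

    project = config["wandb"]["project"]
    entity = config["wandb"]["entity"]
    sweep_id = wandb.sweep(build_sweep_config(config), entity=entity, project=project)
    wandb.agent(sweep_id, function=train_fn, entity=entity, project=project, count=count)
```

Here `train_fn` stands for a training function that calls `wandb.init()` and reads its hyperparameters from `wandb.config`.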

Usage

Single Training Run

python main.py --config configs/config.yaml

Hyperparameter Optimization

python main.py --config configs/config.yaml --sweep --count 50
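
For reference, the three flags used in the commands above could be declared with `argparse` roughly as follows. This is a sketch of a plausible interface, not a copy of main.py; the help texts and defaults are assumptions.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the command-line flags shown in the usage examples."""
    parser = argparse.ArgumentParser(description="Train a GNN for molecular property prediction")
    parser.add_argument("--config", required=True, help="path to the YAML configuration file")
    parser.add_argument("--sweep", action="store_true", help="run a wandb hyperparameter sweep")
    parser.add_argument("--count", type=int, default=None, help="number of sweep trials to run")
    return parser.parse_args(argv)


# Equivalent to: python main.py --config configs/config.yaml --sweep --count 50
args = parse_args(["--config", "configs/config.yaml", "--sweep", "--count", "50"])
```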

Monitoring Training

Navigate to your wandb project page: https://wandb.ai/[username]/[project-name]
View real-time training metrics, parameter importance, and results visualization

Model Architecture

The pipeline supports multiple GNN architectures:

Graph Convolutional Networks (GCN)
Graph Attention Networks (GAT)
General Graph Convolution (GraphConv)

Features:

Configurable number of layers
Residual connections
Batch normalization
Dropout regularization
Global pooling
Multi-layer prediction head
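
The data flow behind these features can be illustrated without any dependencies. The sketch below is a deliberately simplified stand-in, not the model in src/models/model.py: it shows one weight-free message-passing step with a residual connection, followed by global mean pooling, which is what lets molecules of different sizes map to a fixed-size vector. Learned weights, batch normalization, dropout, and the prediction head are omitted, and all function names are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def aggregate_neighbors(x: List[Vector], edges: List[Tuple[int, int]]) -> List[Vector]:
    """Mean of the incoming neighbors' features for each node (zeros if isolated)."""
    dim = len(x[0])
    sums = [[0.0] * dim for _ in x]
    counts = [0] * len(x)
    for src, dst in edges:
        counts[dst] += 1
        for k in range(dim):
            sums[dst][k] += x[src][k]
    return [[s / c if c else 0.0 for s in row] for row, c in zip(sums, counts)]


def residual_layer(x: List[Vector], edges: List[Tuple[int, int]]) -> List[Vector]:
    """One message-passing step: ReLU of the aggregated message, added back to the input."""
    messages = aggregate_neighbors(x, edges)
    return [
        [xi + max(mi, 0.0) for xi, mi in zip(node, msg)]
        for node, msg in zip(x, messages)
    ]


def global_mean_pool(x: List[Vector]) -> Vector:
    """Average node features into one vector per molecule, regardless of atom count."""
    return [sum(column) / len(x) for column in zip(*x)]


# Three-atom chain 0-1-2 with one feature per atom; edges are listed in both directions.
features = [[1.0], [2.0], [3.0]]
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
hidden = residual_layer(features, edges)
molecule_vector = global_mean_pool(hidden)
```

In a PyTorch Geometric model the aggregation step would be replaced by a learned layer such as GCN, GAT, or GraphConv, and the pooled vector would feed the prediction head.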

Advanced Features

Caching

Processed molecular graphs are cached using Zarr
Subsequent runs load graphs from the cache instead of rebuilding them
Configurable cache location

Parallel Processing

Multi-threaded data loading
Parallel graph construction
GPU acceleration when available

Hyperparameter Optimization

Bayesian optimization using wandb
Configurable parameter spaces
Automated tracking of all experiments
Parameter importance visualization

Logging and Monitoring

Comprehensive logging to files
Real-time metric tracking in wandb
Progress bars for all operations
Detailed error reporting

Requirements

Python 3.11+
PyTorch
PyTorch Geometric
Weights & Biases
Zarr
Additional dependencies are listed in requirements.txt

Contributing

Fork the repository
Create a feature branch
Commit your changes
Push to the branch
Create a Pull Request

Please ensure:

Code follows project structure
New features include tests
Documentation is updated
Commit messages are descriptive

License

[Your chosen license]

Contact

[Your contact information]

Acknowledgments

PyTorch Geometric team for the GNN implementations
Weights & Biases for experiment tracking
[Additional acknowledgments]

Repository: roupenminassian/molecular_ml
