Website: Nextflow Graph Machine Learning
A Nextflow pipeline demonstrating how to train graph neural networks for gene regulatory network reconstruction using DREAM5 data.
- Nextflow Graph Machine Learning
- Table of contents
- Introduction
- The Nextflow pipeline
- Python Environment
- ArangoDB
The purpose of this project is to provide a simple demonstration of how to construct a Nextflow pipeline, with MLOps integration, for performing gene regulatory network (GRN) reconstruction using graph neural networks (GNNs). In practice, GRN reconstruction is an unsupervised link prediction problem.
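To make the link prediction framing concrete, here is a minimal sketch of how candidate regulatory edges could be ranked once node embeddings exist. It assumes an inner-product decoder, which is a common choice for graph autoencoders but is not confirmed to be the decoder used in this repository. The gene names, embedding values, and function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw score to the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-x))


def score_edge(z_u, z_v) -> float:
    """Inner-product decoder: sigmoid of the dot product of two embeddings."""
    dot = sum(a * b for a, b in zip(z_u, z_v))
    return sigmoid(dot)


def rank_candidate_edges(embeddings, candidates):
    """Score each (u, v) candidate and sort from most to least likely."""
    scored = [
        (u, v, score_edge(embeddings[u], embeddings[v])) for u, v in candidates
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Hypothetical toy embeddings (assumption: not taken from DREAM5).
embeddings = {
    "TF1": [1.0, 0.5],
    "geneA": [0.9, 0.4],
    "geneB": [-1.0, 0.2],
}

ranked = rank_candidate_edges(
    embeddings, [("TF1", "geneB"), ("TF1", "geneA")]
)
for source, target, probability in ranked:
    print(f"{source} -> {target}: {probability:.3f}")
```

Pairs whose embeddings point in similar directions receive scores above 0.5 and are ranked first as candidate regulatory links.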
For developing GNNs, we use PyTorch Geometric.
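PyTorch Geometric conventionally stores graph connectivity as an `edge_index` of shape `[2, num_edges]`, with source node indices in the first row and target node indices in the second. The sketch below builds the equivalent plain Python lists from a named edge list, so it runs without PyTorch installed. The function name `build_edge_index` and the gene names are hypothetical, and treating the graph as undirected is an assumption.

```python
def build_edge_index(edges, undirected=True):
    """Map node names to integer ids and return (node_ids, edge_index).

    edge_index is [sources, targets], mirroring the [2, num_edges] layout.
    """
    node_ids = {}
    for u, v in edges:
        for name in (u, v):
            if name not in node_ids:
                node_ids[name] = len(node_ids)

    sources, targets = [], []
    for u, v in edges:
        sources.append(node_ids[u])
        targets.append(node_ids[v])
        # Undirected graphs are stored with both edge directions.
        if undirected and u != v:
            sources.append(node_ids[v])
            targets.append(node_ids[u])
    return node_ids, [sources, targets]


# Hypothetical toy edge list.
edges = [("G1", "G2"), ("G2", "G3")]
node_ids, edge_index = build_edge_index(edges)
print(node_ids)
print(edge_index)
```

With PyTorch available, `torch.tensor(edge_index, dtype=torch.long)` would give a tensor suitable for the `edge_index` argument of `torch_geometric.data.Data`.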
Nextflow has been included to orchestrate the GRN reconstruction pipeline.
The pipeline is composed of the following steps:
- Exploratory data analysis: View the GRN and calculate some summary statistics.
- Processing: Process the graph feature matrix and edge list. Remove the disconnected subgraph.
- ArangoDB Importing: Import the graph into ArangoDB.
- GNN training: Train a GNN using SAGE convolutional layers.
- GNN training: Train a variational autoencoder GNN, and save the neural embeddings.
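The processing step above mentions removing the disconnected subgraph. One plausible reading, sketched below, is to keep only the largest connected component of the graph and drop every edge outside it. This is an assumption about the intent, not a copy of the pipeline's code; the function names and toy edges are hypothetical, and edges are treated as undirected when computing connectivity.

```python
from collections import defaultdict, deque


def largest_connected_component(edges):
    """Return the node set of the largest (weakly) connected component."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    visited = set()
    best = set()
    for start in adjacency:
        if start in visited:
            continue
        # Breadth-first search collects one component at a time.
        component = {start}
        visited.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in visited:
                    visited.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        if len(component) > len(best):
            best = component
    return best


def filter_edges(edges, keep):
    """Keep only edges whose endpoints are both in the retained node set."""
    return [(u, v) for u, v in edges if u in keep and v in keep]


# Hypothetical toy graph: G4-G5 form a small disconnected subgraph.
edges = [("G1", "G2"), ("G2", "G3"), ("G4", "G5")]
keep = largest_connected_component(edges)
filtered = filter_edges(edges, keep)
print(sorted(keep))
print(filtered)
```

The rows of the feature matrix would need the same filter so that node features and the edge list stay aligned.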
Run nextflow.sh to execute the full pipeline.
Run clean_nf.sh to clean up the output and logging files from the Nextflow run.
Python dependencies are specified in the requirements.txt file.
These dependencies are installed during the build process for the following Docker image: ghcr.io/jbris/nextflow-graph-machine-learning:1.0.0
Execute the following command to pull the image: docker pull ghcr.io/jbris/nextflow-graph-machine-learning:1.0.0
- A Docker Compose file has been provided to launch an MLOps stack.
- See the .env file for Docker environment variables.
- The docker_up.sh script can be executed to launch the Docker services.
- DVC is included for data version control.
- MLflow is available for experiment tracking.
- MinIO is available for storing experiment artifacts.