Spiking Multi-Omics Transformer (MOT) Model for Pan-Cancer Classification

The project report is available in the repository as ReportMOTtoSNN.pdf.

Introduction

The advent of high-throughput techniques has generated vast and diverse omics datasets, including genomics, transcriptomics, proteomics, metabolomics, and lipidomics. These datasets provide new opportunities for personalized medicine, allowing for a deeper understanding of patients' conditions. Traditionally, research has focused on single-omics studies, but there is a growing trend towards multi-omics approaches. Integrating multiple omics types offers a more comprehensive view, particularly in the study of complex diseases such as cancer, central nervous system disorders, and cardiovascular diseases.

This project leverages the Multi-Omics Transformer (MOT) architecture to classify multi-omics data. In addition, we explore the transformation of the MOT architecture into a spiking neural network, which aims to mimic the way neurons process information in the brain, to evaluate its performance compared to the traditional model.
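To make the spiking idea concrete, here is a minimal sketch of a leaky integrate-and-fire (LIF) neuron, a common building block of spiking networks. It is illustrative only: this README does not state which neuron model the project uses, and the names `lif_neuron`, `beta`, and `threshold`, as well as their default values, are assumptions.

```python
def lif_neuron(inputs, beta=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron over a sequence of input currents.

    beta is the per-step membrane decay factor and threshold is the firing level.
    Both defaults are illustrative assumptions, not values taken from the project.
    """
    membrane = 0.0
    spikes = []
    for current in inputs:
        # Leak the previous potential, then integrate the new input.
        membrane = beta * membrane + current
        if membrane >= threshold:
            spikes.append(1)
            membrane -= threshold  # soft reset: subtract the threshold after a spike
        else:
            spikes.append(0)
    return spikes


if __name__ == "__main__":
    # The neuron fires only once enough input has accumulated.
    print(lif_neuron([0.5, 0.5, 0.5, 0.0, 1.2]))  # -> [0, 0, 1, 0, 1]
```

Unlike a conventional activation, the output is a binary spike train, and the neuron carries state (the membrane potential) from one time step to the next.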

Dataset Overview

The dataset used in this project is the TCGA pan-cancer dataset, which is publicly available on the UCSC Xena data portal. This dataset includes samples from 33 different tumor types and incorporates five distinct omics data types:

- mRNA (RNA-Seq gene expression): Gene expression profiles with 20,532 gene identifiers, normalized through a log2 transformation.
- DNA Methylation: Data derived from the Illumina Infinium Human Methylation BeadChip arrays, with 485,578 probes (not used in this project).
- Copy Number Variations (CNVs): Profiles containing 24,776 identifiers representing various copy number alterations.
- miRNA: A dataset consisting of 743 identifiers, also log2-transformed for normalization.
- Protein Expression: Comprising 210 identifiers related to protein expression levels.
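The log2 normalization mentioned for the mRNA and miRNA data can be sketched as follows. This is a hedged illustration: the pseudocount of 1 (added so that zero values stay defined) and the function name `log2_transform` are assumptions, since the exact transform is not spelled out here.

```python
import math


def log2_transform(values, pseudocount=1.0):
    """Apply log2(x + pseudocount) to each expression value.

    The pseudocount avoids log2(0); its value of 1 is an assumption.
    """
    return [math.log2(value + pseudocount) for value in values]


if __name__ == "__main__":
    print(log2_transform([0, 1, 3, 7]))  # -> [0.0, 1.0, 2.0, 3.0]
```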

One of the primary challenges with this dataset, common to most omics data, is the imbalance in sample numbers across different tumor types. For instance, breast cancer is represented by over 1,200 samples, whereas cholangiocarcinoma has fewer than 50 samples.
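One common way to counter this kind of imbalance is to weight each class inversely to its frequency in the loss function. The sketch below shows that heuristic on toy counts. It is not confirmed that the project uses class weighting, and the function name `inverse_frequency_weights` and the sample counts are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count).

    Rare classes receive weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


if __name__ == "__main__":
    # Toy counts only: BRCA (breast cancer) is common, CHOL (cholangiocarcinoma) is rare.
    toy_labels = ["BRCA"] * 8 + ["CHOL"] * 2
    print(inverse_frequency_weights(toy_labels))  # -> {'BRCA': 0.625, 'CHOL': 2.5}
```

The resulting dictionary could then be turned into a per-class weight vector for a weighted cross-entropy loss.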

Code structure

Here we present the most important code files with descriptions:

src/multiomic_modeling
- data_hdf5: folder for the dataset HDF5 file; initially empty, then populated from the zip file.
- artifacts: folder where checkpoints are saved during training.
- models
  - models.py: base models.
  - base.py: base trainer configuration (including Wandb logging).
  - encoder.py: encoder module.
  - decoder.py: decoder module.
  - snn_transformer.py: PyTorch implementation of the SNN version of transformer.py.
  - trainer.py: training of the SNN model.
- loss_and_metrics.py: metrics computation.
- neurobenchOmics: NeuroBench plugin implementation for omics data.

How to use the code

The first step is to fork the repository into your own project.

Wandb (optional): Weights & Biases provides real-time logging of loss and accuracy during training. To enable it, open src/multiomic_modeling/models/base.py and make the following changes:

- Uncomment lines 120-131.
- Uncomment line 188.
- Uncomment line 257.
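For readers who prefer a switch over commenting code in and out, the sketch below shows one way optional logging could be wired up. It is not the repository's implementation: `make_logger`, its parameters, and the project name `"mot-snn"` are hypothetical. Only `wandb.init` and `wandb.log` are real Weights & Biases calls, and they are reached only when `use_wandb=True`.

```python
def make_logger(use_wandb=False, project="mot-snn"):
    """Return a log(metrics, step) function.

    With use_wandb=True, metrics go to Weights & Biases (requires the wandb
    package and an API key). Otherwise they are kept in memory in log.history.
    The project name is a placeholder, not the project's real one.
    """
    if use_wandb:
        import wandb  # imported lazily so the dependency stays optional

        wandb.init(project=project)

        def log(metrics, step):
            wandb.log(metrics, step=step)

    else:
        history = []

        def log(metrics, step):
            history.append((step, dict(metrics)))

        log.history = history  # expose the in-memory record for inspection

    return log


if __name__ == "__main__":
    log = make_logger(use_wandb=False)
    log({"loss": 0.5, "accuracy": 0.8}, step=1)
    print(log.history)
```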

The Google Colab notebook walks through the following steps:

- Environment setup: clone the repository and install the requirements.
- Dataset: prepare the dataset.
- Wandb: if enabled, enter your API key.
- Imports and model parameters: import the required libraries and set the model parameters.
- Download checkpoints and training: start training, optionally resuming from a checkpoint.
- Score and test with NeuroBench: evaluate the model's performance.
- Plots: plot the neurons' activity.
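Resuming from a checkpoint usually starts with locating the most recent checkpoint file. The sketch below does that for a folder such as artifacts. The helper name `latest_checkpoint` and the `*.ckpt` file pattern are assumptions, since the checkpoint naming scheme is not described here.

```python
from pathlib import Path


def latest_checkpoint(folder, pattern="*.ckpt"):
    """Return the most recently modified checkpoint in folder, or None if none exist.

    The "*.ckpt" pattern is an assumption about how checkpoints are named.
    """
    candidates = list(Path(folder).glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


if __name__ == "__main__":
    checkpoint = latest_checkpoint("src/multiomic_modeling/artifacts")
    if checkpoint is None:
        print("No checkpoint found; training would start from scratch.")
    else:
        print(f"Would resume from {checkpoint}")
```

A `None` result signals that training should start from scratch, which keeps the resume logic in one place.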

Future Directions

The next steps involve optimizing the integration of different omics types and expanding the model’s capability to handle more complex and heterogeneous datasets. Further research will focus on enhancing the spiking neural network architecture to improve its real-time processing capabilities. Other planned work includes:

- Expand the model configuration settings.
- Explore different hyperparameter sets.
- Use more datasets to enhance the model's capabilities.

Repository: AndreaSillano/MotSNN
