Single-cell Annotation and Fusion with Adversarial Open-Set Domain Adaptation Reliable for single-cell multi-omics Data Integration
SAFAARI (Single-cell Annotation and Fusion with Adversarial Open-Set Domain Adaptation Reliable for single-cell multi-omics Data Integration) is a method for integrating and annotating single-cell multi-omics data. It removes batch effects, adapts to new cell types, and improves cross-modality single-cell analysis. It supports both open-set and closed-set annotation.
- Open-Set & Closed-Set Adaptation: Supports both settings; open-set mode handles cell types in the target dataset that are absent from the source.
- Batch Effect Removal: Uses adversarial learning to mitigate batch effects.
- Class Imbalance Handling: Uses SMOTE oversampling to balance training data.
- Novel Cell Type Detection: Identifies unknown cell types in new datasets.
- Cross-Modality/Cross-species Integration
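The SMOTE oversampling mentioned in the feature list can be illustrated with a minimal sketch. This is not SAFAARI's implementation (in practice a library such as imbalanced-learn would normally be used); the function name `smote_like_oversample` and the toy data are hypothetical. It only shows the core idea: each synthetic sample is an interpolation between a minority-class sample and one of its nearest minority-class neighbours.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours.

    Illustrative only: a hypothetical helper, not SAFAARI's own code.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        t = rng.random()
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy example: three cells of a rare type, two features each.
rare_cells = [[1.0, 2.0], [1.5, 1.8], [0.8, 2.4]]
new_cells = smote_like_oversample(rare_cells, n_new=4)
print(len(new_cells))
```

Because every synthetic point lies on a segment between two real minority samples, the new cells stay inside the region already occupied by that cell type.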
To install SAFAARI, run the following commands:
git clone https://github.com/VafaeeLab/SAFAARI.git
cd SAFAARI
pip install -e .
You can run SAFAARI using the command-line interface (CLI):
safaari-supervised_integration
Closed-set mode:
safaari-unsupervised --open_set False
Open-set mode:
safaari-unsupervised --open_set True
If you want to obtain the Supervised Integration Results for the following datasets:
- Ovary (RNA-ATAC)
- PBMC (ADT-SCT)
- SEURAT_PBMC (RNA-ATAC)
You can download the required datasets from the Dataset Download link.
After downloading, modify main_integration.py and uncomment the relevant dataset section to run the integration process.
You can specify additional parameters:
safaari-unsupervised --open_set True --epochs 500 --batch_size 512 --cuda 0
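If runs are launched from a script, the same command can be assembled programmatically. The sketch below uses only the flags shown above (`--open_set`, `--epochs`, `--batch_size`, `--cuda`); the helper name `build_safaari_command` is hypothetical, and its default values simply mirror the example above rather than any documented defaults. It assumes `safaari-unsupervised` is on the PATH after installation.

```python
import shlex
import subprocess


def build_safaari_command(open_set=True, epochs=500, batch_size=512, cuda=0):
    """Return the argv list for one safaari-unsupervised run.

    Hypothetical helper; only flags that appear in this README are used.
    """
    return [
        "safaari-unsupervised",
        "--open_set", str(open_set),
        "--epochs", str(epochs),
        "--batch_size", str(batch_size),
        "--cuda", str(cuda),
    ]


cmd = build_safaari_command(open_set=False, epochs=200)
print(shlex.join(cmd))  # show the command before running it

# Uncomment to actually launch the run (requires SAFAARI to be installed):
# subprocess.run(cmd, check=True)
```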
SAFAARI expects input data in CSV format, where:
- Each row represents a single cell
- Each column represents a gene
- The first column must be labeled 'cell types' and contain cell-type annotations
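A quick layout check before training can catch formatting problems early. The following is a minimal sketch, not part of SAFAARI: the helper name `check_input_csv` and the gene names in the toy data are hypothetical, and it assumes a header row plus numeric expression values in every column after the first.

```python
import csv
import io


def check_input_csv(handle):
    """Check the layout described above and return (n_cells, n_features).

    Hypothetical helper, not part of SAFAARI. Assumes a header row whose
    first field is exactly 'cell types' and numeric values elsewhere.
    """
    reader = csv.reader(handle)
    header = next(reader)
    if header[0] != "cell types":
        raise ValueError(f"first column is {header[0]!r}, expected 'cell types'")
    n_cells = 0
    for row in reader:
        if len(row) != len(header):
            raise ValueError(f"row {n_cells + 1} has {len(row)} fields")
        [float(v) for v in row[1:]]  # raises ValueError on non-numeric data
        n_cells += 1
    return n_cells, len(header) - 1


# Toy file: two cells (rows), two genes (columns), labels in the first column.
example = io.StringIO(
    "cell types,Gene1,Gene2\n"
    "B cell,0.0,1.2\n"
    "T cell,3.4,0.0\n"
)
print(check_input_csv(example))
```

For a real file, pass an open file object instead of the in-memory example, e.g. `open(path, newline="")`.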
SAFAARI utilizes a subset of the Tabula Muris cell atlas, containing seven tissues:
- Bladder, Kidney, Heart, Mammary Gland, Muscle, Bone Marrow, Spleen
Data from FACS (Fluorescence-Activated Cell Sorting) serves as the source domain, while 10x Genomics data serves as the target domain. The structured datasets are stored in data/FACS10X/{Tissue_name}. Each dataset follows the naming convention:
- dataset_source_CmnGenes.csv (e.g., Bladder_FACS_CmnGenes.csv)
- dataset_target_CmnGenes.csv (e.g., Bladder_10X_CmnGenes.csv)
These files contain normalized gene expression values for the highly variable genes common to both domains.
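The naming convention can be turned into paths with a few lines of code. This is a sketch under stated assumptions: the helper name `domain_files` is hypothetical, and it assumes the files sit directly inside data/FACS10X/{Tissue_name}/ with a file prefix equal to the folder name, as the Bladder example suggests.

```python
from pathlib import Path


def domain_files(tissue, root="data/FACS10X"):
    """Return (source_path, target_path) for one tissue.

    Hypothetical helper. Assumes the files sit directly inside
    data/FACS10X/{tissue}/ and that the file prefix equals the folder name.
    """
    folder = Path(root) / tissue
    source = folder / f"{tissue}_FACS_CmnGenes.csv"  # FACS = source domain
    target = folder / f"{tissue}_10X_CmnGenes.csv"   # 10x = target domain
    return source, target


source, target = domain_files("Bladder")
print(source.as_posix())
print(target.as_posix())
```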
Unsupervised results will be stored in data/results/{Tissue_name}/:
- source_embeddings.csv (processed source domain features)
- target_embeddings.csv (processed target domain features)
- FACS_to_10X_labels.csv or FACS_to_10X_labels_op.csv (open-set vs. closed-set labels)
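After a run, it can be useful to confirm that the expected files were written. The sketch below is illustrative only: the helper name `expected_unsupervised_outputs` is hypothetical, and it assumes that the `_op` suffix marks the open-set label file, which the file list above suggests but does not state explicitly.

```python
from pathlib import Path


def expected_unsupervised_outputs(tissue, open_set, root="data/results"):
    """List the result files described above for one run.

    Hypothetical helper. ASSUMPTION: the '_op' suffix marks open-set labels.
    """
    folder = Path(root) / tissue
    labels = "FACS_to_10X_labels_op.csv" if open_set else "FACS_to_10X_labels.csv"
    return [
        folder / "source_embeddings.csv",
        folder / "target_embeddings.csv",
        folder / labels,
    ]


# Report which of the expected files exist for an open-set Bladder run.
for path in expected_unsupervised_outputs("Bladder", open_set=True):
    status = "found" if path.exists() else "missing"
    print(f"{path.as_posix()}: {status}")
```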
Supervised integration results will be stored in SAFAARI_supervised_Integration_Resultsdata/{Tissue_name}/:
- dataset_FACS_to_10X_embeddings_op_source.csv (supervised source embeddings)
- dataset_FACS_to_10X_embeddings_op_target.csv (supervised target embeddings)
Example commands:
safaari-unsupervised --open_set True
safaari-supervised_integration
If you use SAFAARI in your research, please cite:
@article{Aminzadeh2024SAFAARI,
author = {Fatemeh Aminzadeh and Jun Wu and Jingrui He and Morteza Saberi and Fatemeh Vafaee},
title = {Single-Cell Data Integration and Cell Type Annotation through Contrastive Adversarial Open-set Domain Adaptation},
journal = {bioRxiv},
year = {2024},
doi = {10.1101/2024.10.04.616599},
publisher = {Cold Spring Harbor Laboratory}
}
You can also find the preprint at: https://doi.org/10.1101/2024.10.04.616599.