Evaluating the role of pre-training dataset size and diversity on single-cell foundation model performance
This repository contains the code that accompanies our paper, "Evaluating the role of pre-training dataset size and diversity on single-cell foundation model performance". The preprint is available on bioRxiv at https://doi.org/10.1101/2024.12.13.628448.
In this project, we assess three model architectures pre-trained to perform as foundation models in the context of single-cell RNA-seq: scVI, SSL, and Geneformer. We pre-trained these models on subsets of the scTab corpus using three different downsampling schemes (uniform random downsampling, cell type re-weighting, and geometric sketching) and evaluated these models in (1) the zero-shot regime and (2) when fine-tuned.
Our evaluation uses two main tasks: cell type classification and batch integration. In these tasks, we compare the performance of scVI, SSL, and Geneformer against simple baselines and investigate the role of pre-training dataset size and diversity on downstream performance.
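As a concrete illustration of the cell type classification task, the sketch below fits a k-nearest-neighbors classifier on pre-computed cell embeddings. This is a minimal, hypothetical example (the embedding matrix, label vector, and hyperparameters are placeholders), not the evaluation code used in the paper, which lives in the eval directory.

```python
# Minimal sketch: cell type classification from pre-computed cell embeddings.
# `embeddings` (cells x latent_dim) and `cell_types` are placeholders; the
# actual evaluation scripts live in the eval/ directory.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))      # e.g., a model's latent space
cell_types = rng.integers(0, 10, size=1000)   # integer-coded cell type labels

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, cell_types, test_size=0.2, random_state=0
)
clf = KNeighborsClassifier(n_neighbors=15).fit(X_train, y_train)
print("macro F1:", f1_score(y_test, clf.predict(X_test), average="macro"))
```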
First, install the Python package dependencies (this can take 15+ minutes):

```bash
pip install -r requirements.txt
```
Next, install our fork of ssl_in_scg (this should only take about 2 minutes):

```bash
git clone https://github.com/v-mahughes/ssl_in_scg
cd ssl_in_scg
git fetch
git switch early-stopping
pip install -e .
cd ..  # return to the parent directory before the next install
```
Install our fork of Geneformer (this should only take about 20 seconds):

```bash
git clone https://github.com/lcrawlab/Geneformer
cd Geneformer
pip install .
cd ..  # return to the parent directory before the next install
```
Finally, install zero-shot-scfoundation (this should only take about 10 seconds):

```bash
git clone https://github.com/microsoft/zero-shot-scfoundation
cd zero-shot-scfoundation
pip install .
```
The instructions for downloading the scTab corpus are in the data/preprocess directory.
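Once downloaded, the data can be inspected with anndata. The file name and the .obs column used below are illustrative assumptions, not the exact paths produced by the preprocessing scripts in data/preprocess.

```python
# Quick look at a downloaded/preprocessed AnnData file.
# "sctab_subset.h5ad" and the "cell_type" obs column are assumptions for
# illustration; see data/preprocess for the actual file layout.
import anndata as ad

adata = ad.read_h5ad("sctab_subset.h5ad")
print(adata)                                     # dimensions plus obs/var fields
print(adata.obs["cell_type"].value_counts().head())
```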
The instructions for downsampling the scTab corpus to generate pre-training datasets are in the downsampling directory.
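For intuition, here is a rough sketch of two of the three downsampling schemes (uniform random downsampling and cell type re-weighting) using numpy. The cell type labels and sizes are placeholders; the actual scripts in the downsampling directory operate on the full scTab corpus and also implement geometric sketching.

```python
# Illustrative downsampling of a cell index, not the repository's actual scripts.
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_keep = 100_000, 10_000
cell_types = rng.integers(0, 50, size=n_cells)   # placeholder cell type labels

# (1) Uniform random downsampling: every cell is equally likely to be kept.
uniform_idx = rng.choice(n_cells, size=n_keep, replace=False)

# (2) Cell type re-weighting: sample inversely to cell type frequency so that
# rare cell types are enriched relative to the uniform scheme.
counts = np.bincount(cell_types)
weights = 1.0 / counts[cell_types]
weights /= weights.sum()
reweighted_idx = rng.choice(n_cells, size=n_keep, replace=False, p=weights)
```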
The instructions for pre-training all models are in the pretraining directory. Each model architecture has its own directory.
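As an example of what pre-training one of these architectures looks like, the following is a generic scvi-tools sketch; the AnnData file, batch key, and hyperparameters are placeholders, and the exact configurations used in the paper are in the pretraining directory.

```python
# Generic scvi-tools pre-training sketch; keys and hyperparameters are
# placeholders, not the configurations used in the paper.
import anndata as ad
import scvi

adata = ad.read_h5ad("sctab_subset.h5ad")        # hypothetical pre-training subset
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
model = scvi.model.SCVI(adata, n_latent=64)
model.train(max_epochs=50)

embeddings = model.get_latent_representation()   # cells x n_latent matrix
model.save("scvi_pretrained/", overwrite=True)
```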
The instructions for fine-tuning all models are in the finetuning directory. Each model architecture has its own directory.
The instructions for evaluating all models are in the eval directory. There are scripts for both zero-shot and fine-tuned evaluations.
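For the batch integration task, a simple proxy for batch mixing can be computed directly from the embeddings, e.g., a silhouette score on batch labels (values near zero or below suggest the batches are well mixed). This is only an illustrative sketch with placeholder arrays; the full set of metrics is computed by the scripts in the eval directory.

```python
# Illustrative batch-mixing check: silhouette score computed on batch labels.
# `embeddings` and `batch_labels` are placeholders for real model outputs.
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2000, 64))
batch_labels = rng.integers(0, 5, size=2000)

print("batch silhouette:", silhouette_score(embeddings, batch_labels))
```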
Jupyter notebooks that produce each of the figures (after running all model evaluations) are in the plotting directory.
If you have any questions or find any issues with the code, please open an issue in this repository. We also welcome contributions to the code; be sure to check out the Contributing section below.
If you have questions or concerns with this project and do not want to create an issue, please contact Alan DenAdel, Ava Amini, or Lorin Crawford. Any feedback on the software, manuscript, and tutorials is appreciated.
```bibtex
@article{DenAdel2024.12.13.628448,
  author = {DenAdel, Alan and Hughes, Madeline and Thoutam, Akshaya and Gupta, Anay and Navia, Andrew W. and Fusi, Nicolo and Raghavan, Srivatsan and Winter, Peter S. and Amini, Ava P. and Crawford, Lorin},
  title = {Evaluating the role of pre-training dataset size and diversity on single-cell foundation model performance},
  elocation-id = {2024.12.13.628448},
  year = {2024},
  doi = {10.1101/2024.12.13.628448},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2024/12/17/2024.12.13.628448},
  eprint = {https://www.biorxiv.org/content/early/2024/12/17/2024.12.13.628448.full.pdf},
  journal = {bioRxiv}
}
```
This project is available under the MIT License.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.