CoMBCR is a B-cell embedding method that integrates multi-modal data from B cells, specifically BCR sequences and gene expression, within a co-learning framework. It takes paired BCR sequences and gene expression profiles as input and produces a joint representation for each B cell. On the BCR side, it uses only the heavy chain.
CoMBCR is implemented in Python and requires a GPU for acceleration.
We recommend the versions of the following packages:
- PyTorch (2.4.1)
- Transformers (4.41.2)
- NumPy (1.26.4)
- Pandas (2.2.3)
- Scikit-learn (1.5.1)
- huggingface_hub, which can be installed with:
python3 -m pip install huggingface_hub
Install CoMBCR using pip:
pip3 install CoMBCR
Then download the default pre-trained encoder (this code only needs to be executed once, when you install CoMBCR):
from CoMBCR.utils import download_BCRencoder
download_BCRencoder()
We provide a tutorial for the usage of CoMBCR.
CoMBCR integrates BCRs and gene expression, and requires three input files: a BCR sequences file, a gene expression file, and a file containing BCR embeddings generated by a BCR encoder (e.g., AntiBERTa, ESM2).
- Ensure each file includes an index column labeled "barcode", which serves as a unique identifier for each cell.
- Verify that the cells are aligned in the same order across all three files.
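The two checks above can be sketched with pandas. This is a minimal illustration, not part of CoMBCR: the helper `check_barcode_alignment` and the tiny in-memory tables are hypothetical stand-ins for the three input files.

```python
import io
import pandas as pd

def check_barcode_alignment(*frames: pd.DataFrame) -> bool:
    """Return True if barcodes are unique and in the same order in every frame."""
    reference = frames[0].index
    if not reference.is_unique:
        return False
    return all(frame.index.equals(reference) for frame in frames[1:])

# Hypothetical stand-ins for the BCR, gene expression, and BCR embedding files.
bcr = pd.read_csv(io.StringIO("barcode,cdr3\ncell1,ARDY\ncell2,ARGG\n"), index_col="barcode")
rna = pd.read_csv(io.StringIO("barcode,geneA\ncell1,0.5\ncell2,1.2\n"), index_col="barcode")
emb = pd.read_csv(io.StringIO("barcode,dim0\ncell2,0.1\ncell1,0.3\n"), index_col="barcode")

aligned_pair = check_barcode_alignment(bcr, rna)       # same cells, same order
aligned_all = check_barcode_alignment(bcr, rna, emb)   # emb is in a different order
emb_fixed = emb.loc[bcr.index]                         # reorder emb to follow the BCR file
aligned_after_fix = check_barcode_alignment(bcr, rna, emb_fixed)
```

With real data, replace the `io.StringIO(...)` arguments with the paths to your own files.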
The BCR sequences file is a CSV file that should include an index column named "barcode" and columns labeled "fwr1", "cdr1", "fwr2", "cdr2", "fwr3", "cdr3" and "fwr4". The file should resemble the example shown below:
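As a quick sanity check on the BCR sequences file, the sketch below reports which of the required region columns are missing. The helper `missing_bcr_columns` is hypothetical, and the sequences are placeholder strings rather than real BCR regions.

```python
import io
import pandas as pd

REQUIRED_REGIONS = ["fwr1", "cdr1", "fwr2", "cdr2", "fwr3", "cdr3", "fwr4"]

def missing_bcr_columns(frame: pd.DataFrame) -> list:
    """List the required region columns that are absent from the frame."""
    return [column for column in REQUIRED_REGIONS if column not in frame.columns]

# Placeholder content; a real file would hold amino-acid sequences per region.
csv_text = (
    "barcode,fwr1,cdr1,fwr2,cdr2,fwr3,cdr3,fwr4\n"
    "cell1,AAA,CCC,DDD,EEE,FFF,GGG,HHH\n"
)
bcr = pd.read_csv(io.StringIO(csv_text), index_col="barcode")

missing_complete = missing_bcr_columns(bcr)                       # nothing missing
missing_partial = missing_bcr_columns(bcr.drop(columns="fwr4"))   # fwr4 removed on purpose
```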
For the gene expression file, normalization and log-transformation are recommended, and batch effect removal is advisable where applicable. We suggest using the top 5,000 highly variable genes, though you can select input genes according to your own criteria.
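One possible way to prepare such a matrix is sketched below. It is a simplified stand-in, not the preprocessing used by the authors: it assumes a cells-by-genes table of raw counts, scales each cell to a fixed total (`target_sum` is an assumed value), applies `log1p`, and ranks genes by plain variance instead of a dedicated highly-variable-gene method. The function name `preprocess_counts` is hypothetical.

```python
import numpy as np
import pandas as pd

def preprocess_counts(counts: pd.DataFrame, n_top_genes: int = 5000,
                      target_sum: float = 1e4) -> pd.DataFrame:
    """Library-size normalise, log1p-transform, and keep the most variable genes."""
    totals = counts.sum(axis=1).replace(0, 1)            # avoid dividing by zero for empty cells
    normalised = counts.div(totals, axis=0) * target_sum
    logged = np.log1p(normalised)
    n_keep = min(n_top_genes, logged.shape[1])
    top_genes = logged.var(axis=0).nlargest(n_keep).index
    return logged.loc[:, logged.columns.isin(top_genes)]  # keep the original gene order

# Synthetic counts: 6 cells by 10 genes, indexed by barcode.
rng = np.random.default_rng(0)
counts = pd.DataFrame(
    rng.poisson(2.0, size=(6, 10)),
    index=pd.Index([f"cell{i}" for i in range(6)], name="barcode"),
    columns=[f"gene{j}" for j in range(10)],
)
processed = preprocess_counts(counts, n_top_genes=4)
```

The number of columns kept here should match the `encoderprofile_in_dim` value passed to CoMBCR.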
To create the BCR embedding file, clone or download "runberta.py" from this GitHub repository. This script generates the original BCR embeddings, which are used to measure the original distances between BCRs. We recommend using our default pre-trained encoder, though any encoder can be used to encode BCRs.
python3 runberta.py --datapath "exampledata/example_bcr.csv" --outdir "example_outdir" --outfilename "antiberta_embedding.csv"
This command generates the original BCR embedding file, "antiberta_embedding.csv", in the specified output directory.
To quickly run CoMBCR, use the following code:
from CoMBCR.CoMBCR import CoMBCR_main
bcremb, gexemb = CoMBCR_main(bcrpath="exampledata/example_bcr.csv", rnapath="exampledata/example_rna.csv", bcroriginal="exampledata/example_bcrori.csv", outdir="example_outdir", epochs=1, batch_size=32, encoderprofile_in_dim=5000)
This code returns numpy arrays for BCR embeddings and gene expression embeddings, and outputs "bcrembedding.csv" and "gexembedding.csv" in the specified output directory.
Please note that these CSV files directly store the numpy arrays and, as such, do not include any "barcode" column. When reading these files, ensure that you do not specify any index column.
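Because the outputs carry no barcodes, you may want to re-attach them yourself. The sketch below assumes that the rows of the returned arrays follow the same cell order as the input files; the helper `attach_barcodes` is hypothetical, and the small array stands in for a real `bcremb`.

```python
import numpy as np
import pandas as pd

def attach_barcodes(embedding: np.ndarray, barcodes) -> pd.DataFrame:
    """Label embedding rows with barcodes, assuming the order matches the input files."""
    barcodes = pd.Index(barcodes, name="barcode")
    if embedding.shape[0] != len(barcodes):
        raise ValueError("number of embedding rows and number of barcodes differ")
    return pd.DataFrame(embedding, index=barcodes)

# Stand-in for the array returned by CoMBCR_main (3 cells, 2 dimensions).
bcremb = np.arange(6, dtype=float).reshape(3, 2)
barcodes = ["cell1", "cell2", "cell3"]  # e.g. the "barcode" index of the BCR sequences file

bcr_table = attach_barcodes(bcremb, barcodes)
```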
| Parameter | Description |
| --- | --- |
| bcrpath | (Required) The path to the BCR sequences file. |
| rnapath | (Required) The path to the gene expression file. |
| bcroriginal | (Required) The path to the BCR original embedding file. |
| outdir | (Required) The directory where the best checkpoint file and the output embeddings will be stored. |
| checkpoint | Default is "best_network.pth". This parameter specifies the name of the file where the best model checkpoint will be saved. |
| lr | Default is 1e-5. |
| lam | Default is 1e-1, the inner parameter (Parameter alpha in the paper). |
| batch_size | Default is 256. |
| epochs | Default is 200. |
| patience | Default is 15, the patience for early stopping. |
| lr_step | Default is [30,100]. These are the milestones for the MultiStepLR setting, which adjusts the learning rate at specified epochs. |
| encoderprofile_in_dim | Default is 5000. Adjust this parameter if the number of input genes differs from 5000. |
| separatebatch | Default is False. If set to True, BCRs from different samples will be treated as distinct BCRs. Ensure that your BCR input file contains a "sample" column if you choose to enable this option. |
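To illustrate how lr_step milestones interact with lr, the sketch below reproduces the closed-form MultiStepLR rule in plain Python. The decay factor `gamma=0.1` is PyTorch's default and is an assumption here; the value CoMBCR actually uses is not stated in this document.

```python
from bisect import bisect_right

def multistep_lr(base_lr: float, milestones, epoch: int, gamma: float = 0.1) -> float:
    """Learning rate at a given epoch: decay by gamma once per milestone already reached."""
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)

# Defaults from the table: lr=1e-5, lr_step=[30, 100]; gamma is assumed.
schedule = {epoch: multistep_lr(1e-5, [30, 100], epoch) for epoch in (0, 29, 30, 99, 100)}
for epoch, value in schedule.items():
    print(f"epoch {epoch:3d}: lr = {value:.1e}")
```

Under this assumption the learning rate stays at its base value until epoch 30, drops by a factor of ten there, and drops again at epoch 100.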
The code was based in part on the source code of UniTCR.
If you encounter issues installing or using CoMBCR, please feel free to open an issue or contact me via email.