CoMBCR

Introduction

CoMBCR is an innovative B-cell embedding method designed to integrate multi-modal data from B cells, particularly BCRs and gene expressions, within a co-learning framework. By accepting paired BCR sequences and gene expression profiles as input, CoMBCR effectively integrates these two modalities to produce joint representations for each B cell, focusing specifically on the heavy chain of BCRs.

Prerequisites

CoMBCR is implemented in Python and requires a GPU for acceleration.

We recommend the following package versions:

PyTorch (2.4.1)
Transformers (4.41.2)
NumPy (1.26.4)
pandas (2.2.3)
scikit-learn (1.5.1)
huggingface_hub (install with: python3 -m pip install huggingface_hub)

Installation

Install CoMBCR using pip:

pip3 install CoMBCR

Then download the default pre-trained BCR encoder. This code only needs to be run once, after installing CoMBCR:

from CoMBCR.utils import download_BCRencoder
download_BCRencoder()

Tutorial

We provide a tutorial for the usage of CoMBCR.

Usage

Prepare input data

CoMBCR integrates BCRs and gene expression, and requires three input files: a BCR sequences file, a gene expression file, and a file containing BCR embeddings generated by a BCR encoder (e.g., AntiBERTa, ESM2).

Ensure each file includes an index column labeled "barcode", which serves as a unique identifier for each cell.

Verify that the cells are aligned in the same order across all three files.
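The alignment requirement can be checked before training. The sketch below is not part of CoMBCR: `check_alignment` is a hypothetical helper, and the toy tables stand in for files read with pandas using "barcode" as the index column.

```python
import pandas as pd


def check_alignment(bcr: pd.DataFrame, rna: pd.DataFrame, bcrori: pd.DataFrame) -> None:
    """Raise ValueError unless the three tables share identical, unique, ordered barcodes."""
    tables = (("BCR", bcr), ("gene expression", rna), ("BCR embedding", bcrori))
    for name, table in tables:
        if table.index.name != "barcode":
            raise ValueError(f"{name} table is not indexed by 'barcode'")
        if not table.index.is_unique:
            raise ValueError(f"{name} table contains duplicate barcodes")
    # Index.equals compares both the values and their order.
    if not (bcr.index.equals(rna.index) and bcr.index.equals(bcrori.index)):
        raise ValueError("barcodes differ or are ordered differently across files")


# Toy tables standing in for pd.read_csv(path, index_col="barcode").
barcodes = pd.Index(["cell1", "cell2", "cell3"], name="barcode")
bcr = pd.DataFrame({"cdr3": ["ARDY", "ARGG", "AKDL"]}, index=barcodes)
rna = pd.DataFrame({"geneA": [0.0, 1.2, 0.3]}, index=barcodes)
bcrori = pd.DataFrame({"0": [0.1, 0.2, 0.3]}, index=barcodes)

check_alignment(bcr, rna, bcrori)  # passes silently

# If the barcodes agree but the order does not, reorder one table to match another.
rna_shuffled = rna.iloc[[2, 0, 1]]
rna_fixed = rna_shuffled.loc[bcr.index]
check_alignment(bcr, rna_fixed, bcrori)
```

Reordering with `.loc` only works when every barcode is present in both tables; a missing barcode raises a KeyError, which is itself a useful signal that the files do not describe the same cells.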

BCR sequences file

This CSV file should include an index column named "barcode" and columns labeled "fwr1", "cdr1", "fwr2", "cdr2", "fwr3", "cdr3" and "fwr4". The file should resemble the example shown below:
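As a rough illustration of this layout, the sketch below builds a one-row table with the required columns and checks that none are missing. The amino-acid fragments are invented placeholders rather than real data, and `missing_bcr_columns` is a hypothetical helper, not a CoMBCR function.

```python
import pandas as pd

REQUIRED_COLUMNS = ["fwr1", "cdr1", "fwr2", "cdr2", "fwr3", "cdr3", "fwr4"]


def missing_bcr_columns(table: pd.DataFrame) -> list:
    """Return the required region columns that are absent from the table."""
    return [col for col in REQUIRED_COLUMNS if col not in table.columns]


# Placeholder heavy-chain fragments, for illustration only.
example = pd.DataFrame(
    {
        "fwr1": ["QVQLVQSGAEVKKPGAS"],
        "cdr1": ["GYTFTSYY"],
        "fwr2": ["MHWVRQAPGQGLEWMGI"],
        "cdr2": ["INPSGGST"],
        "fwr3": ["SYAQKFQGRVTMTRDTS"],
        "cdr3": ["ARDYGDYFDY"],
        "fwr4": ["WGQGTLVTVSS"],
    },
    index=pd.Index(["cell1"], name="barcode"),
)

print(missing_bcr_columns(example))  # []
print(example.to_csv())  # "barcode" is written as the first column
```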

Gene expression file

Normalization and log-transformation are recommended. Batch effect removal is advisable if applicable. We suggest using the top 5,000 highly variable genes, though you can select input genes according to your criteria.
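The recommended preprocessing can be sketched with NumPy and pandas alone. This is an illustration under stated assumptions, not the authors' pipeline: each cell is scaled to a fixed total (10,000 here, a common but arbitrary choice), log1p is applied, and genes are ranked by plain variance as a simple stand-in for a dedicated highly-variable-gene method. `preprocess_counts` is a hypothetical helper, and batch effect removal is not covered.

```python
import numpy as np
import pandas as pd


def preprocess_counts(counts: pd.DataFrame, n_top_genes: int = 5000,
                      target_sum: float = 1e4) -> pd.DataFrame:
    """Normalize per cell, log-transform, and keep the most variable genes.

    `counts` is a cells x genes table of raw counts indexed by "barcode".
    """
    totals = counts.sum(axis=1).replace(0, 1)  # avoid dividing by zero for empty cells
    normalized = counts.div(totals, axis=0) * target_sum
    logged = np.log1p(normalized)
    # Rank genes by variance across cells (assumption: variance as the HVG criterion).
    variances = logged.var(axis=0)
    top = set(variances.sort_values(ascending=False, kind="stable").index[:n_top_genes])
    keep = [gene for gene in logged.columns if gene in top]  # preserve column order
    return logged[keep]


# Toy count matrix: 6 cells x 8 genes.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.poisson(5, size=(6, 8)),
    index=pd.Index([f"cell{i}" for i in range(6)], name="barcode"),
    columns=[f"gene{j}" for j in range(8)],
)
processed = preprocess_counts(toy, n_top_genes=4)
print(processed.shape)  # (6, 4)
```

The number of columns kept here is the value to pass as encoderprofile_in_dim when it differs from 5000.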

Original BCR embeddings file

Clone or download "runberta.py" from this GitHub repository (deepomicslab/CoMBCR). The script encodes the BCRs, and the resulting embeddings are used to measure the original distances between BCRs. We recommend using our default pre-trained encoder, though any encoder can be used to encode BCRs.

python3 runberta.py --datapath "exampledata/example_bcr.csv" --outdir "example_outdir" --outfilename "antiberta_embedding.csv"

The command writes the original BCR embedding file, "antiberta_embedding.csv", to the output directory.

Quick run

To quickly run CoMBCR, use the following code:

from CoMBCR.CoMBCR import CoMBCR_main
bcremb, gexemb = CoMBCR_main(bcrpath="exampledata/example_bcr.csv",
                             rnapath="exampledata/example_rna.csv",
                             bcroriginal="exampledata/example_bcrori.csv",
                             outdir="example_outdir",
                             epochs=1,
                             batch_size=32,
                             encoderprofile_in_dim=5000)

This code returns NumPy arrays for the BCR embeddings and the gene expression embeddings, and writes "bcrembedding.csv" and "gexembedding.csv" to the specified output directory.

Please note that these CSV files store the NumPy arrays directly and therefore do not include a "barcode" column. When reading these files, do not specify an index column.
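Because the output CSVs carry no barcodes, one way to restore them is to reuse the barcodes of the input file. The sketch below simulates this with a temporary file. It rests on two assumptions that should be checked against your own output: that the CSV has no header row, and that its rows follow the order of the input cells. `attach_barcodes` is a hypothetical helper, not part of CoMBCR.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def attach_barcodes(embedding_csv: str, barcodes: pd.Index) -> pd.DataFrame:
    """Read an embedding CSV that has no index column and label its rows with barcodes."""
    # Assumption: the file has no header row; change header= if yours does.
    emb = pd.read_csv(embedding_csv, header=None)
    if len(emb) != len(barcodes):
        raise ValueError("number of embedding rows does not match number of barcodes")
    emb.index = barcodes
    return emb


# Simulated output file; in practice this would be "bcrembedding.csv" in outdir,
# and the barcodes would come from the BCR sequences file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bcrembedding.csv")
    np.savetxt(path, np.arange(6, dtype=float).reshape(3, 2), delimiter=",")
    barcodes = pd.Index(["cell1", "cell2", "cell3"], name="barcode")
    labeled = attach_barcodes(path, barcodes)

print(labeled.shape)  # (3, 2)
```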
Parameters of CoMBCR

Parameter Description

bcrpath (Required) The path to the BCR sequences file.

rnapath (Required) The path to the gene expression file.

bcroriginal (Required) The path to the BCR original embedding file.

outdir (Required) The directory where the best checkpoint file and the output embeddings will be stored.

checkpoint Default is "best_network.pth". This parameter specifies the name of the file where the best model checkpoint will be saved.

lr Default is 1e-5.

lam Default is 1e-1, the inner parameter (Parameter alpha in the paper).

batch_size Default is 256.

epochs Default is 200.

patience Default is 15, the patience for early stopping.

lr_step Default is [30,100]. These are the milestones for the MultiStepLR setting, which adjusts the learning rate at specified epochs.

encoderprofile_in_dim Default is 5000. Adjust this parameter if the number of input genes differs from 5000.

separatebatch The default is False. If set to True, BCRs from different samples will be treated as distinct BCRs. Ensure that your BCR input file contains a "sample" column if you choose to enable this option.
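To make the training-related parameters more concrete, the sketch below shows how lr_step milestones and patience typically behave. It is a generic, pure-Python illustration and not CoMBCR's training loop: the decay factor gamma=0.1 is the default of PyTorch's MultiStepLR and only an assumption here, the loss values are invented, and `multistep_lr` and `EarlyStopper` are hypothetical names.

```python
from bisect import bisect_right


def multistep_lr(epoch, base_lr=1e-5, milestones=(30, 100), gamma=0.1):
    """Learning rate in effect during `epoch` (0-based) for a MultiStepLR-style schedule."""
    reached = bisect_right(sorted(milestones), epoch)  # milestones at or before this epoch
    return base_lr * gamma ** reached


class EarlyStopper:
    """Stop when the monitored loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=15):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Record one epoch's loss and return True if training should stop."""
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0  # improvement: a best checkpoint would be saved here
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Learning rate around the default milestones [30, 100].
for epoch in (0, 29, 30, 99, 100):
    print(epoch, multistep_lr(epoch))

# Early stopping on invented losses with a short patience.
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate([1.0, 0.8, 0.9, 0.85, 0.81, 0.7]):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at)  # 4
```

With the defaults, the learning rate stays at 1e-5 for epochs 0 to 29 and is multiplied by gamma at each milestone. In the toy run, training stops after three consecutive epochs without improvement over the best loss of 0.8, so the final value in the list is never reached.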

Acknowledgements

The code was based in part on the source code of UniTCR.

Questions

If you encounter issues installing or using CoMBCR, please feel free to open an issue or contact me via email.
