scROCK is a deep-learning based algorithm for correcting cluster labels for scRNA-seq data, based on Xinchuan Zeng and Tony R. Martinez. 2001. An algorithm for correcting mislabeled data. Intell. Data Anal. 5, 6 (December 2001), 491–502.
scROCK uses a supervised machine learning algorithm called ADE (automatic data enhancement). It is based on ability of neural networks to remember first simple and correct class labels and then incorrect ones. During the training of a small neural network, scROCK maintains smoothed class probabilities for every sample, and changes dataset labels according to them.
Final class labels after a fixed number of epochs or batches are considered as "fixed" and returned by scrock
call.
pip install https://github.com/dos257/scROCK/tarball/master
For private repository use:
pip install git+https://{token}@github.com/dos257/scROCK.git
Use keys --upgrade --no-deps --force-reinstall
for forced update from the git repository.
If X
is log1p-preprocessed numpy.array
of shape (n_samples, n_genes)
and y
is integer clustering labels (from Leiden algorithm),
from scrock import scrock
y_fixed = scrock(X, y)
For convenience, scrock
supports a simplified command line:
python3 -m scrock refine_clusters data.h5ad
or find_doublets
instead of refine_clusters
For the refine_clusters
task, from file (here data.h5ad
) scrock
tries to read (in that order) .obs["seurat_clusters"]
, .obs["leiden"]
, .obs["cell_line"]
.
Also, this command line could be run inside Docker.
Build Docker image:
docker build --tag scrock-image .
Run Docker image passing host path with input file:
docker run --name scrock --volume /host-path-to-input/data:/data scrock-image refine_clusters /data/sce_sc_10x_5cl_qc.h5ad
Output will be written to stdout.
Library scrock
is included in a large Docker image scMultiFlow
, with Anaconda, RStudio, and bioinformatics libraries. To use it, run
docker pull bansalvi/scmultiflow
docker run --detach -p 18000:8000 -p 18787:8787 \
--volume "$(realpath ..)/data":/data \
--name scmultiflow-container bansalvi/scmultiflow
If code consumes a high CPU percent (but still works slowly), try:
torch.set_num_threads(1)
Torch imperfect CPU parallelization spends most of the time in thread synchronization and slows down the whole process.
Main entry point of the library is the scrock
function:
def scrock(
X, y,
D = 0.9,
l_ps = [1.0, 1.25, 1.5],
net_factory = None,
label_update_factory = None,
optimizer = 'adam', optimizer_lr = 0.0001, n_epochs = 40, n_batches = None, batch_size = 32,
batch_scheme = 'random-shuffle',
voting_scheme = voting_scheme_max_p_sum,
verbose = 1,
check_data = True,
seed = 0,
return_proba = False,
):
This function runs the scROCK ensemble algorithm, using X
and y
as the input dataset. It averages ADE algorithm results (3 by default, with parameters from l_ps
) by voting_scheme
.
Parameter net_factory
configures neural network, parameters optimizer
, optimizer_lr
, n_epochs
, n_batches
, batch_size
, batch_scheme
how it trains, parameter label_update_factory
how algorithm updates class labels during training.
Parameter verbose
controls debug output. Parameter seed
is used for random generator initialization for ensemble items (by seed + ialgo
).
If parameter return_proba
is True
, scrock
returns y_pred, y_pred_proba
instead of y_pred
.
Function scrock
wraps methods of class ADEReClassifier
:
class ADEReClassifier(BaseReClassifier):
def __init__(
self,
l_p = 1.0,
D = 0.9,
net = None,
label_update = None,
label_update_first_update = 300,
optimizer = 'adam', optimizer_lr = 1e-4, n_epochs = 40, n_batches = None, batch_size = 32, batch_scheme = 'random-shuffle',
add_on_each_batch = [],
collect = [],
verbose = 0,
seed = 0,
):
...
def fit(self, X, y):
...
def predict(self):
...
def predict_proba(self):
...
scROCK was developed under the supervision of Dr. Vikas Bansal (Head of Biomedical Data Science Group at German Center for Neurodegenerative Diseases (DZNE), Tübingen).