Replies: 13 comments 3 replies
-
Hi @rjb67, there is an R interface you could try (example script here: https://github.com/brianhie/scanorama/blob/master/bin/R/scanorama.R). Just keep in mind that Scanorama uses cell-by-feature matrices, whereas I think most R pipelines use feature-by-cell matrices. I'll leave this issue open in case others have had success with Scanorama and Seurat integration.
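To make the orientation point concrete, here is a toy sketch (hypothetical variable names, not part of the linked script):

```r
# Most R pipelines keep genes in rows and cells in columns;
# Scanorama expects the transpose: cells in rows, genes in columns.
mat <- matrix(1:6, nrow = 3,
              dimnames = list(paste0("gene", 1:3), paste0("cell", 1:2)))
mat_for_scanorama <- t(mat)  # now 2 cells x 3 genes
```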
-
I have some problems when I use Scanorama through the reticulate package in R. It seems that the check_datasets() function cannot work correctly with a "matrix" from R. I installed from pip. My code is:
Running it then produces an error:
How can I make an R "matrix" compatible with Scanorama?
-
Hi @zh542370159, sorry, I'm not really sure how reticulate and R treat the data types you are passing in. The script at https://github.com/brianhie/scanorama/blob/master/bin/R/scanorama.R should work with R version 3.5.1 and reticulate version 1.10; it should help you debug yours.
-
Thank you. It was my fault: the reticulate package cannot properly pass a named list of matrices to Python. In that case, Python only gets the names instead of the matrices in the list.
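For anyone hitting the same thing: reticulate converts a named R list to a Python dict and an unnamed list to a Python list, so one possible workaround is to strip the names before passing the data across (a sketch, assuming reticulate is installed and a Python environment is configured):

```r
library(reticulate)

mats <- list(a = matrix(1:4, nrow = 2), b = matrix(5:8, nrow = 2))

# A named list becomes a Python dict, so downstream Python code may see
# the keys rather than a plain sequence of matrices; dropping the names
# forces conversion to a Python list of arrays instead.
py_input <- r_to_py(unname(mats))
```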
-
@zh542370159
-
@Thegreatjoyce Hi, going from Seurat to Scanorama and back is quite doable. Three things are important: Scanorama wants cell-by-gene matrices (Seurat stores them gene-by-cell, so you must transpose), each dataset needs its own vector of gene names, and the two lists have to be in the same order.

So when you want to create your list of assays to pass to Scanorama:

```r
assaylist <- list()
genelist <- list()
for(i in 1:length(seuratobjectlist))
{
  assaylist[[i]] <- t(as.matrix(GetAssayData(seuratobjectlist[[i]], "data")))
  genelist[[i]] <- rownames(seuratobjectlist[[i]])
}
```

At this point you can run the Scanorama functions:

```r
integrated.data <- scanorama$integrate(assaylist, genelist)
corrected.data <- scanorama$correct(assaylist, genelist, return_dense = TRUE)
integrated.corrected.data <- scanorama$correct(assaylist, genelist,
                                               return_dimred = TRUE, return_dense = TRUE)
```

Then, to go back to Seurat:

```r
intdata <- lapply(integrated.corrected.data[[2]], t)
panorama <- do.call(cbind, intdata)
rownames(panorama) <- as.character(integrated.corrected.data[[3]])
colnames(panorama) <- unlist(sapply(assaylist, rownames))

intdimred <- do.call(rbind, integrated.corrected.data[[1]])
colnames(intdimred) <- paste0("PC_", 1:100)

# We also add standard deviations in order to draw elbow plots in Seurat
stdevs <- apply(intdimred, MARGIN = 2, FUN = sd)
```

Creating the Seurat object is trivial at this point, and we can skip normalization and variable feature selection as we already have our PCA embeddings from Scanorama:

```r
pan.seurat <- CreateSeuratObject(counts = panorama, assay = "pano", project = "yourproject")

# Adding metadata from all previous objects
pan.seurat@meta.data <- do.call(rbind, lapply(seuratobjectlist, function(x) x@meta.data))

# VERY IMPORTANT: make sure that the rownames of your metadata slot
# are the same as the colnames of your integrated expression matrix
rownames(pan.seurat@meta.data) <- colnames(pan.seurat)
rownames(intdimred) <- colnames(pan.seurat)

pan.seurat[["pca"]] <- CreateDimReducObject(embeddings = intdimred, stdev = stdevs,
                                            key = "PC_", assay = "pano")
```

At this point you can go forward with the usual workflow (don't keep the parameters I wrote, just use anything you see fit):

```r
pan.seurat <- FindNeighbors(pan.seurat, dims = 1:10)
pan.seurat <- FindClusters(pan.seurat)
markers <- FindAllMarkers(pan.seurat)
```

HTH
-
@gdagstn Thanks for the detailed walkthrough for going between Scanorama and Seurat! In your code you used the "data" slot from the Seurat objects to perform the Scanorama correction, and later used the corrected data as UMI counts to initiate a new Seurat object. Since the "data" slot contains the log-normalized data, would it be more reasonable to use the "counts" slot instead for the Scanorama correction? And after creating a new Seurat object with the corrected counts, is it legitimate to perform the usual downstream analyses on it?
-
Hi @xhbkirby, I think @brianhie is obviously the most suited to answer your questions. However, I can give you my 1.5 cents. Brian has answered in some issues (see #54 and #68) that all the usual preprocessing steps can (and should) be carried out before integration. In the context of your specific question, this makes sense, since you want to remove uninteresting sources of variation before the joint space is learned by Scanorama.

Now, if instead you want to use the corrected counts for differential expression, I think you would have to find a test that suits the distribution of those normalized, corrected counts, and I have not made any serious attempt myself. As an alternative that may be more robust to batches, I would consider looking at coexpression, which is implemented in other work by @brianhie.

HTH
-
Hi @gdagstn, thanks for your quick reply and your thoughts on differential expression. In fact, I saw in the Seurat forum that there is also a lot of discussion about whether to use integrated/corrected data or the original count data to perform DE, and the Seurat developers recommend the latter. Regarding the first part of your response, am I understanding correctly that you recommend removing unwanted variance (regression) on the raw counts or log-transformed data, and then using the "regressed" data as input for Scanorama? Please bear with me if any of my questions don't make much sense. I'm just not sure whether the Scanorama-corrected (and, I assume, also L2-normalized) output data can be log-transformed and centered/scaled again following the Seurat pipeline.
-
I think you can and should "regress out" unwanted sources of variation prior to integration. You would normally do that after computing HVGs for PCA in a normal workflow (i.e. without integration), and you do that, or something similar, when using other integration/batch correction methods. As I mentioned earlier, the reasoning is that you don't want spurious populations to appear as a consequence of retaining uninteresting sources of variation. The reply on issue #68 does not explicitly say not to use scaled/regressed data, and I personally don't see why it would be an issue to do so, but I may be wrong.

Going the other way around, i.e. your suggestion of correcting raw counts without integration and then applying log normalization on top of L2 normalization and then scaling the corrected data with regression, seems like a suboptimal choice, because what "regressing out" means in Seurat is taking the residuals of a linear (or Poisson, or negative binomial, or...) model fit to each gene. This means an assumption is made about the distribution, and it will most likely be violated by corrected, L2-normalized data. I have not tried this myself though, so again I may be wrong.

If you want, you can actually implement L2 normalization yourself in R and play with the pbmc3k dataset to see what happens when you apply different normalizations in different orders.
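As a hedged sketch of the idea (an illustration of per-cell L2 normalization, not Scanorama's actual implementation; assumes a cells-by-genes matrix):

```r
# Scale each cell (row) to unit Euclidean length.
l2_normalize <- function(X) {
  norms <- sqrt(rowSums(X^2))
  norms[norms == 0] <- 1  # leave all-zero cells untouched
  X / norms                # R recycles the vector row-wise here
}

set.seed(1)
X <- matrix(abs(rnorm(20)), nrow = 4)  # toy 4 cells x 5 genes
Xn <- l2_normalize(X)
rowSums(Xn^2)  # each row now has squared norm 1
```

You can apply this before or after log transformation on a small dataset and compare the resulting PCAs to see how the order of operations changes the geometry.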
Edit: to be honest, it's not entirely clear to me why one would calculate HVGs before scaling the data, since the sources of variation will change after regression. It's true that scaling only the HVGs saves time and memory when computing the PCA, but still. I saw there is a GitHub issue on this specific topic on which you also commented, so I'll be interested in the answer as well.
-
Hi @gdagstn, thanks for the clarification! I agree that the second workflow would most likely change the underlying distribution of the data and might make regression in Seurat not perform the way it should. I'll try the Scanorama correction with the scaled/regressed data and see how that works.
-
Hi @gdagstn, I am following your code; however, I am getting an error: It looks like your datasets are matrices from what I can see, but am I missing something? I am creating my datasets variable as follows:
Best wishes,
-
Hello @gdagstn! I am trying to use Scanorama in R with my own data and I am running into a problem. Everything goes well, but on the last line of code I get this:
Here is the code I am using, which is basically the one you posted before. I used two Seurat objects that I read in individually (M003 and M108607). Thanks!

```r
library(purrr)

extract_matrix <- function(seurat_object,  # (rest of the function was cut off)

results_list <- list()

ifnb.list <- lapply(X = results_list, FUN = SCTransform)
genelist <- lapply(ifnb.list, VariableFeatures)
datasets <- map2(.x = ifnb.list, .y = genelist, .f = extract_matrix)

integrated.data <- scanorama$integrate(datasets, genelist)
```
-
I am interested in trying out Scanorama to see how it fares with our data, but all of our processing and analysis so far has been done in Seurat in R. Are there any plans to integrate Scanorama in a way that makes it easily usable with Seurat? Or is this already possible and I have just not been able to figure it out?