log-normalized or raw gene expression counts as input to Scanorama #113

antonioggsousa · 2022-01-20T11:23:46Z

Thank you and your colleagues for developing scanorama!
I'm testing it through a few "dummy" examples and I'm delighted with the results.

I read the paper as well as one of the tutorials mentioned in the github README.md file.

To test scanorama, I ran it on a few toy data sets in addition to one example data set highlighted in the scanorama repository. When I started with the toy data sets, I provided scaled counts to scanorama by mistake, because I am not very familiar with scanpy, anndata and Python in general. I therefore checked the paper and the tutorial again to find out which input scanorama requires. The tutorial mentions log-normalized gene expression counts at some point, whereas the paper mentions that L2 normalization is performed internally. If I understood correctly, the L2 normalization aims to bring the cells to the same scale, i.e., to unit norm, so its application does not necessarily depend on previous normalization. My question is then: which should ideally be the input to scanorama, log-normalized or raw counts?
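To make the unit-norm point concrete, here is a minimal NumPy sketch. It is not Scanorama's own code: the helper `l2_normalize_rows` and the toy matrix are made up for illustration. It shows that per-cell L2 normalization removes a constant per-cell scale factor (such as sequencing depth), but that a log1p transform changes the direction of a cell's vector, so raw and log-transformed inputs are not interchangeable even after L2 normalization.

```python
import numpy as np

def l2_normalize_rows(x):
    """Scale each row (cell) to unit Euclidean norm; all-zero rows stay zero."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty cells
    return x / norms

# Toy matrix (hypothetical): 2 cells x 3 genes.
# The second cell is the first one multiplied by 10 ("sequenced 10x deeper").
raw = np.array([[1.0, 2.0, 2.0],
                [10.0, 20.0, 20.0]])

raw_unit = l2_normalize_rows(raw)
log_unit = l2_normalize_rows(np.log1p(raw))

# A constant per-cell factor disappears after L2 normalization.
depth_invariant = np.allclose(raw_unit[0], raw_unit[1])

# The log transform is nonlinear, so the same cell points in a different
# direction depending on whether it was log-transformed first.
log_changes_direction = not np.allclose(raw_unit[0], log_unit[0])

print(depth_invariant, log_changes_direction)  # True True
```

In other words, L2 normalization handles the scale of each cell, while the choice between raw and log-transformed counts still affects what is being compared.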

Regarding the tests that I've performed, the results obtained with raw counts seem slightly better than the ones obtained with log-normalized counts.

Another small doubt I have concerns the integration result, i.e., X_scanorama, that scanorama provides. Based on the tutorial mentioned above and the paper, my understanding is that this low-dimensional embedding is intended to be used for UMAP/t-SNE estimation and visualization (among other downstream tasks). For instance, the tutorial calculates a neighborhood graph and UMAP from this result:

# tsne and umap
sc.pp.neighbors(adata, n_pcs =50, use_rep = "Scanorama")
sc.tl.umap(adata)
sc.tl.tsne(adata, n_pcs = 50, use_rep = "Scanorama")

If X_scanorama is a low-dimensional embedding, should we plot it directly?

Thank you and sorry for the off topic question!

Best regards,

António

brianhie · 2022-01-20T15:38:15Z

Maintainer

Hi @antonioggsousa, this analysis: https://www.nature.com/articles/s41592-021-01336-8 reports that Scanorama works best with log normalization and scaling (they use Scanpy).

Yes, the output of Scanorama is the low dimensional embedding, which is used to compute the k-nearest neighbors graph, which is then used for visualization and clustering.
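As a rough illustration of the "embedding, then k-nearest neighbors graph" step, the sketch below builds a brute-force kNN index table from a made-up embedding. The array `embedding` is a random stand-in for X_scanorama and `knn_indices` is a hypothetical helper; Scanpy's `sc.pp.neighbors` builds a weighted graph with its own neighbor search, so this only shows the idea, not what Scanpy does internally.

```python
import numpy as np

def knn_indices(embedding, k):
    """For each row, return the indices of its k nearest rows (Euclidean), excluding itself."""
    emb = np.asarray(embedding, dtype=float)
    # Pairwise Euclidean distances (fine for small toy data, O(n^2) memory).
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

rng = np.random.default_rng(0)
# Stand-in for an integrated embedding: two well-separated groups of 5 cells in 3 dimensions.
embedding = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(5, 3)),
    rng.normal(loc=5.0, scale=0.1, size=(5, 3)),
])

neighbors = knn_indices(embedding, k=3)
print(neighbors.shape)  # (10, 3)
```

Clustering and UMAP/t-SNE then operate on a graph of this kind rather than on the embedding coordinates plotted directly.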

flde · 2023-07-11T10:43:18Z

@brianhie,

I think the documentation is not clear. Scanorama performs L2 normalization per cell with sklearn, and feature scaling through the defaults of its PCA function. This is not shown to the user because verbose=False is the default in the process_data function:

def process_data(datasets, genes, hvg=HVG, dimred=DIMRED, verbose=False):

This is confusing to me: if I use scanorama.correction with scaled log10(CPM) as recommended, the data get L2-normalized and scaled again. Does that make sense?
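The following sketch shows what "normalized again" means in practice. It is an assumption-laden toy, not a reproduction of Scanorama's internals: the random matrix `log_expr` stands in for log-expression values, and the two steps are plain NumPy. Applying a per-cell L2 normalization on top of gene-wise scaled data is not a no-op, because the genes no longer have unit variance afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up "log-expression" matrix: 200 cells x 4 genes.
log_expr = rng.normal(size=(200, 4))

# Step 1 (done by the user): scale each gene to zero mean and unit variance.
scaled = (log_expr - log_expr.mean(axis=0)) / log_expr.std(axis=0)

# Step 2 (a second, per-cell step): L2-normalize each cell to unit norm.
row_norms = np.linalg.norm(scaled, axis=1, keepdims=True)
renormalized = scaled / row_norms

gene_std_before = scaled.std(axis=0)        # all 1 by construction
gene_std_after = renormalized.std(axis=0)   # each strictly below 1

print(gene_std_before.round(3))
print(gene_std_after.round(3))
```

Whether this second pass helps or hurts integration is an empirical question that the sketch does not answer; it only shows that the input is changed by it.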

Best wishes,
Florian
