log-normalized or raw gene expression counts as input to Scanorama #113
Replies: 2 comments
-
Hi @antonioggsousa, this analysis: https://www.nature.com/articles/s41592-021-01336-8 reports that Scanorama works best with log normalization and scaling (they use Scanpy). And yes, the output of Scanorama is the low-dimensional embedding, which is used to compute the k-nearest neighbors graph, which in turn is used for visualization and clustering.
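In case it helps, here is a minimal sketch of that workflow with Scanpy. It assumes `adatas` is a list of AnnData objects holding raw counts, one per batch; the function name `integrate_and_embed` and the `scale` flag are only illustrative, and the normalization settings are common defaults rather than values prescribed by Scanorama.

```python
def integrate_and_embed(adatas, scale=False):
    """Log-normalize each batch, integrate with Scanorama, then build the kNN graph and UMAP."""
    # Imported inside the function so the sketch can be loaded without the packages installed.
    import anndata as ad
    import scanorama
    import scanpy as sc

    for adata in adatas:
        sc.pp.normalize_total(adata, target_sum=1e4)  # library-size normalization
        sc.pp.log1p(adata)                            # log transform
        if scale:
            sc.pp.scale(adata, max_value=10)          # optional per-gene scaling

    # Adds the low-dimensional embedding to adata.obsm["X_scanorama"] of every batch.
    scanorama.integrate_scanpy(adatas)

    # Concatenate the batches; the per-cell embeddings in .obsm are carried along.
    merged = ad.concat(adatas, label="batch")

    # The embedding is the input to the neighbor graph, not something to plot directly.
    sc.pp.neighbors(merged, use_rep="X_scanorama")
    sc.tl.umap(merged)
    return merged
```

After this, `sc.pl.umap(merged, color="batch")` shows how well the batches mix, and graph-based clustering can run on the same neighbor graph.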
-
I think the documentation is not clear. Scanorama L2-normalizes each cell with sklearn and scales the features through the defaults of its PCA function. None of this is shown to the user, because `verbose=False` is the default in the `process_data` function: `def process_data(datasets, genes, hvg=HVG, dimred=DIMRED, verbose=False):`
I find that confusing, because if I pass scaled log10(CPM) values to scanorama.correction, as recommended, they get L2-normalized and scaled again. Does that make sense?
Best wishes,
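To make the concern concrete, here is a small pure-Python sketch of what per-cell L2 normalization does to different inputs. The helpers `l2_normalize` and `log_normalize` are hypothetical stand-ins written for this example, not Scanorama's code, and the three-gene "cell" is made up.

```python
import math

def l2_normalize(cell):
    """Scale one cell's expression vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in cell))
    return [x / norm for x in cell] if norm > 0 else list(cell)

def log_normalize(cell, target_sum=1e4):
    """Library-size normalize to target_sum, then apply log(1 + x)."""
    total = sum(cell)
    return [math.log1p(x / total * target_sum) for x in cell]

def same_vector(a, b):
    return all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(a, b))

raw = [10.0, 0.0, 90.0]          # toy counts for three genes
deeper = [3 * x for x in raw]    # the same cell sequenced three times deeper

unit_raw = l2_normalize(raw)

# Sequencing depth cancels out: both cells map to the same unit vector.
same_direction = same_vector(unit_raw, l2_normalize(deeper))

# A second L2 normalization of an already unit-norm vector changes nothing.
idempotent = same_vector(unit_raw, l2_normalize(unit_raw))

# The log transform is non-linear, so it changes the direction of the vector.
unit_log = l2_normalize(log_normalize(raw))
changed_by_log = not same_vector(unit_raw, unit_log)

print(same_direction, idempotent, changed_by_log)
```

So the repeated L2 step is harmless by itself, but raw and log-normalized inputs are not equivalent after it, which is why the choice of input still matters.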
-
Dear @brianhie,
Thank you and your colleagues for developing `scanorama`! I'm testing it on a few "dummy" examples and I'm delighted with the results.
I read the paper as well as one of the tutorials mentioned in the GitHub README.md file.
To test `scanorama`, I ran it on a few toy data sets in addition to one example data set highlighted in the `scanorama` repository. When I started with the toy data sets, I provided scaled counts to `scanorama` by mistake, because I am not yet very familiar with `scanpy`, `anndata` and `python` in general. I therefore checked the paper and the tutorial again to find out which input `scanorama` requires. The tutorial mentions log-normalized gene expression counts at some point, whereas the paper mentions that `l2 normalization` is performed internally. If I understood correctly, this aims to standardize the cells to the same scale, i.e., to unit norm, so its application does not necessarily depend on previous normalization. My question is then: which should ideally be the input to `scanorama`, log-normalized or raw counts?
Regarding the tests that I've performed, the results obtained with raw counts seem slightly better than the ones obtained with log-normalized counts.
Another small doubt that I have concerns the integration result that `scanorama` provides, i.e., `X_scanorama`. My understanding, based on the tutorial mentioned above and the paper, is that this low-dimensional embedding is intended to be used for UMAP/t-SNE estimation and visualization (among other downstream tasks). For instance, in the tutorial they calculate a neighborhood graph and UMAP with this result.
If `X_scanorama` is a low-dimensional embedding, should we plot this directly?
Thank you and sorry for the off-topic question!
Best regards,
António