Using Scanpy's UMAP (calculated before merging) for adding trajectories in scvelo #1086

chris-31337 · 2023-06-13T12:48:03Z

Dear @WeilerP and the Scvelo team,

Based on your previous comments [1], I keep coming back to the question of whether it is "allowed" to plot scvelo (0.2.5) trajectories onto a UMAP computed in an earlier Scanpy (1.9.2) preprocessing pipeline on the same dataset (a pipeline that was run before merging with the spliced/unspliced layers).

I understand that raw counts should be supplied in .X to scvelo for reliable results [2,3]. But I am not sure to what extent I can reuse the UMAP calculated in Scanpy (before merging with the spliced/unspliced loom data) for scvelo. Reusing it would be advantageous because the trajectory data could then be plotted onto a familiar structure that is already used in other figures and previous analyses (leaving aside for the moment that a low-dimensional display of data has its own limitations and should not be overinterpreted).

In [1] you wrote, in response to a similar question (reusing a UMAP from Scanpy):

The problem with [this] approach is that the data used to calculate the neighbor graph and the data used to infer RNA velocity are not the same. This could lead to all kinds of unexpected results, and I am not sure how you'd interpret a velocity stream, for example, plotted on a 2D UMAP embedding generated from different data.

I would take this as "do NOT use UMAP calculated before merging, even from the same dataset, because the filtering and neighbor graph would be different", but this contrasts with the analysis templates used by many people I know and with some online tutorials [4,5]. Even the 10x tutorial [6] appears to import a UMAP calculated elsewhere (in its proprietary Loupe Browser) before merging and then overlay the velocity results onto it, without rerunning the UMAP calculation after filtering. Finally, importing UMAPs from Seurat seems to be possible as well [7], which suggests that a UMAP calculated elsewhere can be reused with trajectory data inferred later.

Unfortunately, the scvelo tutorial [3] skips the UMAP step and states that "[the tutorial data] has an already pre-computed UMAP embedding". Hence it remains unclear whether that pre-computed UMAP may also come from an independent preprocessing pipeline applied to the same dataset (before merging with the spliced/unspliced layers) or HAS to be (re-)generated from the merged and scvelo-filtered dataset.

Are all of the approaches involving preprocessed UMAPs wrong? Or am I overinterpreting/misunderstanding your statements in [1]?

What is the recommended way of proceeding?

So far, I think I have the following options:

1. Using Scanpy's preprocessing entirely and normalizing only the splicing layers after merging

import scanpy as sc
import scvelo as scv

adata = sc.read_h5ad('Scanpy_result.h5ad')
ldata = scv.read('Velocyto_run10x_output.loom')  # file name must be a string
merged = scv.utils.merge(adata, ldata)
scv.pp.filter_and_normalize(merged) # This only treats the extra layers because it recognizes that .X is already changed
scv.pp.moments(merged, n_pcs=30, n_neighbors=30) # or should n_pcs and n_neighbors be set to None here when reusing previous PCA from Scanpy?
scv.tl.velocity(merged)
scv.tl.velocity_graph(merged)
scv.pl.velocity_embedding_stream(merged, basis='umap', color=['leiden']) # using UMAP and leiden clusters from previous analysis

2. Reverting to raw counts in .X for scvelo but otherwise using Scanpy's UMAP

from scipy import sparse

adata = sc.read_h5ad('Scanpy_result.h5ad')
adata.X = sparse.csr_matrix(adata.layers["counts"]).astype(int) # count layer stored during scanpy analysis
ldata = scv.read('Velocyto_run10x_output.loom')  # file name must be a string
merged = scv.utils.merge(adata, ldata)
scv.pp.filter_and_normalize(merged) # This now treats the spliced/unspliced layers as well as .X
scv.pp.moments(merged, n_pcs=30, n_neighbors=30) # or should n_pcs and n_neighbors be set to None here when reusing previous PCA from Scanpy?
scv.tl.velocity(merged)
scv.tl.velocity_graph(merged)
scv.pl.velocity_embedding_stream(merged, basis='umap', color=['leiden']) # using UMAP and leiden clusters from previous analysis

=> This approach generates a highly similar result to the first approach.
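
To put a number on "highly similar", one option is to compare the velocity vectors of the two runs cell by cell. The following is only a minimal NumPy sketch, not part of scvelo: the helper name per_cell_cosine is made up for illustration, and it assumes both inputs are dense (cells x genes) arrays (for example taken from the 'velocity' layer of each run) that have already been restricted to the same cells and genes in the same order. NaN entries are treated as zero here, which is an assumption, not scvelo behaviour.

```python
import numpy as np

def per_cell_cosine(v_a, v_b):
    """Cosine similarity between matching rows of two (cells x genes) arrays."""
    v_a = np.asarray(v_a, dtype=float)
    v_b = np.asarray(v_b, dtype=float)
    if v_a.shape != v_b.shape:
        raise ValueError("both runs must cover the same cells and genes")
    # Assumption: missing velocities (NaN) contribute nothing to the comparison.
    v_a = np.nan_to_num(v_a)
    v_b = np.nan_to_num(v_b)
    num = np.sum(v_a * v_b, axis=1)
    den = np.linalg.norm(v_a, axis=1) * np.linalg.norm(v_b, axis=1)
    # Cells with an all-zero vector in either run get NaN instead of a division error.
    return np.divide(num, den, out=np.full(num.shape, np.nan), where=den > 0)

# Toy data standing in for the velocity layers of option 1 and option 2.
rng = np.random.default_rng(0)
run_1 = rng.normal(size=(5, 8))
run_2 = run_1 + 0.05 * rng.normal(size=(5, 8))  # slightly perturbed copy
similarity = per_cell_cosine(run_1, run_2)
print(np.round(similarity, 3))
```

Values close to 1 for most cells would support the impression that both options give nearly the same velocities. A high similarity only says the two runs agree with each other, not that either is correct.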

3. Reverting to raw counts in .X and redoing PCA and UMAP after merging

from scipy import sparse

adata = sc.read_h5ad('Scanpy_result.h5ad')
adata.X = sparse.csr_matrix(adata.layers["counts"]).astype(int)
ldata = scv.read('Velocyto_run10x_output.loom')  # file name must be a string
merged = scv.utils.merge(adata, ldata)
scv.pp.filter_and_normalize(merged)
sc.tl.pca(merged)
sc.pp.neighbors(merged)
scv.pp.moments(merged, n_pcs=None, n_neighbors=None)
scv.tl.velocity(merged)
scv.tl.velocity_graph(merged)
merged.obsm['X_umap_old'] = merged.obsm['X_umap'].copy() # keep the Scanpy UMAP before it is overwritten
scv.tl.umap(merged)
scv.pl.velocity_embedding_stream(merged, basis='umap', color=['leiden']) # using new UMAP but leiden clusters from previous analysis

=> This approach generates a very different UMAP, which is hard to interpret when comparing to the original scanpy analysis. However, the velocity arrows seem to generally point in the same relative direction with respect to the old leiden clusters.
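
Since the concern in [1] is that the neighbor graph behind the UMAP differs from the one used for velocity, it may help to measure how much the two graphs actually overlap instead of comparing UMAP plots by eye. Below is a minimal sketch under stated assumptions: the helper names knn_indices and knn_jaccard are hypothetical, the search is exact and brute-force (so only suitable for a few thousand cells or a subsample), and the inputs stand in for two per-cell representations with rows in the same cell order, such as the old and the recomputed PCA. It does not reproduce the neighbor search that Scanpy or scvelo perform.

```python
import numpy as np

def knn_indices(points, k):
    """Exact k nearest neighbours (excluding the cell itself) for each row."""
    points = np.asarray(points, dtype=float)
    if k >= len(points):
        raise ValueError("k must be smaller than the number of cells")
    sq = np.sum(points ** 2, axis=1)
    # Pairwise squared Euclidean distances.
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)  # a cell is not its own neighbour
    return np.argsort(d2, axis=1)[:, :k]

def knn_jaccard(rep_a, rep_b, k=30):
    """Per-cell Jaccard index between the neighbour sets of two representations."""
    nn_a = knn_indices(rep_a, k)
    nn_b = knn_indices(rep_b, k)
    scores = np.empty(len(nn_a))
    for i, (a, b) in enumerate(zip(nn_a, nn_b)):
        set_a, set_b = set(a.tolist()), set(b.tolist())
        scores[i] = len(set_a & set_b) / len(set_a | set_b)
    return scores

# Toy data standing in for the Scanpy PCA and the PCA recomputed after merging.
rng = np.random.default_rng(0)
pca_old = rng.normal(size=(60, 10))
pca_new = pca_old + 0.3 * rng.normal(size=(60, 10))
overlap = knn_jaccard(pca_old, pca_new, k=10)
print(f"mean neighbour overlap: {overlap.mean():.2f}")
```

A high mean overlap would suggest that the old UMAP and the velocity estimates rest on nearly the same neighborhoods even though the recomputed UMAP looks different. A low overlap would be a reason to be more careful with option 1 or 2.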

4. Redoing the entire analysis pipeline on untouched raw datasets, freshly merged using scvelo

=> This would mean repeating all steps around scrublet, QC cleanup of high mitochondrial and low ribosomal counts etc. with the merged dataset, as if all previous analysis had never happened.

Which of these four ways (if any) should be considered acceptable?

P.S.: In reference to [8], please note that my preprocessing DID determine highly variable genes but did NOT subset the dataset to HVG-only, since sc.tl.pca() defaults to using HVG anyway when .var['highly_variable'] is set [9]. Hence, the Scanpy_result.h5ad mentioned above contains the "full" dataset (minus removed duplicates and putatively dead cells).
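
To make the P.S. concrete, here is a minimal NumPy sketch of what "PCA on the HVGs without subsetting the dataset" amounts to. It is an illustration, not Scanpy's implementation: the function name pca_on_flagged_genes is made up, and the boolean vector flagged stands in for .var['highly_variable']. The matrix keeps all genes; only the flagged columns enter the decomposition, and the remaining genes get zero loadings.

```python
import numpy as np

def pca_on_flagged_genes(x, flagged, n_comps=2):
    """PCA restricted to flagged columns; the input matrix itself is not subset."""
    x = np.asarray(x, dtype=float)
    flagged = np.asarray(flagged, dtype=bool)
    sub = x[:, flagged]
    sub = sub - sub.mean(axis=0)  # center each flagged gene
    u, s, vt = np.linalg.svd(sub, full_matrices=False)
    scores = u[:, :n_comps] * s[:n_comps]        # per-cell coordinates
    loadings = np.zeros((x.shape[1], n_comps))   # one row per gene, all genes kept
    loadings[flagged] = vt[:n_comps].T
    return scores, loadings

# Toy data: 20 cells, 6 genes, 3 of them flagged as highly variable.
rng = np.random.default_rng(0)
expr = rng.normal(size=(20, 6))
hvg = np.array([True, False, True, False, True, False])
scores, loadings = pca_on_flagged_genes(expr, hvg)
print(scores.shape, loadings.shape)
```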

P.P.S.: Briefly, the preprocessing steps were as follows:

adata = sc.read_10x_mtx(...)
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt','ribo','hb'], percent_top=None, log1p=False, inplace=True)
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata = adata[adata.obs['pct_counts_mt'] < 15, :]
adata = adata[adata.obs['pct_counts_ribo'] > 5, :]
adata = adata[adata.obs['pct_counts_hb'] < 0.1, :]
adata = adata[adata.obs['n_genes_by_counts'] < 3000, :]
adata = adata[adata.obs['total_counts'] < 10000, :]
# (scrublet run here to predict doublets; call not shown)
adata = adata[adata.obs['predicted_doublets'] == False,:]
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1E4, inplace=True)
sc.pp.log1p(adata)
adata.raw = adata
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5, batch_key="sample")
# (no regression, no scaling for this dataset)
sc.tl.pca(adata, n_comps=50, svd_solver='arpack')
sc.pp.neighbors(adata, n_neighbors=20, n_pcs=30)
sc.tl.umap(adata)
sc.tl.leiden(adata) 
adata.write('Scanpy_result.h5ad')

References

[1] #775
[2] https://www.sc-best-practices.org/trajectories/rna_velocity.html
[3] https://scvelo.readthedocs.io/en/stable/VelocityBasics/
[4] https://smorabit.github.io/tutorials/8_velocyto/
[5] https://youtu.be/AUiYxtGJYtg?t=677
[6] https://www.10xgenomics.com/resources/analysis-guides/trajectory-analysis-using-10x-Genomics-single-cell-gene-expression-data
[7] #192
[8] #755
[9] https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.pca.html
