
UMAP on Billions of Data Points + Sample Weighting #1182

Open
ghiggi opened this issue Feb 3, 2025 · 3 comments

ghiggi commented Feb 3, 2025

Hi everyone,

I’m exploring the use of UMAP on a very large dataset (roughly 1–10 billion rows with 10–15 columns).
I’m aware that fitting UMAP directly on such a large dataset is impossible, so here is my current plan:

  • Round or bin the data (e.g., rounding to 1 decimal place or integer bins) to reduce granularity.
  • Deduplicate the resulting rows while counting occurrences of each unique row (so each row is associated with a frequency/count).

This brings the dataset size down to the 10–50 million range.
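
For reference, a minimal sketch of that binning/deduplication step with NumPy is below; `bin_and_count` and `chunk` are just illustrative names, and with billions of rows you would run this chunk-wise (e.g. per file, or with Dask) and merge the per-chunk counts afterwards:

```python
import numpy as np

def bin_and_count(X, decimals=1):
    """Round features to a fixed precision, then collapse duplicate rows
    and count how often each unique binned row occurs."""
    X_binned = np.round(X, decimals=decimals)
    unique_rows, counts = np.unique(X_binned, axis=0, return_counts=True)
    return unique_rows, counts

# Per-chunk usage (merging counts across chunks is omitted here):
# rows, counts = bin_and_count(chunk, decimals=1)
```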

Next, I want to incorporate the frequencies as sample weights (i.e., heavier weights for more frequent rows).

My question is: What is the best approach to incorporate sample weights into UMAP?

Some ideas I’ve considered include:

  • Custom distance metric that factors in sample frequency (a rough sketch follows after this list).
  • Precomputed distance matrix, although this might be infeasible for tens of millions of data points.
  • Custom sampling strategy prior to or during UMAP’s neighbor-finding step.
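
To make the first idea concrete, here is a rough, untested sketch that relies on umap-learn accepting custom metrics as numba-jitted functions of two 1D arrays. The weighting scheme itself (appending log-counts as an extra column and shrinking distances between frequent rows) is purely a heuristic I'm considering, not an established method:

```python
import numba
import numpy as np
import umap

@numba.njit()
def freq_aware_euclidean(x, y):
    # The last entry of each row holds log1p(count); the rest are features.
    d = 0.0
    for i in range(x.shape[0] - 1):
        diff = x[i] - y[i]
        d += diff * diff
    # Heuristic: shrink distances involving frequently observed rows so
    # they are more likely to be selected as neighbours.
    return np.sqrt(d) / (1.0 + 0.5 * (x[-1] + y[-1]))

# X_unique: deduplicated rows, counts: their frequencies
# X_aug = np.column_stack([X_unique, np.log1p(counts)])
# embedding = umap.UMAP(metric=freq_aware_euclidean).fit_transform(X_aug)
```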

I’d love to hear any suggestions, best practices, or experiences you’ve had with:

  • Scaling UMAP to very large datasets (beyond straightforward sampling).
  • Incorporating sample weights effectively in manifold learning.
  • Approaches or code snippets that demonstrate custom distance metrics or neighbor selection based on weights.

Thanks in advance for any insights you can share.

I’m hoping this discussion will help me (and others) handle extremely large datasets more effectively with UMAP!

cc @lmcinnes

lmcinnes (Owner) commented Feb 3, 2025 via email


abs51295 commented Feb 3, 2025

@ghiggi if you have access to an NVIDIA GPU, you can try computing nearest neighbors with CAGRA from cuVS. It's very fast, and on my data the recall against brute force was excellent using CAGRA's default parameters. I had a dataset of 125 million points in 15 dimensions and could fit it on an NVIDIA A100 with 80 GB of VRAM; perhaps you could try an H100 with more memory.
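
To expand on this a bit, here is a rough sketch of how CAGRA neighbors could be handed to umap-learn via its precomputed_knn parameter. The cuVS calls follow the `cuvs.neighbors.cagra` API as documented at the time of writing (check against your installed version), and `features.npy` is just a placeholder for your data:

```python
import cupy as cp
import numpy as np
import umap
from cuvs.neighbors import cagra

# Placeholder input: an (n_samples, n_features) float32 matrix.
X = np.load("features.npy").astype(np.float32)
X_gpu = cp.asarray(X)

# Build the CAGRA graph index on GPU and query each point's 15 neighbors.
index = cagra.build(cagra.IndexParams(), X_gpu)
distances, neighbors = cagra.search(cagra.SearchParams(), index, X_gpu, k=15)

knn_indices = cp.asnumpy(cp.asarray(neighbors)).astype(np.int64)
knn_dists = cp.asnumpy(cp.asarray(distances)).astype(np.float32)
# Depending on the metric, CAGRA may report squared L2 distances;
# apply np.sqrt here if UMAP should see plain Euclidean distances.

# Note: transforming new data later would also need the NN-descent search
# index; for a plain fit_transform the (indices, dists) pair is enough.
reducer = umap.UMAP(n_neighbors=15, precomputed_knn=(knn_indices, knn_dists))
embedding = reducer.fit_transform(X)
```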

ghiggi (Author) commented Feb 12, 2025

Thanks for your input!

The purpose of this issue was to gather some initial insights and potential research directions.

We plan to begin experimentation next month, and I’ll follow up with preliminary results and/or reproducible code once available.
Thanks @lmcinnes for pointing us to the smooth_knn_dist function: it’s definitely something we’ll look into.

Additionally, we plan to explore the TorchDR library, as its UMAP implementation could significantly accelerate our experiments by leveraging GPUs.

Looking forward to sharing updates soon!
