
UMAP on Billions of Data Points + Sample Weighting #1182

Open
ghiggi opened this issue Feb 3, 2025 · 3 comments

ghiggi commented Feb 3, 2025

Hi everyone,

I’m exploring the use of UMAP on a very large dataset (roughly 1–10 billion rows with 10–15 columns).
I’m aware that fitting UMAP directly on such a large dataset is impossible, so here is my current plan:

  • Round or bin the data (e.g., rounding to 1 decimal place or integer bins) to reduce granularity.
  • Deduplicate the resulting rows while counting occurrences of each unique row (so each row is associated with a frequency/count).

This brings the dataset size down to the 10–50 million range.
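
For reference, a minimal sketch of that binning/deduplication step with NumPy is below; `bin_and_count` and `chunk` are just illustrative names, and with billions of rows you would run this chunk-wise (e.g. per file, or with Dask) and merge the per-chunk counts afterwards:

```python
import numpy as np

def bin_and_count(X, decimals=1):
    """Round features to a fixed precision, then collapse duplicate rows
    and count how often each unique binned row occurs."""
    X_binned = np.round(X, decimals=decimals)
    unique_rows, counts = np.unique(X_binned, axis=0, return_counts=True)
    return unique_rows, counts

# Per-chunk usage (merging counts across chunks is omitted here):
# rows, counts = bin_and_count(chunk, decimals=1)
```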

Next, I want to incorporate the frequencies as sample weights (i.e., heavier weights for more frequent rows).

My question is: What is the best approach to incorporate sample weights into UMAP?

Some ideas I’ve considered include:

  • Custom distance metric that factors in sample frequency (a rough sketch follows after this list).
  • Precomputed distance matrix, although this might be infeasible for tens of millions of data points.
  • Custom sampling strategy prior to or during UMAP’s neighbor-finding step.
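
To make the first idea concrete, here is a rough, untested sketch that relies on umap-learn accepting custom metrics as numba-jitted functions of two 1D arrays. The weighting scheme itself (appending log-counts as an extra column and shrinking distances between frequent rows) is purely a heuristic I'm considering, not an established method:

```python
import numba
import numpy as np
import umap

@numba.njit()
def freq_aware_euclidean(x, y):
    # The last entry of each row holds log1p(count); the rest are features.
    d = 0.0
    for i in range(x.shape[0] - 1):
        diff = x[i] - y[i]
        d += diff * diff
    # Heuristic: shrink distances involving frequently observed rows so
    # they are more likely to be selected as neighbours.
    return np.sqrt(d) / (1.0 + 0.5 * (x[-1] + y[-1]))

# X_unique: deduplicated rows, counts: their frequencies
# X_aug = np.column_stack([X_unique, np.log1p(counts)])
# embedding = umap.UMAP(metric=freq_aware_euclidean).fit_transform(X_aug)
```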

I’d love to hear any suggestions, best practices, or experiences you’ve had with:

  • Scaling UMAP to very large datasets (beyond straightforward sampling).
  • Incorporating sample weights effectively in manifold learning.
  • Approaches or code snippets that demonstrate custom distance metrics or neighbor selection based on weights.

Thanks in advance for any insights you can share.

I’m hoping this discussion will help me (and others) handle extremely large datasets more effectively with UMAP!

cc @lmcinnes

lmcinnes (Owner) commented Feb 3, 2025 via email


abs51295 commented Feb 3, 2025

@ghiggi if you have access to an NVIDIA GPU, you can try computing nearest neighbors with CAGRA from cuVS. It's very fast, and on my data the recall against brute force was excellent using CAGRA's default parameters. I had a dataset of 125 million points in 15 dimensions and could fit it on an NVIDIA A100 with 80 GB of VRAM; perhaps you could try an H100 with more memory.
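
To expand on this a bit, here is a rough sketch of how CAGRA neighbors could be handed to umap-learn via its precomputed_knn parameter. The cuVS calls follow the `cuvs.neighbors.cagra` API as documented at the time of writing (check against your installed version), and `features.npy` is just a placeholder for your data:

```python
import cupy as cp
import numpy as np
import umap
from cuvs.neighbors import cagra

# Placeholder input: an (n_samples, n_features) float32 matrix.
X = np.load("features.npy").astype(np.float32)
X_gpu = cp.asarray(X)

# Build the CAGRA graph index on GPU and query each point's 15 neighbors.
index = cagra.build(cagra.IndexParams(), X_gpu)
distances, neighbors = cagra.search(cagra.SearchParams(), index, X_gpu, k=15)

knn_indices = cp.asnumpy(cp.asarray(neighbors)).astype(np.int64)
knn_dists = cp.asnumpy(cp.asarray(distances)).astype(np.float32)
# Depending on the metric, CAGRA may report squared L2 distances;
# apply np.sqrt here if UMAP should see plain Euclidean distances.

# Note: transforming new data later would also need the NN-descent search
# index; for a plain fit_transform the (indices, dists) pair is enough.
reducer = umap.UMAP(n_neighbors=15, precomputed_knn=(knn_indices, knn_dists))
embedding = reducer.fit_transform(X)
```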

ghiggi (Author) commented Feb 12, 2025

Thanks for your input!

The purpose of this issue was to gather some initial insights and potential research directions.

We plan to begin experimentation next month, and I’ll follow up with preliminary results and/or reproducible code once available.
Thanks @lmcinnes for pointing us to the smooth_knn_dist function: it’s definitely something we’ll look into.

Additionally, we plan to explore the TorchDR library, as its UMAP implementation could significantly accelerate our experiments by leveraging GPUs.

Looking forward to sharing updates soon!
