UMAP on Billions of Data Points + Sample Weighting #1182
Comments
Sample weighting is something I have given quite a bit of thought to over
the years, but I didn't manage to come up with a truly satisfactory
solution, so it is not something I have actually implemented. I think in
essence what you want is to have the nearest neighbour computation proceed
as normal, but the smooth_knn_dist computation account for sample weights.
Then you'll want to have the optimization phase of the algorithm also
account for sample weights when sampling edges. The first part of that is
easy and tractable, the second is rather more tricky given the current
optimization approach. Perhaps we can work out something that might be
"good enough" for your needs to get the job done?
@ghiggi if you have access to an NVIDIA GPU, you can try computing nearest neighbors with CAGRA from cuVS. It's very fast, and on my data the recall against brute force was excellent with CAGRA's default parameters. I had a dataset of 125 million data points with 15 dimensions and could fit it on an NVIDIA A100 with 80 GB of VRAM. Perhaps you could try an H100 with even more memory.
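For anyone trying this route, a rough sketch could look like the following; the cuVS calls and parameter names are from memory and may differ between versions, and `X` is assumed to be the already binned/deduplicated array. The idea is to build a CAGRA index, query k neighbours, and hand the result to umap-learn via its `precomputed_knn` argument so it skips its own neighbour search.

```python
import numpy as np
import cupy as cp
import umap
from cuvs.neighbors import cagra

# X: the binned / deduplicated rows, shape (n_samples, 10-15)
X_gpu = cp.asarray(X, dtype=cp.float32)

# Build an approximate nearest-neighbour index on the GPU.
index = cagra.build(cagra.IndexParams(metric="sqeuclidean"), X_gpu)

k = 15  # match UMAP's n_neighbors
distances, neighbors = cagra.search(cagra.SearchParams(), index, X_gpu, k)

knn_indices = cp.asnumpy(cp.asarray(neighbors)).astype(np.int64)
knn_dists = np.sqrt(cp.asnumpy(cp.asarray(distances)))  # sqeuclidean -> euclidean

# umap-learn accepts a precomputed kNN graph, skipping its own NN search.
reducer = umap.UMAP(n_neighbors=k, metric="euclidean",
                    precomputed_knn=(knn_indices, knn_dists))
embedding = reducer.fit_transform(X)
```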
Thanks for your input! The purpose of this issue was to gather some initial insights and potential research directions. We plan to begin experimentation next month, and I'll follow up with preliminary results and/or reproducible code once available. Additionally, we plan to explore the TorchDR library, as its UMAP implementation could significantly accelerate our experiments by leveraging GPUs. Looking forward to sharing updates soon!
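In case it's useful to others following along, the TorchDR route is roughly the following; this is only a minimal sketch assuming TorchDR's scikit-learn-style `UMAP` estimator, and device/backend handling may differ between versions.

```python
import numpy as np
from torchdr import UMAP  # GPU-backed, scikit-learn-style UMAP

# X would be the binned / deduplicated rows; random placeholder data here.
X = np.random.rand(100_000, 12).astype(np.float32)

reducer = UMAP(n_neighbors=15, n_components=2)
embedding = reducer.fit_transform(X)
```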
Hi everyone,
I’m exploring the use of UMAP on a very large dataset (roughly 1–10 billion rows with 10–15 columns).
I’m aware that fitting UMAP directly on such a large dataset is impossible, so here is my current plan:
- Round or bin the data (e.g., rounding to 1 decimal place or integer bins) to reduce granularity.
- Deduplicate the resulting rows while counting occurrences of each unique row (so each row is associated with a frequency/count).
This brings the dataset size down to the 10–50 million range.
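A compact way to do the binning and deduplication is to round each chunk, count duplicate rows, and merge the partial counts, so the raw 1–10 billion rows never have to fit in memory at once. This is only a sketch: the chunk source, column handling, and bin size are placeholders.

```python
import pandas as pd

def bin_and_count(chunks, decimals=1):
    """Round each chunk, count duplicate rows, then merge the partial counts."""
    partial = []
    for chunk in chunks:                      # e.g. pieces of a Parquet dataset
        binned = chunk.round(decimals)        # or integer bins via np.floor
        counts = binned.value_counts().rename("count").reset_index()
        partial.append(counts)
    merged = pd.concat(partial, ignore_index=True)
    cols = [c for c in merged.columns if c != "count"]
    return merged.groupby(cols, as_index=False)["count"].sum()

# deduped = bin_and_count(chunk_iterator)
# X = deduped.drop(columns="count").to_numpy()
# sample_weight = deduped["count"].to_numpy()   # the frequencies used as weights
```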
Next, I want to incorporate the frequencies as sample weights (i.e., heavier weights for more frequent rows).
My question is: What is the best approach to incorporate sample weights into UMAP?
Some ideas I’ve considered include:
- Custom distance metric that factors in sample frequency.
- Precomputed distance matrix, although this might be infeasible for tens of millions of data points.
- Custom sampling strategy prior to or during UMAP’s neighbor-finding step.
I’d love to hear any suggestions, best practices, or experiences you’ve had with:
- Scaling UMAP to very large datasets (beyond straightforward sampling).
- Incorporating sample weights effectively in manifold learning.
- Approaches or code snippets that demonstrate custom distance metrics or neighbor selection based on weights.
Thanks in advance for any insights you can share.
I’m hoping this discussion will help me (and others) handle extremely large datasets more effectively with UMAP!
cc @lmcinnes