
Embedding Quality Difference #1161

Open
erenozcelik opened this issue Nov 18, 2024 · 5 comments

Comments

@erenozcelik

Hello @timsainb ,

When using Parametric UMAP for supervised tasks, the quality of the embeddings is significantly worse compared to the embeddings produced by standard UMAP. This difference is observed across multiple datasets and configurations. What could be the reason and can it be improved?

erenozcelik changed the title from Embeddin to Embedding Quality Difference on Nov 18, 2024
@timsainb
Collaborator

If you provide a specific issue and a Colab link reproducing it, I can take a look. As it stands, the issue is described too vaguely.

@erenozcelik
Author

erenozcelik commented Dec 5, 2024

Hi,

Here is the Colab link comparing Parametric UMAP and standard UMAP on supervised FMNIST.
Open in Colab

@erenozcelik
Author

Hello @timsainb,

Were you able to look at it?

@timsainb
Collaborator

timsainb commented Feb 1, 2025

Thanks for providing the Colab notebook. Note that you are plotting the results on the training data here, not the held-out test data. This distinction matters when you consider the difference between Parametric UMAP and UMAP. Supervised nonparametric UMAP performs an embedding by balancing your distance metric in data space (e.g. Euclidean distance) against distance in categorical space. If you were to set the balance to 100% categorical distance, you would get perfect separation between classes, but it wouldn't practically tell you anything about your data. Parametric UMAP can't do that, because the embedding is parametrically related to the input data through a neural network. Imagine you sampled data as two classes from the same Gaussian distribution: since both classes come from the same distribution, even a supervised neural network won't let you separate them.
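The Gaussian thought experiment above can be made concrete with a small stand-alone sketch (pure Python, no UMAP involved): when two classes are drawn from the same distribution, the labels carry no information about the inputs, so no function of the inputs (a 1-nearest-neighbor classifier here, but equally a parametric encoder) can separate them beyond chance.

```python
import random
import math

random.seed(0)
n = 400
# Two "classes" drawn from the SAME 2-D Gaussian: the labels carry no
# information about the inputs, so no function of X can separate them.
X = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(2 * n)]
y = [0] * n + [1] * n

def nearest_label(i):
    # Leave-one-out 1-nearest-neighbor lookup.
    best_j, best_d = None, math.inf
    xi, yi = X[i]
    for j, (xj, yj) in enumerate(X):
        if j == i:
            continue
        d = (xi - xj) ** 2 + (yi - yj) ** 2
        if d < best_d:
            best_j, best_d = j, d
    return y[best_j]

acc = sum(nearest_label(i) == y[i] for i in range(2 * n)) / (2 * n)
# Accuracy hovers around chance (0.5), no matter how expressive the model.
print(f"1-NN accuracy on identically distributed classes: {acc:.2f}")
```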

@erenozcelik
Author

Thanks for your quick response.
I concatenated the training and test datasets at the beginning and used the same fused dataset for both supervised methods, so there is no held-out test data in this case.

```python
# Combine train and test images
train_images = np.concatenate((train_images, test_images), axis=0)
```

Just for the sake of completeness, and to address your important remark, I separated out the test dataset and trained on the training dataset only, then plotted the test embeddings from both parametric and nonparametric UMAP. I cannot identify a significant difference between the two methods on the held-out test data.

According to my understanding, the nearest-neighbor graph (NNG) in data space for supervised nonparametric UMAP is only modified where two neighboring data points belong to different categories. Otherwise, the NNG should be effectively the same as in the unsupervised setting and should tell us the same information about our data. By balancing, do you mean the default target_weight=0.5, i.e. 50% categorical distance and 50% Euclidean distance, in the supervised nonparametric UMAP setting (which has an impact on far_dist)?
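A minimal sketch of the two mechanisms being discussed, based on a reading of umap-learn's supervised code path; the constant 2.5 and the exact down-weighting rule are assumptions to verify against your installed version, and the function names here are illustrative, not the library's:

```python
import math

def far_dist_from_target_weight(target_weight):
    # Assumed mapping used by umap-learn's UMAP.fit for categorical targets:
    # the more weight the labels get, the larger the "far" distance assigned
    # to neighbor pairs whose labels disagree.
    if target_weight < 1.0:
        return 2.5 / (1.0 - target_weight)
    return 1.0e12  # labels dominate completely

def intersect_edge(weight, label_i, label_j, far_dist):
    # Same-label edges pass through unchanged; different-label edges are
    # exponentially down-weighted -- consistent with the observation that
    # the graph is only modified where neighbors disagree on class.
    if label_i == label_j:
        return weight
    return weight * math.exp(-far_dist)

fd = far_dist_from_target_weight(0.5)    # default target_weight -> 5.0
print(fd)
print(intersect_edge(0.8, 0, 0, fd))     # same label: unchanged
print(intersect_edge(0.8, 0, 1, fd))     # different label: heavily damped
```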

Shouldn't the NNGs (self.graph_) match in embedding space for parametric and nonparametric UMAP when we additionally use categorical information, since the fit method of the nonparametric base class is called from Parametric UMAP (super().fit(X, y))? The initialization of the embedding space does differ in the nonparametric case (spectral embedding), which is later optimized with the Euclidean layout function. I also tried pre-training the encoder weights of Parametric UMAP on the spectral embedding, but the final embeddings did not differ considerably. I expected the weight updates of the encoder network to produce embeddings similar to the Euclidean layout, since both use negative sampling, similar gradient clipping, etc.
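The shared-graph question above can be illustrated with a toy stand-in for the class relationship. The class and attribute names mirror umap-learn (UMAP, ParametricUMAP, graph_), but the method bodies are hypothetical placeholders: the point is only that an inherited fit builds graph_ identically, while the layout step is what the subclass overrides.

```python
# Toy stand-in for the inheritance structure discussed above; NOT the real
# implementation, just the delegation pattern.
class UMAP:
    def fit(self, X, y=None):
        # Shared step: build the fuzzy graph (self.graph_), including any
        # categorical intersection when y is given.
        self.graph_ = self._build_graph(X, y)
        self._optimize_layout()
        return self

    def _build_graph(self, X, y):
        return ("fuzzy-graph", tuple(X), tuple(y) if y is not None else None)

    def _optimize_layout(self):
        # Nonparametric path: spectral initialization + SGD on embeddings.
        self.layout_ = "spectral+sgd"


class ParametricUMAP(UMAP):
    # fit() is inherited, so graph construction is identical by construction;
    # only the layout step is overridden to train an encoder network.
    def _optimize_layout(self):
        self.layout_ = "encoder-network"


X, y = [1, 2, 3], [0, 0, 1]
a = UMAP().fit(X, y)
b = ParametricUMAP().fit(X, y)
print(a.graph_ == b.graph_)   # same graph
print(a.layout_, b.layout_)   # different optimization paths
```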

Colab Notebook
