Replies: 1 comment 1 reply
-
Thank you for sharing! This was discussed a while ago (see also this perspective) when it was first released, and I agree with the general sentiment expressed there. Although you can't arbitrarily apply cosine similarity to all types of embeddings, it is generally fine with dedicated embedding models, since most of them are optimized for either cosine similarity or dot product (when the embeddings are used primarily for retrieval). That is also why I generally advise using representation models that were trained specifically to be embedding models. Encoder-only models like BERT are not good embedding models unless they are trained or fine-tuned as such. Using a representation foundation model like vanilla BERT is therefore not advised, whereas BERT models fine-tuned for embedding purposes typically are. Either way, I would highly recommend reading through the links above to get a sense of where many stand (including the current maintainer of sentence-transformers).
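To make the distinction concrete, a minimal sketch of that recommended setup might look like the following. The `all-MiniLM-L6-v2` model and the 20 newsgroups corpus are placeholders only; substitute your own model and documents, and check the model card for whether the model targets cosine similarity or dot product:

```python
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer, util
from bertopic import BERTopic

# Any reasonably sized corpus works; 20 newsgroups is just a convenient example.
docs = fetch_20newsgroups(
    subset="all", remove=("headers", "footers", "quotes")
)["data"][:1000]

# A model fine-tuned to produce sentence embeddings (trained with cosine
# similarity in mind), as opposed to a raw encoder-only model like vanilla BERT.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)

# Sanity check: pairwise cosine similarities between a few document embeddings.
print(util.cos_sim(embeddings[:3], embeddings[:3]))

# Pass both the model and the precomputed embeddings to BERTopic so that its
# downstream components operate on embeddings trained for this metric.
topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs, embeddings)
```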
-
Hi BERTopic community,
I recently came across the paper "Is Cosine-Similarity of Embeddings Really About Similarity?" by Steck et al. (https://arxiv.org/abs/2403.05440) which raises some important points about the potential pitfalls of blindly using cosine similarity on embeddings derived from regularized models.
As a user of BERTopic, which relies on cosine similarity of BERT embeddings in several components, I wanted to open a discussion on how the issues highlighted in this paper might affect BERTopic and what best practices we should follow as a community.
Specifically, I'd be very interested to hear thoughts from @MaartenGr, as well as experiences and best practices from other users: how can we make sure we're using cosine similarity responsibly in our BERTopic workflows, in light of this research?
Thanks in advance for the discussion!