Replies: 1 comment 1 reply
-
Thank you for sharing! This was discussed a while ago (see also this perspective) when it was first released, and I agree with the general sentiment expressed there. Although you can't arbitrarily apply cosine similarity to all types of embeddings, it is generally fine with dedicated embedding models, since most of them are optimized for either cosine similarity or dot product (when the embeddings are used primarily for retrieval). That is also why I generally advise using representation models that were trained specifically to be embedding models. Encoder-only models like BERT are not good embedding models unless they are trained or fine-tuned as such. Using a representation foundation model like vanilla BERT is therefore not advised, whereas BERT models fine-tuned for embedding purposes typically are. Either way, I would highly recommend reading through the links above to get a sense of where many stand (including the current maintainer of sentence-transformers).
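To make the distinction concrete, a minimal sketch of that recommended setup might look like the following. The `all-MiniLM-L6-v2` model and the 20 newsgroups corpus are placeholders only; substitute your own model and documents, and check the model card for whether the model targets cosine similarity or dot product:

```python
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer, util
from bertopic import BERTopic

# Any reasonably sized corpus works; 20 newsgroups is just a convenient example.
docs = fetch_20newsgroups(
    subset="all", remove=("headers", "footers", "quotes")
)["data"][:1000]

# A model fine-tuned to produce sentence embeddings (trained with cosine
# similarity in mind), as opposed to a raw encoder-only model like vanilla BERT.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)

# Sanity check: pairwise cosine similarities between a few document embeddings.
print(util.cos_sim(embeddings[:3], embeddings[:3]))

# Pass both the model and the precomputed embeddings to BERTopic so that its
# downstream components operate on embeddings trained for this metric.
topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs, embeddings)
```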
-
Hi BERTopic community,
I recently came across the paper "Is Cosine-Similarity of Embeddings Really About Similarity?" by Steck et al. (https://arxiv.org/abs/2403.05440) which raises some important points about the potential pitfalls of blindly using cosine similarity on embeddings derived from regularized models.
As a user of BERTopic, which relies on cosine similarity of BERT embeddings in several components, I wanted to open a discussion on how the issues highlighted in this paper might affect BERTopic and what best practices we should follow as a community.
Specifically, I'd be very interested to hear thoughts from @MaartenGr, as well as experiences and best practices from other users: how can we make sure we're using cosine similarity responsibly in our BERTopic workflows, in light of this research?
Thanks in advance for the discussion!