Hi, I am currently exploring BERTopic for a new project. I am trying to group together messages that are part of a conversation and generate the topic of the conversation. So if we have a sequence of 100 messages exchanged between 5 people, they might talk about something, then go on a tangent and talk about something else. There might be questions, answers, statements, etc.

It seems that one of the steps in BERTopic modeling is clustering: BERTopic clusters messages based on semantic similarity (https://maartengr.github.io/BERTopic/getting_started/clustering/clustering.html). Based on some experiments, it seems this won't work for my use case. The reason is that although a question and its answer are part of the natural flow of a conversation, they may have low semantic similarity. Consider the following question/answer pair: the cosine similarity score between the two sentences is 0.3524. This is a low score, so BERTopic will assign them to different topics, even though the question and answer belong to a single conversation/topic.

Do I have the right understanding of BERTopic topic modeling and clustering? If yes, what are my alternatives? I appreciate any help! Thank you.
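For reference, a minimal sketch of how such a similarity score can be computed with sentence-transformers. The `all-MiniLM-L6-v2` model and the question/answer pair are made-up stand-ins, since the original sentences are not shown:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical model choice; any semantic-similarity embedding model works here
model = SentenceTransformer("all-MiniLM-L6-v2")

# Made-up Q/A pair standing in for the original (elided) sentences
question = "What time does the meeting start tomorrow?"
answer = "Ten, but the room booking only starts at quarter past."

# Encode both sentences and compute their cosine similarity
embeddings = model.encode([question, answer])
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {score:.4f}")  # Q/A pairs often score low
```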
Hi Yogesh,

Thanks for sharing your use case. This is indeed quite difficult when you are using an embedding model that is trained for semantic similarity.

It will depend on how abstract or specific your topics need to be, as there are tricks to still combine them. For instance, you could simply combine consecutive sentences using a sliding window (e.g., sentence1 + sentence2, sentence2 + sentence3, etc.) and feed those into BERTopic. That will increase the semantic content of your documents and might result in better representations (see the first sketch below).

The second thing, which I am not familiar with myself, is to look into graph-based clustering methods. Due to its conversational, graph-like structure, your data might not be modeled well by embedding-based clustering alone, so using an algorithm that models the data better could be the solution here (see the second sketch below).

Cheers,
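A rough sketch of the sliding-window idea, assuming the messages are already in conversational order. The `load_conversation` helper and `window_size=3` are placeholders to adapt to your data:

```python
from bertopic import BERTopic

def sliding_window(messages, window_size=2):
    """Concatenate consecutive messages (m1+m2, m2+m3, ...) so each
    document carries conversational context, not a single utterance."""
    return [
        " ".join(messages[i:i + window_size])
        for i in range(len(messages) - window_size + 1)
    ]

messages = load_conversation()  # hypothetical loader for the 100 ordered messages
docs = sliding_window(messages, window_size=3)

# Fit BERTopic on the context-enriched documents; note that the default
# UMAP/HDBSCAN pipeline needs a reasonable number of documents to cluster well
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
```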
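For the graph-based direction, one way to experiment is to plug a custom cluster model into BERTopic: the clustering docs linked above show that any model exposing `fit()` and a `labels_` attribute can be passed as `hdbscan_model`. The sketch below is only an assumption of what such a model could look like, using a k-NN graph plus greedy modularity communities; the neighbourhood size is a parameter to tune:

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.neighbors import kneighbors_graph
from bertopic import BERTopic

class KNNGraphCommunities:
    """Cluster reduced embeddings via community detection on a k-NN graph."""

    def __init__(self, n_neighbors=10):
        self.n_neighbors = n_neighbors
        self.labels_ = None

    def fit(self, X):
        # Build a k-nearest-neighbour connectivity graph over the embeddings
        adjacency = kneighbors_graph(X, n_neighbors=self.n_neighbors,
                                     mode="connectivity")
        graph = nx.from_scipy_sparse_array(adjacency)

        # Detect communities and turn them into cluster labels
        communities = greedy_modularity_communities(graph)
        labels = np.empty(X.shape[0], dtype=int)
        for label, nodes in enumerate(communities):
            labels[list(nodes)] = label
        self.labels_ = labels
        return self

# Swap the graph-based model in for HDBSCAN; probabilities are not
# produced by non-HDBSCAN cluster models
topic_model = BERTopic(hdbscan_model=KNNGraphCommunities(n_neighbors=10))
topics, _ = topic_model.fit_transform(docs)
```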