Hi, I am currently exploring BERTopic for a new project. I am trying to group together messages that are part of a conversation and generate the topic of the conversation. So if we have a sequence of 100 messages exchanged between 5 people, they might talk about something, then go on a tangent and talk about something else. There might be questions, answers, statements, etc.

It seems that one of the steps in BERTopic modeling is clustering: BERTopic clusters messages based on semantic similarity (https://maartengr.github.io/BERTopic/getting_started/clustering/clustering.html). Based on some experiments, it seems this won't work for my use case. The reason is that although a question and its answer are part of the natural flow of a conversation, they may have low semantic similarity. Consider the following question/answer pair: the cosine similarity score between the two sentences is 0.3524. This is a low score, so BERTopic will assign them to different topics, even though the question and answer belong to a single conversation/topic.

Do I have the right understanding of BERTopic topic modeling and clustering? If yes, what are my alternatives? I appreciate any help! Thank you.
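For reference, a minimal sketch of how such a similarity score can be computed with sentence-transformers. The `all-MiniLM-L6-v2` model and the question/answer pair are made-up stand-ins, since the original sentences are not shown:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical model choice; any semantic-similarity embedding model works here
model = SentenceTransformer("all-MiniLM-L6-v2")

# Made-up Q/A pair standing in for the original (elided) sentences
question = "What time does the meeting start tomorrow?"
answer = "Ten, but the room booking only starts at quarter past."

# Encode both sentences and compute their cosine similarity
embeddings = model.encode([question, answer])
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {score:.4f}")  # Q/A pairs often score low
```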
Hi Yogesh,

Thanks for sharing your use case. This is indeed quite difficult when you are using an embedding model that is trained for semantic similarity.

It will depend on how abstract or specific your topics need to be, as there are tricks to still combine them. For instance, you could simply combine consecutive sentences using a sliding window (e.g., sentence1 + sentence2, sentence2 + sentence3, etc.) and feed those into BERTopic. That will increase the semantic content of your documents and might result in better representations (see the first sketch below).

The second thing, which I am not familiar with myself, is to look into graph-based clustering methods. Due to its conversational, graph-like structure, your data might not be modeled well by embedding-based clustering alone, so using an algorithm that models the data better could be the solution here (see the second sketch below).

Cheers,
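A rough sketch of the sliding-window idea, assuming the messages are already in conversational order. The `load_conversation` helper and `window_size=3` are placeholders to adapt to your data:

```python
from bertopic import BERTopic

def sliding_window(messages, window_size=2):
    """Concatenate consecutive messages (m1+m2, m2+m3, ...) so each
    document carries conversational context, not a single utterance."""
    return [
        " ".join(messages[i:i + window_size])
        for i in range(len(messages) - window_size + 1)
    ]

messages = load_conversation()  # hypothetical loader for the 100 ordered messages
docs = sliding_window(messages, window_size=3)

# Fit BERTopic on the context-enriched documents; note that the default
# UMAP/HDBSCAN pipeline needs a reasonable number of documents to cluster well
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
```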
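For the graph-based direction, one way to experiment is to plug a custom cluster model into BERTopic: the clustering docs linked above show that any model exposing `fit()` and a `labels_` attribute can be passed as `hdbscan_model`. The sketch below is only an assumption of what such a model could look like, using a k-NN graph plus greedy modularity communities; the neighbourhood size is a parameter to tune:

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.neighbors import kneighbors_graph
from bertopic import BERTopic

class KNNGraphCommunities:
    """Cluster reduced embeddings via community detection on a k-NN graph."""

    def __init__(self, n_neighbors=10):
        self.n_neighbors = n_neighbors
        self.labels_ = None

    def fit(self, X):
        # Build a k-nearest-neighbour connectivity graph over the embeddings
        adjacency = kneighbors_graph(X, n_neighbors=self.n_neighbors,
                                     mode="connectivity")
        graph = nx.from_scipy_sparse_array(adjacency)

        # Detect communities and turn them into cluster labels
        communities = greedy_modularity_communities(graph)
        labels = np.empty(X.shape[0], dtype=int)
        for label, nodes in enumerate(communities):
            labels[list(nodes)] = label
        self.labels_ = labels
        return self

# Swap the graph-based model in for HDBSCAN; probabilities are not
# produced by non-HDBSCAN cluster models
topic_model = BERTopic(hdbscan_model=KNNGraphCommunities(n_neighbors=10))
topics, _ = topic_model.fit_transform(docs)
```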