Clarification on find_topics() #1789

owwdesilva · 2024-02-04T19:06:30Z

owwdesilva
Feb 4, 2024

I'm trying to detect topics for a set of documents. My documents are stored in a list that contains a mix of single words and full sentences. When I change the order of the documents, I get different outputs from the find_topics() method.

This is what my documents list looks like:

Version 1:
["hey y'all, it's Liz!", "Hey, today we're going to be making a real easy mandarin orange cake.", "The first thing you're going to need to do is pre-heat your oven to 350 degrees.", "You're also going to need a 9 by 13 greased pan.", "You're going to need one can mandarin and oranges, and this is a 15 ounce can.", "You're going to need a box of yellow cake mix.", "You're going to need 1/2 a cup of vegetable oil, and you're also going to need four eggs.", 'Okay, be right back.', 'Okay, so in our mixing bowl we have our cake mix and our mandarin oranges.', "We're also going to go ahead and pour in our vegetable oil and also our four eggs, and then we're just going to mix this up with a hand mixer and pour it in our pan.", "Okay, so now that I've got it all mixed up, what I'm going to do now is pour it in my 9 by 13 pan.", "We're going to cook this for 30 to 35 minutes.", "I'm going to start checking it at the 30 minute mark by inserting a skewer into the center of the cake to see if it comes out clean.", "Okay, I'll check back with you in just a little bit.", "Ok, so to make the topping for this cake, you're going to need a 20 ounce can of crushed pineapples that's been drained.", "You're also going to need a small package of jell-o instant pudding and pie filling, and you're going to mix these two together.", "Then you're going to fold in an 8 ounce tub of Cool Whip, and once you do that, when the cake comes out and it has completely cooled (I just lost my train of thought) completely cooled, then we're going to frost it with this that we just made up.", "Okay, I'll see you guys back here then.", 'Bye.', "Okay, so here's the cake.", 'I let it cook for 30 minutes and I tested it, and it was done.', "So now that I've let this cool for a minute or for like an hour, so it's cool, I'm going to go ahead and make the topping.", "And already I have the vanilla pudding inside, and I just added the pineapples, the draining pineapples, and we're just going to mix this to 
combined it, and then we're going to fold in all with topping.", "Now that the topping is all mixed up, we're just going to take it and spread it all over the cake, and then we're going to refrigerate it and be done.", 'Ok, so here is our easy orange mandarin cake.', "I'm going to stick this in the frigerator because I'm not ready to have a piece yet, but it was very easy to make.", "I hope you guys enjoy this recipe, and this is the last cake I'm going to make before vacation, and I will not make another video on desserts or anything like that until I get to San Antonio the second week in February, and I'll be filming from Wayne's house then.", 'Ok, see you guys later.', 'Bye.', '.', 'Candy', 'Food', 'Sweets', 'Lollipop', 'Toy', 'Plate', 'Food', 'Presentation', 'Cooking', 'Kitchen', 'Utensil', 'Macarons', 'Chopsticks', 'Cutlery', 'Spoon', 'Cup', 'Bowl', 'Scissors', 'Tape', 'Clothing', 'Coat']

Version 2:
[ 'Candy', 'Food', 'Sweets', 'Lollipop', 'Toy', 'Plate', 'Food', 'Presentation', 'Cooking', 'Kitchen', 'Utensil', 'Macarons', 'Chopsticks', 'Cutlery', 'Spoon', 'Cup', 'Bowl', 'Scissors', 'Tape', 'Clothing', 'Coat', "hey y'all, it's Liz!", "Hey, today we're going to be making a real easy mandarin orange cake.", "The first thing you're going to need to do is pre-heat your oven to 350 degrees.", "You're also going to need a 9 by 13 greased pan.", "You're going to need one can mandarin and oranges, and this is a 15 ounce can.", "You're going to need a box of yellow cake mix.", "You're going to need 1/2 a cup of vegetable oil, and you're also going to need four eggs.", 'Okay, be right back.', 'Okay, so in our mixing bowl we have our cake mix and our mandarin oranges.', "We're also going to go ahead and pour in our vegetable oil and also our four eggs, and then we're just going to mix this up with a hand mixer and pour it in our pan.", "Okay, so now that I've got it all mixed up, what I'm going to do now is pour it in my 9 by 13 pan.", "We're going to cook this for 30 to 35 minutes.", "I'm going to start checking it at the 30 minute mark by inserting a skewer into the center of the cake to see if it comes out clean.", "Okay, I'll check back with you in just a little bit.", "Ok, so to make the topping for this cake, you're going to need a 20 ounce can of crushed pineapples that's been drained.", "You're also going to need a small package of jell-o instant pudding and pie filling, and you're going to mix these two together.", "Then you're going to fold in an 8 ounce tub of Cool Whip, and once you do that, when the cake comes out and it has completely cooled (I just lost my train of thought) completely cooled, then we're going to frost it with this that we just made up.", "Okay, I'll see you guys back here then.", 'Bye.', "Okay, so here's the cake.", 'I let it cook for 30 minutes and I tested it, and it was done.', "So now that I've let this cool for a minute or for like an 
hour, so it's cool, I'm going to go ahead and make the topping.", "And already I have the vanilla pudding inside, and I just added the pineapples, the draining pineapples, and we're just going to mix this to combined it, and then we're going to fold in all with topping.", "Now that the topping is all mixed up, we're just going to take it and spread it all over the cake, and then we're going to refrigerate it and be done.", 'Ok, so here is our easy orange mandarin cake.', "I'm going to stick this in the frigerator because I'm not ready to have a piece yet, but it was very easy to make.", "I hope you guys enjoy this recipe, and this is the last cake I'm going to make before vacation, and I will not make another video on desserts or anything like that until I get to San Antonio the second week in February, and I'll be filming from Wayne's house then.", 'Ok, see you guys later.', 'Bye.', '.']

This is how I fetch the topics:
topic_model.find_topics(docs, top_n=10)

Can anyone help me understand why I get two different sets of topics when I pass Version 1 and Version 2 as docs to the find_topics() method?

MaartenGr · 2024-02-05T07:30:10Z

MaartenGr
Feb 5, 2024
Maintainer

.find_topics is a method for finding topics based on a single term or phrase; it is not meant to be used to predict the topics of documents. For that, I would advise using .transform instead.
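
To make the difference concrete, here is a minimal sketch. It assumes `topic_model` is an already fitted BERTopic model; the helper names `topics_for_term` and `topics_for_documents` are made up for this example. `find_topics` takes one search term and returns topic ids with their similarity scores, while `transform` takes a list of documents and returns one predicted topic per document (plus probabilities).

```python
def topics_for_term(topic_model, search_term, top_n=5):
    """Return (topic_id, similarity) pairs for a single search term or phrase."""
    # find_topics embeds the search term and compares it with the topic embeddings
    topic_ids, similarities = topic_model.find_topics(search_term, top_n=top_n)
    return list(zip(topic_ids, similarities))


def topics_for_documents(topic_model, docs):
    """Return one predicted topic id per document in docs."""
    # transform predicts a topic for every document; probabilities are ignored here
    topics, _probs = topic_model.transform(docs)
    return list(topics)
```

For example, `topics_for_term(topic_model, "cake recipe")` would search for topics related to that phrase, whereas `topics_for_documents(topic_model, docs)` would assign a topic to each entry in `docs`.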

9 replies

owwdesilva Feb 10, 2024
Author

Thank you for your response and the clear explanation @MaartenGr.

I have retrained my model and was able to resolve the issue. I was using a previously trained model for my testing, and it looks like I forgot to include prediction_data=True there. Sorry for the confusion.

That said, I now have a clear understanding of both .find_topics() and .transform(). Unfortunately, I'm unable to use .transform() to meet my requirement.

I'm trying to find high-level topics for a transcript while also injecting a set of labels. I have figured out that it is not feasible to find topics for a combination of labels and docs. As an example, I selected one YouTube video and extracted its transcript. I also extracted a set of labels (objects, actions, etc.) from the video. I was trying to use both the transcript and the label data to find the overall context of the video.

I would appreciate any suggestions or feedback.

MaartenGr Feb 10, 2024
Maintainer

Glad to hear that the initial issue was resolved!

> I'm trying to find high-level topics for a transcript while also injecting a set of labels. I have figured out that it is not feasible to find topics for a combination of labels and docs. As an example, I selected one YouTube video and extracted its transcript. I also extracted a set of labels (objects, actions, etc.) from the video. I was trying to use both the transcript and the label data to find the overall context of the video.

Could you go a bit deeper into what exactly is not working? You mention that it is not feasible but did not mention why it is not feasible to use .transform. The method expects some input document and attempts to predict the (distribution of) topic(s).

owwdesilva Feb 18, 2024
Author

Thanks for your continued replies and support.
As I understand it, .transform provides a topic for each document in the corpus. In this case, a document is either a sentence or a single word.

Below is an example corpus that matches my original data set.

[
    "Candy", 
    "Food", 
    "Kitchen",
    "Toy",
    "Plate", 
    "Food",
    "hey y'all, it's Liz!",
    "Hey, today we're going to be making a real easy mandarin orange cake.",
    "The first thing you're going to need to do is pre-heat your oven to 350 degrees.",
    "You're also going to need a 9 by 13 greased pan.",
    "You're going to need one can mandarin and oranges, and this is a 15 ounce can.",
    "You're going to need a box of yellow cake mix.",
    "Okay, so here's the cake.",
    "I let it cook for 30 minutes and I tested it, and it was done.",
    "So now that I've let this cool for a minute or for like an hour, so it's cool, I'm going to go ahead and make the topping.",
    "And already I have the vanilla pudding inside, and I just added the pineapples, the draining pineapples, and we're just going to mix this to combined it, and then we're going to fold in all with topping.",
    "Now that the topping is all mixed up, we're just going to take it and spread it all over the cake, and then we're going to refrigerate it and be done.",
    "Ok, so here is our easy orange mandarin cake.",
    "I'm going to stick this in the frigerator because I'm not ready to have a piece yet, but it was very easy to make.",
    "I hope you guys enjoy this recipe, and this is the last cake I'm going to make before vacation, and I will not make another video on desserts or anything like that until I get to San Antonio the second week in February, and I'll be filming from Wayne's house then."
]

This corpus was generated by combining the transcript of a video with labels detected from video frames. In the above data set, "Candy", "Food", "Kitchen", "Toy", "Plate", "Food" are the detected labels, and the rest is the sentence-tokenized transcript.

When I pass this corpus to detect topics using .transform, I'm getting a topic for each document.

However, my requirement is not to generate a topic for each document; I need to find the most suitable topics (top 5 or 10) for the entire corpus.

BTW,
I have observed that when I pass my corpus as a single document (concatenating transcript + labels) into .find_topics(), I get a set of topics that looks almost aligned with my expected output. But when I change the order of concatenation (Order A: transcript + labels, Order B: labels + transcript), I get different outputs.

MaartenGr Feb 18, 2024
Maintainer

> However, my requirement is not to generate a topic for each document; I need to find the most suitable topics (top 5 or 10) for the entire corpus.

It depends on what you mean by suitable, but generally you could use .transform to predict the distribution of topics for each document and then aggregate these probabilities over the entire corpus. That way, you could get the top n most probable topics across the entire corpus.

> BTW,
> I have observed that when I pass my corpus as a single document (concatenating transcript + labels) into .find_topics(), I get a set of topics that looks almost aligned with my expected output. But when I change the order of concatenation (Order A: transcript + labels, Order B: labels + transcript), I get different outputs.

This is a result of the method underlying .find_topics. It creates an embedding of the input so changing the order is likely to change the embedding.
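
A rough sketch of the aggregation idea in plain Python. It assumes you already have a per-document topic probability matrix `probs` (one row per document, one column per topic), for example the second value returned by .transform on a model created with calculate_probabilities=True. The helper name `top_topics_for_corpus` and the toy numbers are invented for illustration, and column i is assumed to correspond to topic id i. Because the scores are averaged over documents, the ranking does not depend on document order, unlike the embedding of one concatenated string.

```python
from math import fsum


def top_topics_for_corpus(probs, top_n=5):
    """Average per-document topic probabilities and return the top_n topics.

    probs: list of rows, one per document; row[i] is the probability of topic i.
    Returns a list of (topic_index, mean_probability), highest mean first.
    """
    if not probs:
        return []
    n_docs = len(probs)
    n_topics = len(probs[0])
    # fsum returns an exactly rounded sum, so the mean is the same for any document order
    means = [fsum(row[t] for row in probs) / n_docs for t in range(n_topics)]
    ranked = sorted(range(n_topics), key=lambda t: means[t], reverse=True)
    return [(t, means[t]) for t in ranked[:top_n]]


# Toy probabilities for 3 documents over 3 topics (made-up numbers)
toy_probs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
]
print(top_topics_for_corpus(toy_probs, top_n=2))
```

Averaging is only one possible aggregation; summing, or weighting label documents differently from transcript sentences, would be equally reasonable depending on what "suitable" means for your use case.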

Answer selected by owwdesilva

owwdesilva Feb 21, 2024
Author

Thank you again for the clarifications. Much appreciated.

I have one other question about the way I used the BERTopic model. I trained the BERTopic model on a large corpus (e.g. 10,000 articles), saved it, and subsequently loaded the trained model and passed it a small corpus (e.g. a video transcript + labels) to find topics within it. However, after reviewing the BERTopic documentation, it seems that this training step is not required. I was also able to get results directly by fitting the small corpus to a BERTopic model (it seems they could be further improved by hyperparameter tuning). Is my understanding correct, and at what stages should we train a BERTopic model and save it?

MaartenGr Feb 22, 2024
Maintainer

Generally, you want to train BERTopic on as much data as possible. That often results in the best performance since the model has more context to learn from. Then, you can save the model and use it for inference when you have incoming documents.

If you want online learning, I would advise using either the online approach or the merging approach.
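
The train, save, and load-for-inference workflow described above might look roughly like this. It is a sketch under assumptions: `topic_model` is an unfitted BERTopic instance, `model_class` is the BERTopic class itself, and the helper names `fit_and_save` and `load_and_predict` are hypothetical. Depending on how the embedding model was saved, `load` may also need an `embedding_model` argument.

```python
def fit_and_save(topic_model, training_docs, path):
    """Fit on the large training corpus, then persist the model to path."""
    topic_model.fit(training_docs)
    # safetensors serialization; save_ctfidf=True also stores the c-TF-IDF representation
    topic_model.save(path, serialization="safetensors", save_ctfidf=True)


def load_and_predict(model_class, path, new_docs):
    """Load a previously saved model and predict one topic per incoming document."""
    loaded_model = model_class.load(path)
    topics, _probs = loaded_model.transform(new_docs)
    return list(topics)
```

Usage would be something like `fit_and_save(BERTopic(), articles, "my_model")` once, followed by `load_and_predict(BERTopic, "my_model", transcript_sentences)` whenever new documents arrive, where `articles`, `"my_model"`, and `transcript_sentences` are placeholder names.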

owwdesilva Feb 25, 2024
Author

Thank you Maarten, this is very helpful.
