TopicTuner - HDBSCAN Tuning for BERTopic #788
Replies: 2 comments
-
@MaartenGr I have continued working on TopicTuner. It is now "round trip": you can start with a BERTopic model and produce one after tuning. It works stand-alone but is now better integrated. I have done a good amount of testing with multiple datasets and hundreds (probably thousands) of generated models, including newsgroup, BBC, newsarticle and the "issues" from this repository. So far all my experiments (with the default UMAP/HDBSCAN BERTopic settings) have consistently shown the following:
While BERTopic currently provides solutions for these problems, from what I can tell a tuned model will produce better results than what is currently available or easy to do. I have yet to see any situation where the downsides of tuning aren't offset by the resulting improvement of the model.

Here is (I think) a good and compelling example using a randomly selected 2000-document subset of the BBC dataset. Here is the default BERTopic model (all using t-SNE 2D projections, which in my experience and opinion are better for evaluation than UMAP 2D projections):

After experimenting with TopicTuner I determined that the "best" number of topics for this corpus was 9. I then used the default BERTopic settings and set

For this dataset, again in my opinion, this is a problematic configuration. A large number of documents are excluded, and not only are they excluded (not necessarily a problem), but the ones excluded have imbalanced the model. We can see this from a TopicTuned model with the least number of -1 categorized documents for 9 categories:

Tuning this took on the order of 5-10 additional minutes on top of the 5-10 minutes it took to generate the default model in the first place. There are some significant differences, but the big and compelling one is how the default model created 4 and 7 in the

I am actively seeking cases that will challenge what I've found so far. I'm not aware of any systematic way to show that this approach is better or worse than not using it. If I find cases that contradict the above I will let the community know. In the meantime, if you or others have corpora that are causing modeling difficulties, I would love to see what a bit of tuning would do to improve the situation.
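For anyone who wants to see the selection idea in code, here is a minimal sketch of the step described above: among several clustering runs, keep those that hit a target number of topics and pick the one with the fewest -1 (uncategorized) documents. This is not TopicTuner's actual API. The names `RunSummary`, `summarize_run` and `pick_best` are hypothetical, and the label lists are toy data made up for illustration, not results from the BBC experiment. The only thing assumed about HDBSCAN is that it labels outliers as -1.

```python
from collections import Counter
from typing import NamedTuple, Optional, Sequence


class RunSummary(NamedTuple):
    """Summary of one clustering run (hypothetical helper type)."""
    min_cluster_size: int
    min_samples: int
    n_topics: int
    n_uncategorized: int


def summarize_run(min_cluster_size: int, min_samples: int,
                  labels: Sequence[int]) -> RunSummary:
    """Count topics and -1 (outlier) documents in one run's labels."""
    counts = Counter(labels)
    n_uncategorized = counts.get(-1, 0)
    # Every distinct label except -1 is a topic.
    n_topics = len(counts) - (1 if -1 in counts else 0)
    return RunSummary(min_cluster_size, min_samples, n_topics, n_uncategorized)


def pick_best(runs: Sequence[RunSummary],
              target_topics: int) -> Optional[RunSummary]:
    """Among runs with exactly target_topics topics, return the one
    with the fewest uncategorized documents (None if no run matches)."""
    candidates = [r for r in runs if r.n_topics == target_topics]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.n_uncategorized)


# Toy label lists standing in for the output of three clustering runs.
toy_runs = [
    summarize_run(10, 10, [0, 0, 1, 1, -1, -1, -1, 2]),
    summarize_run(10, 2, [0, 0, 1, 1, 2, 2, -1, 2]),
    summarize_run(25, 5, [0, 0, 0, 1, 1, 1, -1, -1]),
]

best = pick_best(toy_runs, target_topics=3)
print(best)
```

In a real search the label lists would come from fitting HDBSCAN on the reduced embeddings once per parameter pair. The point of the sketch is only the selection rule: fix the topic count first, then minimize the number of -1 documents.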
-
I've released a new version of TopicTuner. It is mainly notable because it is now registered with PyPI and can be installed directly via pip. I also refactored the classes, added test cases, added some convenience functions, and updated the docs. Take it for a spin on Colab.
-
If you would like an efficient solution for tuning HDBSCAN for your BERTopic models, please take a look at TopicTuner. With it you can drastically reduce the number of uncategorized documents, and it gives you an alternative method of selecting a given number of topics for your model. Feedback gratefully accepted.