TopicTuner - HDBSCAN Tuning for BERTopic #788

drob-xx · 2022-10-19T22:56:37Z

If you would like an efficient solution for tuning HDBSCAN for your BERTopic models, please take a look at TopicTuner. With it you can drastically reduce the number of uncategorized documents, and it gives you an alternative method of selecting a given number of topics for your model. Feedback gratefully accepted.

drob-xx · 2022-11-15T17:05:09Z

@MaartenGr I have continued working on TopicTuner. It is now "round trip": you can start with a BERTopic model and produce one after tuning. It still works stand-alone but is now better integrated.

I have done a good amount of testing with multiple datasets and hundreds (probably thousands) of generated models including newsgroup, BBC, newsarticle and the "issues" from this repository. So far all my experiments (with the default UMAP/HDBSCAN BERTopic settings) have consistently shown the following:

Tuning for optimized results (defined as the lowest number of -1 categorized documents for the desired number of document clusters) has so far always produced what I believe are superior topic clusters. As you know, measuring this is essentially subjective, but the results I have seen so far are pretty clear and compelling (I have included a representative example below). In other posts I have provided examples as well as running code.
IMO TopicTuner provides a great solution to two recurring issues that BERTopic users often bring up in the issues/discussions queue: 1) the selection of a "good" number of topics; and 2) the minimization of the -1 category. I haven't tested this extensively, but in quite a few cases I have also been able to get very nice models with as few as a couple of dozen documents, something that has come up now and again in the discussions.
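
To make the "optimized" criterion above concrete, here is a minimal sketch of a random search over HDBSCAN's `min_cluster_size` and `min_samples` that keeps the run with the fewest -1 documents among runs yielding the desired number of clusters. This is not TopicTuner's API: the function names, the search ranges, and the choice to keep `min_samples` at or below `min_cluster_size` are all assumptions made for illustration. It also assumes `embeddings` is an array of already reduced document embeddings (for example UMAP output) and that the `hdbscan` package is installed.

```python
import random
from collections import Counter


def summarize(labels):
    """Return (n_clusters, n_outliers) for HDBSCAN-style labels, where -1 marks an outlier."""
    counts = Counter(int(label) for label in labels)
    n_outliers = counts.pop(-1, 0)
    return len(counts), n_outliers


def pick_best(trials, target_clusters):
    """Among trials with exactly target_clusters clusters, return the one with the fewest outliers."""
    candidates = [t for t in trials if t["n_clusters"] == target_clusters]
    return min(candidates, key=lambda t: t["n_outliers"]) if candidates else None


def random_search(embeddings, target_clusters, n_trials=50, max_cluster_size=100, seed=0):
    """Hypothetical helper: randomly sample HDBSCAN parameters and record the outcome of each run."""
    import hdbscan  # third-party package, imported lazily so the helpers above work without it

    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        # Assumed search ranges; adjust them to the size of your corpus.
        min_cluster_size = rng.randint(5, max_cluster_size)
        min_samples = rng.randint(1, min_cluster_size)
        clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples)
        labels = clusterer.fit_predict(embeddings)
        n_clusters, n_outliers = summarize(labels)
        trials.append({
            "min_cluster_size": min_cluster_size,
            "min_samples": min_samples,
            "n_clusters": n_clusters,
            "n_outliers": n_outliers,
        })
    return pick_best(trials, target_clusters), trials
```

Calling `random_search(embeddings, target_clusters=9)` returns the best trial (or `None` if no run produced exactly 9 clusters) together with the full list of trials, which is handy for seeing how the number of clusters and outliers vary across the parameter space.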

While BERTopic currently provides solutions for these problems, from what I can tell a tuned model will produce better results than what is currently available or easy to do. I have yet to see any situation where the downsides of tuning aren't offset by the resulting improvement of the model.

Here is (I think) a good and compelling example using a randomly selected 2000-document subset of the BBC dataset. Here is the default BERTopic model (all plots use t-SNE 2D projections, which in my experience and opinion are better for evaluation than UMAP 2D projections):

After experimenting with TopicTuner I determined that the "best" number of topics for this corpus was 9. I then used the default BERTopic settings and set nr_topics=9, resulting in:

For this dataset, again in my opinion, this is a problematic configuration. A large number of documents are excluded, which is not necessarily a problem in itself, but the ones excluded have imbalanced the model. We can see this by comparing against a model tuned with TopicTuner that has the fewest -1 categorized documents at 9 categories:

The amount of work that went into tuning this was on the order of 5-10 minutes, on top of the 5-10 minutes that it took to generate the default model in the first place.
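
Once a good parameter pair has been found, it can be handed back to BERTopic through its `hdbscan_model` argument. The sketch below is one way to do that by hand and is not how TopicTuner itself produces a model: `tuned_hdbscan_kwargs` and `build_tuned_model` are hypothetical helper names. The `metric`, `cluster_selection_method` and `prediction_data` values are meant to mirror BERTopic's default HDBSCAN configuration as I understand it, so check them against the version you have installed.

```python
def tuned_hdbscan_kwargs(min_cluster_size, min_samples):
    """Build HDBSCAN keyword arguments, changing only the two tuned values."""
    return {
        "min_cluster_size": min_cluster_size,
        "min_samples": min_samples,
        # Assumed to match BERTopic's default HDBSCAN settings.
        "metric": "euclidean",
        "cluster_selection_method": "eom",
        "prediction_data": True,
    }


def build_tuned_model(min_cluster_size, min_samples):
    """Hypothetical helper: return a BERTopic model that clusters with the tuned HDBSCAN."""
    # Third-party packages, imported lazily so the kwargs helper works without them.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN

    hdbscan_model = HDBSCAN(**tuned_hdbscan_kwargs(min_cluster_size, min_samples))
    return BERTopic(hdbscan_model=hdbscan_model)
```

With a list of documents `docs`, `topics, probs = build_tuned_model(40, 5).fit_transform(docs)` would then fit the model; the values 40 and 5 are placeholders, not the parameters used for the BBC example.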

There are some significant differences, but the big and compelling one is how the default model created topics 4 and 7 in the nr_topics version. Those topics are based on very small samples from a much larger cluster, which appears in the tuned model as cluster 0 and is completely missing from the default model. In my opinion the tuned model is a better representation than the default. There are other significant differences as well, but this one is, again in my opinion, sufficient to make the point. This example is entirely consistent with what I've found over months of investigation.

I am actively seeking cases that will challenge what I've found so far. I'm not aware of any systematic way to show that this approach is better or worse than not using it. If I find cases that contradict the above I will let the community know. In the meantime, if you or others have corpora that are causing modeling difficulties, I would love to see what a bit of tuning would do to improve the situation.

drob-xx · 2023-01-12T18:43:00Z

I've released a new version of TopicTuner. It is mainly notable because it is now registered with PyPI and can be installed directly via pip. I also refactored the classes, added test cases, added some convenience functions, and updated the docs. Take it for a spin on Colab.
