Recently, pretrained transformer models such as BERT (Devlin et al., 2019) have shown strong performance in generating sentence embeddings for text clustering (Reimers and Gurevych, 2019). Since then, many clustering approaches have leveraged the dense vector representations produced by pre-trained bidirectional transformers. Khaustov et al. (2021) proposed two BERT-based methods for news clustering. Subakti et al. (2020) compared BERT and TF-IDF as data representations for text clustering and found that BERT outperformed TF-IDF on most of the metrics explored. Shi et al. (2020) developed a self-supervised document clustering approach based on BERT. Mehta et al. (2021) combined TF-IDF and BERT to improve clustering performance on large datasets.
In this section, we introduce a method that combines dense vector representations with sparse vectors to improve the interpretability of clustering results. To the best of our knowledge, only BERTopic (Grootendorst, 2022) combines SBERT (Reimers and Gurevych, 2019) and TF-IDF, for sentence embeddings and cluster explainability respectively: it uses pre-trained transformer-based language models to generate document embeddings and a class-based TF-IDF to build topic representations. Our approach also combines the strengths of pre-trained transformers and TF-IDF, but we use TF-IDF to create relevant features that improve the interpretability of cluster visualizations.
Inspired by the topic model BERTopic (Grootendorst, 2022), we used SBERT (Reimers and Gurevych, 2019) to convert documents into dense vector representations. Then, we used those embeddings to cluster semantically similar documents. To make the results of our clustering interpretable, we applied TF-IDF to generate the features for our word cloud visualizations.
We applied the same preprocessing steps as in the TF-IDF configuration, except that we did not tokenize, lemmatize, or remove stop words. Concatenating the tweets produced documents of more than 512 tokens (the maximum context length of SBERT), so we split each document into chunks of 300 words and converted each chunk into a dense vector using SBERT. To reduce the dimensionality, we experimented with PCA, t-SNE, and UMAP. We chose UMAP for two reasons: it performed best in terms of purity and silhouette score, and Allaoui et al. (2020) showed that it can improve k-means performance. As we wanted our results to be comparable with the bag-of-words model, we applied the best TF-IDF clustering configuration (k-means with random starts, cosine metric, and six clusters). Finally, to return to the document-level setup used in the previous method, we aggregated all chunks of tweets from each author and assigned each author to a cluster by majority vote over their chunks.
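For concreteness, the sketch below illustrates one possible implementation of this pipeline with the sentence-transformers, umap-learn, and scikit-learn libraries. The SBERT checkpoint, the UMAP output dimensionality, and placeholder names such as `author_docs` are illustrative assumptions rather than the exact settings of our experiments.

```python
# Hedged sketch of the chunking, embedding, reduction, and clustering steps.
# `author_docs` (author id -> concatenated tweets) is a placeholder structure.
import numpy as np
from collections import Counter
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
import umap

CHUNK_SIZE = 300   # words per chunk (documents exceed SBERT's 512-token limit)
N_CLUSTERS = 6     # same k as the best TF-IDF configuration

def chunk_words(text, size=CHUNK_SIZE):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

chunks, chunk_author = [], []
for author, doc in author_docs.items():
    for chunk in chunk_words(doc):
        chunks.append(chunk)
        chunk_author.append(author)

# Dense chunk embeddings with SBERT (checkpoint name is an assumption)
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks, show_progress_bar=True)

# UMAP dimensionality reduction; the output dimensionality is an assumption
reduced = umap.UMAP(n_components=5, metric="cosine",
                    random_state=42).fit_transform(embeddings)

# k-means with random restarts; L2-normalising the vectors is a common way
# to approximate a cosine metric with Euclidean k-means
labels = KMeans(n_clusters=N_CLUSTERS, init="random", n_init=10,
                random_state=42).fit_predict(normalize(reduced))

# Majority vote: assign each author to the most frequent cluster of their chunks
votes = {}
for author, label in zip(chunk_author, labels):
    votes.setdefault(author, []).append(label)
author_cluster = {a: Counter(ls).most_common(1)[0][0] for a, ls in votes.items()}
```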
Figure x summarizes the different steps involved in the preprocessing and modeling of the SBERT configuration. Average silhouette scores and purity scores for each cluster are presented in Table x; a sketch of how these per-cluster metrics can be computed follows the table. Based on those metrics, the vectors generated by SBERT lead to higher-quality clusters than the ones generated using TF-IDF. In terms of silhouette score, the clusters are denser and better separated in the dense vector configuration than with sparse vectors, which indicates better cohesion and separation. In terms of purity, there is a six-point difference between the SBERT configuration and the TF-IDF one, which suggests that the contextualized representations generated by SBERT (without any fine-tuning) correlate better with the labels of our dataset.
Fig. x: Preprocessing and modeling steps for the SBERT setup.
| Cluster | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Silhouette | 0.49 | 0.62 | 0.46 | 0.70 | 0.80 | 0.46 |
| Purity | 0.92 | 0.65 | 0.95 | 0.78 | 0.66 | 0.80 |
Table x: Per-cluster evaluation of the clustering results using purity and silhouette scores, with SBERT used to encode the documents.
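As a complement to Table x, the following sketch shows one way to compute such per-cluster scores with scikit-learn. The variables `reduced`, `labels`, and `true_labels` refer to the reduced embeddings, the k-means assignments, and the dataset's ground-truth labels from the pipeline sketch above; whether the silhouette is computed on the reduced or the original embeddings is an assumption of this sketch.

```python
# Hedged sketch: per-cluster average silhouette and purity, assuming
# `reduced`, `labels`, and `true_labels` are aligned arrays.
import numpy as np
from sklearn.metrics import silhouette_samples

sil = silhouette_samples(reduced, labels, metric="cosine")
true_labels = np.asarray(true_labels)
for k in range(6):  # six clusters
    mask = labels == k
    cluster_silhouette = sil[mask].mean()          # average silhouette of the cluster
    _, counts = np.unique(true_labels[mask], return_counts=True)
    cluster_purity = counts.max() / mask.sum()     # share of the majority label
    print(f"cluster {k}: silhouette={cluster_silhouette:.2f}, "
          f"purity={cluster_purity:.2f}")
```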
The embeddings generated by SBERT cannot be used directly to interpret the different clusters. To create word cloud visualizations, we transformed the tweets into a bag-of-words representation with TF-IDF weights and then applied the methodology described in section 2 (using the newly generated clusters); a sketch of this feature generation follows Figure y. Figure y shows the word cloud visualizations generated with the SBERT configuration.
Fig. y: Word cloud visualizations generated for the alternative clustering configuration (all tweets, 100 features, SBERT used to create the dense vector representations) of the Russian Trolls dataset. The top row shows word clouds built from average values, the bottom row word clouds built from z-scores.
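The sketch below outlines how these word cloud features can be derived: a TF-IDF bag-of-words model restricted to 100 features, per-cluster average weights and z-scores, and rendering with the wordcloud package. The stop-word handling and the variable names `docs` and `doc_cluster` are assumptions made for illustration.

```python
# Hedged sketch: TF-IDF features per cluster for the word cloud visualizations.
# `docs` holds the preprocessed documents and `doc_cluster` the cluster
# assignment of each document (illustrative placeholders).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud

vectorizer = TfidfVectorizer(max_features=100, stop_words="english")  # stop-word list assumed
tfidf = vectorizer.fit_transform(docs).toarray()
terms = vectorizer.get_feature_names_out()
doc_cluster = np.asarray(doc_cluster)

overall_mean = tfidf.mean(axis=0)
overall_std = tfidf.std(axis=0) + 1e-12            # avoid division by zero
for k in range(6):                                  # six clusters
    cluster_mean = tfidf[doc_cluster == k].mean(axis=0)
    weights_avg = dict(zip(terms, cluster_mean))                               # average values
    weights_z = dict(zip(terms, (cluster_mean - overall_mean) / overall_std))  # z-scores
    # Word cloud from average TF-IDF weights; the z-score variant would need
    # negative values clipped to zero before generate_from_frequencies.
    WordCloud(width=600, height=400).generate_from_frequencies(weights_avg) \
        .to_file(f"cluster_{k}_avg.png")
```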
We compared the word clouds generated using the two representations. The word cloud visualizations are very similar, which suggests that the samples composing each cluster are mostly the same whether sparse or dense vectors are used. For example, cluster 2 (SBERT) and cluster 4 (TF-IDF) convey approximately the same information and lead to the same interpretation. Only cluster 0 (SBERT) and cluster 3 (TF-IDF) yield substantially different results, and thus different interpretations (assuming that those are the two clusters that do not have a match in the other configuration).
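One possible way to make this correspondence explicit, rather than relying on visual inspection alone, is to build a contingency matrix between the two sets of cluster assignments and match clusters with the Hungarian algorithm, as in the sketch below; `sbert_labels` and `tfidf_labels` are illustrative placeholders for the per-document assignments of the two configurations.

```python
# Hedged sketch: matching SBERT clusters to TF-IDF clusters by maximising
# their overlap in a contingency matrix (Hungarian algorithm).
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix

cm = contingency_matrix(sbert_labels, tfidf_labels)   # rows: SBERT, columns: TF-IDF
rows, cols = linear_sum_assignment(-cm)               # negate to maximise overlap
for r, c in zip(rows, cols):
    overlap = cm[r, c] / cm[r].sum()
    print(f"SBERT cluster {r} <-> TF-IDF cluster {c}: {overlap:.0%} of its documents")
```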
To summarize, we introduced a method that combines dense vector representations with sparse vectors to improve the interpretability of clustering results. We used SBERT to embed each document into a dense vector and applied UMAP to reduce the dimensionality of those vectors. We then clustered the documents with k-means and applied TF-IDF to the resulting clusters to generate the features for our word cloud visualizations.
- z. https://www.sbert.net/examples/applications/computing-embeddings/README.html
- Allaoui, Mebarka, Mohammed Lamine Kherfi, and Abdelhakim Cheriet. "Considerably improving clustering algorithms using UMAP dimensionality reduction technique: a comparative study." International Conference on Image and Signal Processing. Springer, Cham, 2020.
- Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
- Khaustov, Gorlova, Kalmykov, and Kabaev. "BERT for Russian News Clustering." Computational Linguistics and Intellectual Technologies: Papers from the Annual Conference "Dialogue", Vol. XX, 2021, pp. xx–xx.
- Shi, H., C. Wang, and T. Sakai. "Self-supervised Document Clustering Based on BERT with Data Augment." arXiv preprint arXiv:2011.08523 (2020).
- Grootendorst, Maarten. "BERTopic: Neural topic modeling with a class-based TF-IDF procedure." arXiv preprint arXiv:2203.05794 (2022).
- Mehta, Vivek, Seema Bawa, and Jasmeet Singh. "WEClustering: word embeddings based text clustering technique for large datasets." Complex & Intelligent Systems 7.6 (2021): 3211-3224.
- Reimers, Nils, and Iryna Gurevych. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." arXiv preprint arXiv:1908.10084 (2019).