BERTopic is a topic modelling approach powered by BERT.
https://maartengr.github.io/BERTopic/index.html#fine-tune-topic-representations
Running BERTopic
- It’s best not to set `nr_topics`; let the hierarchical clustering algorithm (HDBSCAN by default) work out the number of topics.
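
A minimal sketch of fitting without a fixed topic count, assuming `bertopic` is installed and `docs` is a list of strings. The function name is hypothetical, and the imports sit inside the function only so the snippet can be loaded without the heavy dependencies.

```python
def fit_without_fixed_topic_count(docs):
    """Fit BERTopic and let clustering decide how many topics there are."""
    # Imported lazily so this module loads even when bertopic is absent.
    from bertopic import BERTopic

    # nr_topics is left at its default (None), so no topic reduction
    # is applied after the clustering step.
    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(docs)

    # One row per discovered topic; topic -1 collects outlier documents.
    return topic_model, topic_model.get_topic_info()
```

Calling `fit_without_fixed_topic_count(docs)` on a reasonably large corpus (a few hundred documents or more) returns the fitted model plus a table showing how many topics were found.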
Pre-calculating embeddings
The authors suggest in their best practices documentation that you pre-calculate your document embeddings, so you can re-fit the model with different settings without re-embedding the documents each time.
They recommend trying models from the MTEB Leaderboard.
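
Roughly, pre-calculating looks like the sketch below. It assumes `sentence-transformers` and `bertopic` are installed; `all-MiniLM-L6-v2` is only a placeholder model name, and the function name is hypothetical.

```python
def fit_with_precomputed_embeddings(docs, model_name="all-MiniLM-L6-v2"):
    """Embed the documents once, then hand the embeddings to BERTopic."""
    # Imported lazily so this module loads without the heavy dependencies.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    # model_name is a placeholder; swap in any sentence-transformers
    # compatible model picked from the MTEB Leaderboard.
    embedding_model = SentenceTransformer(model_name)

    # The expensive step: run it once and reuse the resulting array.
    embeddings = embedding_model.encode(docs, show_progress_bar=True)

    # Passing embeddings makes BERTopic skip its own embedding step.
    topic_model = BERTopic(embedding_model=embedding_model)
    topics, probs = topic_model.fit_transform(docs, embeddings)
    return topic_model, topics, embeddings
```

The returned `embeddings` array can be kept around (for example saved with `numpy.save`) and passed to later `fit_transform` calls while other settings are tuned.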
Using Nomic Embedding Model
- Requires `trust_remote_code=True`, which is scary: it runs Python code from the model repository on your machine.
- Also requires `pip install einops`.
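
A sketch of loading the model through `sentence-transformers`, assuming a version recent enough to accept `trust_remote_code`. These notes don't say which Nomic model was used, so `nomic-ai/nomic-embed-text-v1.5` is an assumption. The `clustering: ` task prefix is based on my reading of the Nomic model card and should be checked there. Both function names are hypothetical.

```python
def add_task_prefix(docs, task="clustering"):
    """Prepend a task prefix to every document.

    Assumption: the Nomic model expects prefixes such as 'clustering: '.
    Verify against the model card before relying on this.
    """
    return [f"{task}: {doc}" for doc in docs]


def load_nomic_embedder(model_name="nomic-ai/nomic-embed-text-v1.5"):
    """Load a Nomic embedding model (also needs `pip install einops`)."""
    # Imported lazily so this module loads without sentence-transformers.
    from sentence_transformers import SentenceTransformer

    # trust_remote_code=True allows code shipped in the model repository
    # to execute locally, so only use it with repositories you trust.
    return SentenceTransformer(model_name, trust_remote_code=True)
```

Usage would be along the lines of `load_nomic_embedder().encode(add_task_prefix(docs))`, with the resulting array passed to BERTopic as pre-calculated embeddings.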
Customising Representation and Stopword removal
The best practices doc recommends customising the `CountVectorizer` instance used by the `BERTopic` instance to remove stopwords:
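
As a minimal sketch of that setup, assuming `scikit-learn` and `bertopic` are installed: the `min_df` and `ngram_range` values are illustrative rather than required, and the function name is hypothetical.

```python
def build_topic_model_without_stopwords():
    """Create a BERTopic model whose topic words exclude English stopwords."""
    # Imported lazily so this module loads without the heavy dependencies.
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    # The vectorizer only shapes the topic representations (the words
    # shown per topic); the embeddings still see the full, raw documents.
    vectorizer_model = CountVectorizer(
        stop_words="english",  # drop common English stopwords
        min_df=2,              # illustrative: ignore terms seen in fewer than 2 documents
        ngram_range=(1, 2),    # illustrative: allow unigrams and bigrams
    )
    return BERTopic(vectorizer_model=vectorizer_model)
```

Because stopwords are removed only at the representation stage, the documents themselves do not need to be cleaned before embedding.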