BERTopic is a topic modelling approach powered by BERT.
https://maartengr.github.io/BERTopic/index.html#fine-tune-topic-representations
## Running BERTopic
```python
import bertopic

topic_model = bertopic.BERTopic(
    language="english",
    top_n_words=10,
    n_gram_range=(1, 2),
    min_topic_size=10,
    calculate_probabilities=True,
)
```
- It’s best not to set `nr_topics`; let the hierarchical clustering algorithm work that out. If you end up with too many topics, you can reduce them after fitting, as sketched below.
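A minimal sketch of after-the-fact reduction using BERTopic's `reduce_topics`, where `documents` is the same list the model was fitted on:

```python
# Merge similar topics on an already-fitted model; "auto" lets the
# topic hierarchy decide how far to merge, or pass an explicit count.
topic_model.reduce_topics(documents, nr_topics="auto")
```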
## Pre-calculating embeddings
The authors propose in their best practices documentation that you pre-calculate your document embeddings.
They recommend trying models from the MTEB Leaderboard.
```python
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(documents, show_progress_bar=True)

# ... initialise the BERTopic model as above
topics, probs = topic_model.fit_transform(documents, embeddings=embeddings)
```
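The fitted model can then be inspected with the usual accessors, e.g.:

```python
# One row per topic: its id, size, and representative words
# (topic -1 collects the outliers).
print(topic_model.get_topic_info())

# Top words and c-TF-IDF scores for a single topic.
print(topic_model.get_topic(0))
```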
## Using Nomic Embedding Model
- Requires `trust_remote_code=True`, which is scary (see the loading sketch below)
- Also requires `pip install einops`
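A minimal loading sketch, assuming the `nomic-ai/nomic-embed-text-v1.5` Sentence Transformers checkpoint; the exact model id and its task-prefix convention are assumptions here, so check Nomic's model card for the version you use:

```python
from sentence_transformers import SentenceTransformer

# trust_remote_code=True is needed because the model ships custom code
# (which is also what pulls in the einops dependency noted above).
embedding_model = SentenceTransformer(
    "nomic-ai/nomic-embed-text-v1.5",  # assumed checkpoint name
    trust_remote_code=True,
)

# Nomic embed models expect a task prefix on each input,
# e.g. "search_document: " for documents being embedded.
embeddings = embedding_model.encode(
    ["search_document: " + doc for doc in documents],
    show_progress_bar=True,
)
```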
## Customising Representation and Stopword removal
The best practices doc recommends customising the `CountVectorizer` instance used by the `BERTopic` instance to remove stopwords:
```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))
```
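The vectorizer then needs to be handed to BERTopic: either at construction time via the `vectorizer_model` parameter, or applied to an already-fitted model with `update_topics`:

```python
import bertopic

# Either wire the vectorizer in up front...
topic_model = bertopic.BERTopic(vectorizer_model=vectorizer_model)

# ...or refresh the topic representations of a model that is already fitted.
topic_model.update_topics(documents, vectorizer_model=vectorizer_model)
```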