
BERTopic is a topic modelling technique that clusters transformer (BERT) document embeddings and describes each cluster with class-based TF-IDF keywords.

https://maartengr.github.io/BERTopic/index.html#fine-tune-topic-representations

Running BERTopic

 
from bertopic import BERTopic

topic_model = BERTopic(language="english", top_n_words=10, n_gram_range=(1, 2), min_topic_size=10, calculate_probabilities=True)
 
  • It’s best not to set nr_topics; let the underlying clustering algorithm (HDBSCAN by default, which is hierarchical and density-based) work out the number of topics. A sketch of fitting and inspecting the result follows below.
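
Once fitted, the model offers helpers for inspecting what it found. A minimal sketch, assuming documents is a list of strings:

topics, probs = topic_model.fit_transform(documents)

# Overview of all topics, ordered by size; topic -1 collects outliers
print(topic_model.get_topic_info())

# Top words and their c-TF-IDF scores for a single topic
print(topic_model.get_topic(0))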

Pre-calculating embeddings

The authors recommend in their best-practices documentation that you pre-compute your document embeddings, so you can iterate on BERTopic's hyperparameters without re-encoding the corpus on every run.

They recommend trying models from the MTEB Leaderboard (https://huggingface.co/spaces/mteb/leaderboard).

from sentence_transformers import SentenceTransformer

# Encode the documents once, up front
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(documents, show_progress_bar=True)

# ... initialise bertopic model

# Pass the pre-computed embeddings so BERTopic skips its own encoding step
topics, probs = topic_model.fit_transform(documents, embeddings=embeddings)
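
Because the embeddings are independent of BERTopic's hyperparameters, it can be worth caching them to disk and reusing them across runs. A minimal sketch with numpy; the file name is illustrative:

import numpy as np

np.save("embeddings.npy", embeddings)   # cache after the first encode
embeddings = np.load("embeddings.npy")  # reuse on later runs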

Using the Nomic Embedding Model

  • Requires trust_remote_code=True, which is scary: it executes Python code shipped in the model repository on your machine, so review that code before enabling it.
  • Also requires pip install einops. A loading sketch follows below.
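
A sketch of loading a Nomic model through sentence-transformers. The model id nomic-ai/nomic-embed-text-v1.5 and the search_document: task prefix are assumptions based on Nomic's published usage, so verify both against the model card:

from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer(
    "nomic-ai/nomic-embed-text-v1.5",  # assumed model id; check the model card
    trust_remote_code=True,  # runs code from the model repo, review it first
)

# Nomic embedding models expect a task prefix on every input
prefixed = ["search_document: " + doc for doc in documents]
embeddings = embedding_model.encode(prefixed, show_progress_bar=True)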

Customising Representation and Stopword Removal

The best-practices doc recommends customising the CountVectorizer instance used by the BERTopic instance to remove stopwords:

 
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))

# Pass the customised vectorizer in when constructing the model
topic_model = BERTopic(vectorizer_model=vectorizer_model)
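
The vectorizer can also be applied after fitting via update_topics, which recomputes the topic keyword representations without re-clustering. A brief sketch:

# Recompute topic keywords with the new vectorizer, without refitting
topic_model.update_topics(documents, vectorizer_model=vectorizer_model)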