Topic Modeling (LDA)
Find the themes nobody labelled
Given a big pile of documents, topic modeling discovers the hidden themes that run through them — automatically, with no labels. It's unsupervised learning for text.
The most famous method is LDA — Latent Dirichlet Allocation. Its core assumption is two-sided: every document is a mixture of a few topics, and every topic is a distribution over words. LDA reverse-engineers both from the word counts alone.
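The two-sided assumption above is usually written as a generative story. The sketch below uses the conventional symbols, none of which appear in the text above: \theta_d for a document's topic mixture, \phi_k for a topic's word distribution, and \alpha, \beta for the Dirichlet prior parameters.

```latex
\begin{aligned}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of document } d \\
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution of topic } k \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) && \text{topic assigned to the } n\text{-th word of } d \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}}) && \text{the observed word}
\end{aligned}
```

Only the words w are observed. Fitting LDA means inferring the hidden \theta, \phi and z that plausibly produced them.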
Documents, topics, words
Watch LDA discover two topics from a small corpus, then express each document as a blend of those topics.
The two distributions
Each topic is a distribution over words. A "sports" topic puts high probability on goal, team, match; a "tech" topic on app, code, data. LDA does not name its topics: you read each topic's top words and supply the label yourself.
Each document is a distribution over topics. A single article might be 70% sports, 30% business. Documents aren't forced into one bucket.
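To see how the two distributions combine, here is a minimal sketch with made-up numbers. The word probabilities, the five-word vocabulary, and the helper names `word_probability` and `top_words` are all invented for illustration; only the 70/30 split comes from the example above. The probability of a word in a document is the topic-weighted average of that word's probability in each topic.

```python
# Toy topic-word distributions (hypothetical numbers; each topic sums to 1).
topics = {
    "sports":   {"goal": 0.40, "team": 0.30, "match": 0.20, "market": 0.05, "profit": 0.05},
    "business": {"goal": 0.05, "team": 0.15, "match": 0.00, "market": 0.40, "profit": 0.40},
}

# The article from the text: 70% sports, 30% business.
doc_mix = {"sports": 0.7, "business": 0.3}


def word_probability(word, doc_mix, topics):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    return sum(weight * topics[name].get(word, 0.0) for name, weight in doc_mix.items())


def top_words(topic, n=3):
    """The n most probable words of a topic: what a human reads to name it."""
    return sorted(topic, key=topic.get, reverse=True)[:n]


vocabulary = list(topics["sports"])
word_probs = {w: word_probability(w, doc_mix, topics) for w in vocabulary}

# Example: "goal" gets 0.7 * 0.40 + 0.3 * 0.05 = 0.295 in this document.
for word, prob in word_probs.items():
    print(f"{word:>7}: {prob:.3f}")
print("sports topic reads as:", top_words(topics["sports"]))
```

The blended probabilities still sum to 1, so the document has a proper word distribution of its own, built from its topics.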
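LDA is usually fitted with Gibbs sampling or variational inference. In the Gibbs sampling view, it starts with random topic assignments and repeatedly reassigns each word's topic based on two things: how common that topic already is in the word's document, and how strongly that topic favors the word. This repeats until the assignments settle.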
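The reassignment loop can be sketched as a small collapsed Gibbs sampler using only the standard library. This is an illustrative toy under stated assumptions, not a production implementation: the function name `gibbs_lda`, the smoothing values `alpha` and `beta`, the iteration count, and the five-document corpus are all hypothetical choices.

```python
import random


def gibbs_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA. docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    # Count tables that summarize the current topic assignments.
    doc_topic = [[0] * num_topics for _ in docs]            # topic counts per document
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                          # total words per topic

    # Step 1: give every word a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    # Step 2: repeatedly resample each word's topic.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the word's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Weight of topic k = (how much doc d uses k) * (how much k favors w).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                # Record the new assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed estimate of each document's topic mixture.
    doc_mix = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return doc_mix, topic_word


# Hypothetical mini-corpus: two sports-like docs, two tech-like docs, one mixed.
docs = [
    "goal team match team goal".split(),
    "match goal team match".split(),
    "app code data code app".split(),
    "data app code data".split(),
    "team goal app code".split(),
]

doc_mix, topic_word = gibbs_lda(docs, num_topics=2)

for k, counts in enumerate(topic_word):
    top = sorted(counts, key=counts.get, reverse=True)[:3]
    print(f"topic {k}: {top}")
for d, mix in enumerate(doc_mix):
    print(f"doc {d}: {[round(p, 2) for p in mix]}")
```

The topic indices carry no meaning and can swap between seeds, which is why the top words still need a human to name them. A corpus this small is also noisy; real use would typically rely on a tested library rather than a hand-rolled sampler.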
Using it well
Good for:
- Exploring a large unlabelled corpus
- Organizing news, reviews, support tickets
- Generating features for downstream models
Watch out for:
- You must pick the number of topics K
- Topics need human interpretation
- Bag-of-words based, so it ignores word order; embedding-based methods such as BERTopic often do better