Topic Modeling (LDA) · Suman Bhadra Notes

Find the themes nobody labelled

Given a big pile of documents, topic modeling discovers the hidden themes that run through them — automatically, with no labels. It's unsupervised learning for text.

The most famous method is LDA — Latent Dirichlet Allocation. Its core assumption is two-sided: every document is a mixture of a few topics, and every topic is a distribution over words. LDA reverse-engineers both from the word counts alone.

Documents, topics, words

Given a small corpus, LDA can discover, say, two topics and then express each document as a blend of those topics.

The two distributions

Topic → words: what a topic is

A "sports" topic puts high probability on goal, team, match; a "tech" topic on app, code, data. Topics are unnamed — you read the top words and name them.

Document → topics: a blend

A single article might be 70% sports, 30% business. Documents aren't forced into one bucket.
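
The two distributions combine into the probability of seeing a word in a document: p(word | doc) is the sum over topics of p(topic | doc) × p(word | topic). A minimal sketch of that arithmetic follows. The topic names, the four-word vocabulary, and every probability are hand-picked assumptions for illustration, not values learned from data, and word_probability is a hypothetical helper.

```python
# Illustrative numbers only: hand-picked, not learned by LDA.
# Topic -> words: each topic is a probability distribution over the vocabulary.
topic_word = {
    "sports":   {"goal": 0.4, "team": 0.4, "market": 0.1, "profit": 0.1},
    "business": {"goal": 0.1, "team": 0.2, "market": 0.4, "profit": 0.3},
}

# Document -> topics: one article as a 70/30 blend.
doc_topic = {"sports": 0.7, "business": 0.3}


def word_probability(word, doc_topic, topic_word):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topic_word[topic].get(word, 0.0)
        for topic, weight in doc_topic.items()
    )


p_team = word_probability("team", doc_topic, topic_word)      # 0.7*0.4 + 0.3*0.2
p_market = word_probability("market", doc_topic, topic_word)  # 0.7*0.1 + 0.3*0.4
print(round(p_team, 2), round(p_market, 2))  # 0.34 0.19
```

Because the article leans towards sports, "team" comes out more likely than "market", but the business share still contributes to both.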

How it learns

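LDA starts with random topic assignments and then iteratively reassigns each word's topic, favoring topics that are already common in that document and that already give the word high probability. This describes collapsed Gibbs sampling; variational inference, the other common fitting method, instead optimizes an approximation to the same distributions. Either way, training runs until the assignments or estimates settle.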
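
The reassignment loop described above can be sketched as a toy collapsed Gibbs sampler in plain Python. This is an illustration under stated assumptions, not a production implementation: the function name gibbs_lda, the smoothing values alpha and beta, the iteration count, and the tiny corpus are all made up for the example, and there is no convergence check.

```python
import random


def gibbs_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)

    # Count tables that summarise the current topic assignments.
    doc_topic = [[0] * num_topics for _ in docs]                       # n[d][k]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]  # n[k][w]
    topic_total = [0] * num_topics                                     # n[k]

    # Step 1: give every word occurrence a random topic.
    assign = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_doc)

    # Step 2: repeatedly resample the topic of each occurrence.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove this occurrence from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight of topic t = (document's mix) * (topic's word profile).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimate of each document's topic mix.
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    # Three most frequent words per topic, for a human to read and name.
    top_words = [
        sorted(vocab, key=lambda w: -topic_word[k][w])[:3]
        for k in range(num_topics)
    ]
    return theta, top_words


docs = [
    "goal team match team goal".split(),
    "match goal team match".split(),
    "app code data code app".split(),
    "data app code data".split(),
    "team match app code".split(),
]
theta, top_words = gibbs_lda(docs, num_topics=2)
for k, words in enumerate(top_words):
    print("topic", k, words)
for d, mix in enumerate(theta):
    print("doc", d, [round(p, 2) for p in mix])
```

The output depends on the random seed, and topic numbers are arbitrary labels, so which index ends up "sports" or "tech" can differ between runs.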

Using it well

Great for

Exploring a large unlabelled corpus
Organizing news, reviews, support tickets
Features for downstream models
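
As one hedged illustration of the "features" use, a document's topic mix can be treated as a short numeric vector and compared with other documents. The document names and mixtures below are hypothetical stand-ins for the output of a fitted model, and cosine and most_similar are helper names invented for this sketch.

```python
import math

# Hypothetical document-topic mixtures, e.g. as produced by a fitted LDA model.
doc_features = {
    "match_report": [0.9, 0.1],
    "transfer_news": [0.7, 0.3],
    "app_review": [0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(name, features):
    """Return the other document whose topic mix is closest to `name`."""
    return max(
        (other for other in features if other != name),
        key=lambda other: cosine(features[name], features[other]),
    )


print(most_similar("match_report", doc_features))  # transfer_news
```

The same vectors could equally be fed to a classifier or a clustering step; similarity search is just the shortest example.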

Caveats

You must pick the number of topics K
Topics need human interpretation
Bag-of-words based, so it ignores word order; embedding-based methods (e.g. BERTopic) often do better
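
One common aid for the first two caveats is a coherence score: fit models with several values of K, score each topic's top words by how often they appear together in documents, and then read the best-scoring candidates by hand. Below is a minimal sketch of a UMass-style score. The function name umass_coherence, the toy corpus, and the two word lists are assumptions for illustration, and the code assumes every listed word occurs in at least one document.

```python
import math


def umass_coherence(top_words, docs):
    """UMass-style coherence: higher means the top words co-occur more often."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            # +1 smoothing avoids log(0) when two words never co-occur.
            score += math.log(
                (doc_freq(top_words[m], top_words[l]) + 1) / doc_freq(top_words[l])
            )
    return score


docs = [
    "goal team match team goal".split(),
    "match goal team match".split(),
    "app code data code app".split(),
    "data app code data".split(),
    "team match app code".split(),
]
coherent = umass_coherence(["goal", "team", "match"], docs)
mixed = umass_coherence(["goal", "app", "data"], docs)
print(coherent > mixed)  # True
```

A score like this only ranks candidates; it does not replace reading the top words and deciding whether a topic means anything.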