Stemming · Suman Bhadra Notes

Collapse a word family to one root

"run", "runs", "running", "ran" all mean roughly the same thing — but to a bag-of-words model they're four different tokens. Stemming chops off endings to fold them toward a single root.

It works by applying crude suffix-stripping rules — remove -ing, -ed, -s, -ly — without knowing any grammar or dictionary. Fast and simple, but rough around the edges.
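
The idea can be sketched in a few lines of Python. This is a minimal illustration, not a real stemmer: the function name naive_stem, the suffix order, and the min_stem guard are assumptions made for this example.

```python
# Suffixes from the text, checked in order; the first match wins.
SUFFIXES = ("ing", "ly", "ed", "s")

def naive_stem(word: str, min_stem: int = 3) -> str:
    """Strip one suffix, keeping at least min_stem characters (hypothetical guard)."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word  # no rule applied

for w in ("jumping", "jumped", "jumps", "quickly", "running", "sing"):
    print(w, "->", naive_stem(w))
```

With these rules, jumping, jumped and jumps all become jump, and sing is left alone because stripping -ing would leave too little. The rough edges show immediately, though: running becomes runn, a non-word that no longer matches run.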

The classic

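The Porter stemmer (1980) is the best known: it runs each word through a cascade of rule phases, each of which strips or rewrites one class of suffixes. Its successor, Snowball, refines the English rules (the revised English stemmer is often called Porter2) and supports many other languages.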
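
To make "a cascade of rule phases" concrete, here is a rough sketch with two phases. The first follows Porter's step 1a (plural endings). The second is a deliberately simplified take on step 1b (-ed and -ing) that skips Porter's measure conditions and several clean-up rules. All function names are made up for this example, and the result is not a substitute for a real Porter or Snowball implementation.

```python
VOWELS = set("aeiou")

def has_vowel(stem: str) -> bool:
    # Simplification: Porter also counts "y" after a consonant as a vowel.
    return any(ch in VOWELS for ch in stem)

def phase_plurals(word: str) -> str:
    """Porter step 1a: try the rules in order, first match wins."""
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress -> caress
    if word.endswith("s"):
        return word[:-1]      # cats -> cat
    return word

def phase_ed_ing(word: str) -> str:
    """Simplified step 1b: drop -ed / -ing when the remaining stem has a vowel."""
    if word.endswith("eed"):
        return word           # Porter rewrites -eed under a measure condition; skipped here
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if not has_vowel(stem):
                return word   # sing, bed: nothing to strip
            # Undouble a final consonant (runn -> run), except l, s, z.
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

PHASES = (phase_plurals, phase_ed_ing)

def cascade_stem(word: str) -> str:
    word = word.lower()
    for phase in PHASES:      # each phase sees the previous phase's output
        word = phase(word)
    return word

for w in ("run", "runs", "running", "ran", "studies"):
    print(w, "->", cascade_stem(w))
```

Even this cut-down cascade maps run, runs and running to run, leaves ran untouched, and turns studies into the non-word studi.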

Watch the suffixes fall

A family of related words gets stemmed toward a shared root: run, runs and running collapse into the single stem run, while the irregular form ran stays separate. The demo also shows how stemming can produce non-words.

The trade-off

Upside

Very fast — just string rules, no dictionary
Shrinks the vocabulary a lot
Great for search and information retrieval
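
A small sketch of why this helps search: the same three documents are indexed twice, once with raw tokens and once with stemmed tokens. The stemmer here (toy_stem) is a hypothetical stand-in that strips only a few endings, and the documents are invented for the example.

```python
from collections import defaultdict

def toy_stem(word: str) -> str:
    # Hypothetical stand-in for a real stemmer: strips a few common endings.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs, normalize):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index

DOCS = [
    "the dog jumps over logs",
    "a dog jumped the fence",
    "dogs love jumping",
]

raw_index = build_index(DOCS, lambda t: t)   # no normalization
stemmed_index = build_index(DOCS, toy_stem)  # stem every token

print(len(raw_index), "raw terms vs", len(stemmed_index), "stemmed terms")
print("raw match for 'jumping':", sorted(raw_index.get("jumping", set())))
print("stemmed match for 'jumping':", sorted(stemmed_index.get(toy_stem("jumping"), set())))
```

On this tiny corpus the vocabulary drops from 11 terms to 8, and a query for jumping goes from matching one document to matching all three, because jumps, jumped and jumping share the stem jump. The query has to be stemmed with the same function as the index, otherwise the lookup misses.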

Downside

Stems are often not real words (studies → studi)
Over-stemming merges unrelated words (universe, university → univers)
Under-stemming leaves related forms apart (irregular ran is never reduced to run)

The smarter cousin

When you need real dictionary words and grammatical awareness, use Lemmatization instead. The comparison article shows exactly when to pick which.