Stemming

NLP, normalization, Porter stemmer, preprocessing

Collapse a word family to one root

"run", "runs", "running", "ran" all mean roughly the same thing — but to a bag-of-words model they're four different tokens. Stemming chops off endings to fold them toward a single root.

It works by applying crude suffix-stripping rules — remove -ing, -ed, -s, -ly — without knowing any grammar or dictionary. Fast and simple, but rough around the edges.
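
To make the idea concrete, here is a minimal sketch of a suffix stripper built from just the four endings named above. The function name `crude_stem` and the `min_stem` length guard are assumptions made for illustration; this is not a published algorithm.

```python
# Toy suffix stripper: a hypothetical illustration, not a published algorithm.
SUFFIXES = ("ing", "ed", "ly", "s")  # the endings named above, longest first


def crude_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["runs", "running", "jumped", "quickly", "ran", "bus"]:
    print(w, "->", crude_stem(w))
# runs -> run
# running -> runn
# jumped -> jump
# quickly -> quick
# ran -> ran
# bus -> bus
```

Note the rough edges: "running" becomes "runn" because nothing undoubles the consonant, and "ran" is untouched because it has no suffix to strip.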

The classic

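The Porter stemmer (1980) is the best known: a cascade of rule phases applied one after another. Its successor for English, the Snowball (Porter2) stemmer, refines the rules, and the Snowball framework provides stemmers for many other languages.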
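
A rough sketch of what "a cascade of rule phases" means in practice. The first phase below uses the four plural rules from step 1a of the original Porter algorithm. The second phase is a deliberately simplified stand-in for step 1b: it skips the -eed rule and the follow-up repairs such as -at → -ate, and its vowel test is cruder than Porter's. The names `mini_porter`, `step_1b_simplified`, and `has_vowel` are made up for this example, and the full algorithm has several more phases.

```python
def has_vowel(stem: str) -> bool:
    # Simplification: a, e, i, o, u are vowels; y counts unless it is the first letter.
    return any(ch in "aeiou" or (ch == "y" and i > 0) for i, ch in enumerate(stem))


def step_1a(word: str) -> str:
    """Plural rules from Porter step 1a; the first matching rule wins."""
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


def step_1b_simplified(word: str) -> str:
    """Strip -ed / -ing only if the remaining stem contains a vowel (simplified)."""
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if not has_vowel(stem):
                return word  # sing -> sing, bled -> bled
            # Undouble a final double consonant, except l, s, z (hopping -> hop).
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word


PHASES = (step_1a, step_1b_simplified)


def mini_porter(word: str) -> str:
    """Run each phase in order, feeding the output of one into the next."""
    word = word.lower()
    for phase in PHASES:
        word = phase(word)
    return word


for w in ["caresses", "studies", "runs", "running", "sing", "ran"]:
    print(w, "->", mini_porter(w))
```

Each phase only looks at the end of the string it receives, which is why the order of phases matters and why no dictionary is needed.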

Watch the suffixes fall

A family of related words gets stemmed to a shared root, shrinking four tokens into one — plus a look at how stemming can produce non-words.
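
The same effect can be sketched in a few lines of code. The example below counts a tiny bag of words before and after stemming. The `toy_stem` rules and the sample tokens are assumptions chosen for the demo, not the output of a real stemmer.

```python
from collections import Counter


def toy_stem(word: str) -> str:
    """Hypothetical rules for this demo only: ies -> i, else strip ing / ed / s."""
    if word.endswith("ies"):
        return word[:-2]
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = "connect connects connected connecting studies".split()

raw_vocab = Counter(tokens)                          # bag of words on raw tokens
stemmed_vocab = Counter(toy_stem(t) for t in tokens)  # bag of words on stems

print(len(raw_vocab), "distinct tokens before stemming")
print(dict(stemmed_vocab))
# 5 distinct tokens before stemming
# {'connect': 4, 'studi': 1}
```

The four "connect" forms fold into one entry, while "studies" turns into "studi", which is a usable index key but not a real word.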

The trade-off

Upside
  • Very fast — just string rules, no dictionary
  • Shrinks the vocabulary a lot
  • Great for search and information retrieval
Downside
  • Stems are often not real words (studies → studi)
  • Over-stemming merges unrelated words (universe, university → univers)
  • Under-stemming misses related forms (irregular ran is never folded into run)
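
The two failure modes in the list above can be checked mechanically if you have hand-labelled word families. The sketch below assumes a tiny labelled set and a `toy_stem` function whose rules were picked only to reproduce the examples in the list; both are hypothetical.

```python
from collections import defaultdict


def toy_stem(word: str) -> str:
    """Hypothetical rules, chosen only to reproduce the examples in the list."""
    for suffix in ("ity", "ing", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hand-labelled word families (an assumption for this demo): word -> family.
family = {
    "universe": "universe",
    "university": "university",
    "run": "run",
    "runs": "run",
    "ran": "run",
}

by_stem = defaultdict(set)           # stem -> words that share it
stems_per_family = defaultdict(set)  # family -> stems its words produce
for word, fam in family.items():
    stem = toy_stem(word)
    by_stem[stem].add(word)
    stems_per_family[fam].add(stem)

# Over-stemming: one stem covers words from more than one family.
over = {s: sorted(ws) for s, ws in by_stem.items()
        if len({family[w] for w in ws}) > 1}

# Under-stemming: one family is split across more than one stem.
under = {f: sorted(ss) for f, ss in stems_per_family.items() if len(ss) > 1}

print("over-stemmed:", over)
print("under-stemmed:", under)
# over-stemmed: {'univers': ['universe', 'university']}
# under-stemmed: {'run': ['ran', 'run']}
```
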
The smarter cousin

When you need real dictionary words and grammatical awareness, use Lemmatization instead. The comparison article shows exactly when to pick which.