Stemming
Collapse a word family to one root
"run", "runs", "running", "ran" all mean roughly the same thing — but to a bag-of-words model they're four different tokens. Stemming chops off endings to fold them toward a single root.
It works by applying crude suffix-stripping rules (remove -ing, -ed, -s, -ly) without consulting a dictionary or knowing any grammar. The result is fast and simple, but rough around the edges.
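To make the idea concrete, here is a minimal sketch of such a suffix stripper. It is a toy, not the Porter algorithm: the function name `naive_stem`, the `min_stem` guard, and the order in which suffixes are tried are all assumptions made for illustration.

```python
# Toy suffix stripper: an illustration only, not the Porter algorithm.
SUFFIXES = ("ing", "ed", "ly", "s")  # the endings named in the text


def naive_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["run", "runs", "running", "ran", "quickly", "jumped"]:
    print(w, "->", naive_stem(w))
# run -> run
# runs -> run
# running -> runn
# ran -> ran
# quickly -> quick
# jumped -> jump
```

Note that "running" comes out as "runn" and "ran" is untouched: blind suffix removal neither undoubles consonants nor recognizes irregular forms, which is why real stemmers layer extra rules on top.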
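The Porter stemmer (1980) is the best known; it applies a cascade of rule phases, each rewriting the output of the one before. Its successor is the English stemmer in Martin Porter's Snowball framework (often called Porter2), which refines the original rules; Snowball also provides stemmers for many other languages.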
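As an illustration of what one phase in that cascade looks like, the sketch below implements only step 1a of the original Porter algorithm, which handles plural endings. The function name `porter_step_1a` is chosen here for illustration, and the later phases (which deal with -ed, -ing, and derivational suffixes under conditions on the remaining stem) are omitted, so this is not a complete stemmer.

```python
# Step 1a of the original Porter stemmer: plural endings.
# Rules are tried in order and only the first match is applied.
STEP_1A = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (shields "ss" from the rule below)
    ("s", ""),       # cats     -> cat
]


def porter_step_1a(word: str) -> str:
    """Apply the first matching step-1a rule; return the word unchanged otherwise."""
    for suffix, replacement in STEP_1A:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "studies", "caress", "cats", "run"]:
    print(w, "->", porter_step_1a(w))
# caresses -> caress
# ponies -> poni
# studies -> studi
# caress -> caress
# cats -> cat
# run -> run
```

The ordering matters: the longer, more specific suffixes come first, and the seemingly pointless "ss" rule exists only to stop the final "s" rule from firing on words like "caress".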
Watch the suffixes fall
A family of related words gets stemmed to a shared root, shrinking four tokens into one — plus a look at how stemming can produce non-words.
The trade-off
- Very fast — just string rules, no dictionary
- Shrinks the vocabulary a lot
- Great for search and information retrieval
- Stems are often not real words (studies → studi)
- Over-stemming merges unrelated words (universe, university → univers)
- Under-stemming misses related forms: irregular ones such as ran are never reduced to run (ran ≠ run)
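One way to spot these effects on your own data is to group words by their stem and inspect what ends up together. The sketch below is a hypothetical helper, `group_by_stem`, that accepts any stemming function. To stay self-contained it uses a lookup table built from the stems quoted in the list above as a stand-in stemmer, with "run" and "ran" assumed to be left unchanged, as the last bullet implies.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

# Stand-in stemmer: a lookup of the stems quoted in the list above.
# A real stemmer (Porter, Snowball) would be plugged in here instead.
EXAMPLE_STEMS = {
    "studies": "studi",
    "universe": "univers",
    "university": "univers",
    "run": "run",
    "ran": "ran",
}


def group_by_stem(
    words: Iterable[str], stem: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Map each stem to the distinct words that collapse onto it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for word in words:
        key = stem(word)
        if word not in groups[key]:
            groups[key].append(word)
    return dict(groups)


words = list(EXAMPLE_STEMS)
groups = group_by_stem(words, lambda w: EXAMPLE_STEMS.get(w, w))

print(f"{len(words)} words -> {len(groups)} stems")
for stem, members in groups.items():
    if len(members) > 1:
        print(f"merged under {stem!r}: {members}")
# 5 words -> 4 stems
# merged under 'univers': ['universe', 'university']
```

Groups with more than one member are where to look for over-stemming; related words that land in separate groups, like "run" and "ran" here, are the under-stemming cases.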
When you need real dictionary words and grammatical awareness, use Lemmatization instead. The comparison article shows exactly when to pick which.