N-grams
Putting a little order back
An n-gram is a contiguous sequence of n tokens. Instead of counting only single words, you slide a window of n tokens over the text and treat each run of words as a feature.
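This is the cheap fix for Bag of Words' biggest flaw: it forgets order. With unigrams alone, "not good" and "good" look similar, because both contain the feature "good" and the "not" is counted separately from the word it negates. Add bigrams and "not good" becomes its own feature, so order is partly restored.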
The names
n = 1 unigram (one word) · n = 2 bigram (two) · n = 3 trigram (three).
Slide the window
Take one sentence and chop it into unigrams, then bigrams, then trigrams: at n = 2, "not" and "good" merge into the single, meaningful bigram "not good".
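The sliding window can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper name `ngrams`, the whitespace tokenization, and the example sentence are all assumptions made for the sketch.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, in order of appearance."""
    # A window of width n fits at len(tokens) - n + 1 starting positions.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example sentence, split on whitespace for simplicity.
tokens = "the movie was not good".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)  # ['the', 'movie', 'was', 'not', 'good']
print(bigrams)   # ['the movie', 'movie was', 'was not', 'not good']
print(trigrams)  # ['the movie was', 'movie was not', 'was not good']
```

A sentence of k tokens yields k - n + 1 n-grams, so each step up in n produces one fewer window but a more specific feature.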
Why and why not
Adding n-grams helps
- Captures phrases: "new york", "machine learning"
- Keeps negation: "not good" ≠ "good"
- More context than lone words
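To make the negation point concrete, the sketch below compares the features of a positive and a negated sentence. The helper `ngram_set` and both sentences are hypothetical, chosen only for illustration.

```python
def ngram_set(text, n):
    """Set of distinct n-grams in a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

positive = "the food was good"
negated = "the food was not good"

# Features that appear in exactly one of the two sentences.
unigram_diff = ngram_set(positive, 1) ^ ngram_set(negated, 1)
bigram_diff = ngram_set(positive, 2) ^ ngram_set(negated, 2)

print(sorted(unigram_diff))  # ['not']
print(sorted(bigram_diff))   # ['not good', 'was good', 'was not']
```

With unigrams, the two sentences differ only by a free-floating "not". With bigrams, the negated sentence gains "not good" and loses "was good", so a model can learn opposite weights for the two phrases.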
But beware
- Vocabulary explodes: every distinct n-gram becomes its own feature
- Higher n → very sparse features, since most long combinations appear only once or twice
- More risk of overfitting and slower training
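The growth and sparsity can be measured directly. The sketch below counts distinct n-grams in a tiny, made-up corpus and reports how many occur exactly once; the corpus and the helper `ngram_counts` are assumptions for illustration, and real corpora show the same effect far more strongly.

```python
from collections import Counter

# Hypothetical toy corpus.
docs = [
    "the movie was not good",
    "the movie was good",
    "the plot was not bad",
    "good acting but a bad plot",
]

def ngram_counts(docs, n):
    """Count how often each n-gram occurs across all documents."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

stats = {}
cumulative = 0
for n in (1, 2, 3):
    counts = ngram_counts(docs, n)
    singletons = sum(1 for c in counts.values() if c == 1)
    cumulative += len(counts)  # feature count when using the range 1..n
    stats[n] = (len(counts), singletons, cumulative)
    print(f"n={n}: {len(counts)} distinct, {singletons} seen once, "
          f"{cumulative} features for range 1-{n}")
```

On this corpus, 3 of 10 unigrams occur only once, against 10 of 13 bigrams and 10 of 11 trigrams, while the total feature count for the ranges 1-1, 1-2, and 1-3 climbs from 10 to 23 to 34. Features seen once carry little evidence, which is where the overfitting risk comes from.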
In practice
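Unigrams + bigrams (an "n-gram range of 1–2") is a common sweet spot. Pair n-grams with TF-IDF weighting and feed the resulting feature vectors into a classifier. Trigrams and beyond are used sparingly, because the extra sparsity usually costs more than the added context gains.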
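As a rough sketch of that recipe, the code below extracts unigrams and bigrams and weights them with a textbook TF-IDF, tf × ln(N / df), using only the standard library. The corpus, the helper names `extract` and `tfidf`, and the exact weighting formula are assumptions; library implementations such as scikit-learn's `TfidfVectorizer` with `ngram_range=(1, 2)` use smoothed variants and would give different numbers. The classifier step is left out.

```python
import math
from collections import Counter

def extract(text, n_min=1, n_max=2):
    """All n-grams of a text for n in n_min..n_max (whitespace tokens)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

# Hypothetical toy corpus.
docs = [
    "the movie was not good",
    "the movie was good",
    "the plot was not bad",
]

# Term frequencies per document, then document frequency per n-gram.
doc_counts = [Counter(extract(d)) for d in docs]
df = Counter()
for counts in doc_counts:
    df.update(counts.keys())

N = len(docs)

def tfidf(counts):
    """Unsmoothed TF-IDF: n-grams found in every document get weight 0."""
    return {gram: tf * math.log(N / df[gram]) for gram, tf in counts.items()}

vectors = [tfidf(c) for c in doc_counts]

# Highest-weighted feature of the first document.
best = max(vectors[0], key=vectors[0].get)
print(best, round(vectors[0][best], 3))  # not good 1.099
```

In the first document, the bigram "not good" receives the highest weight because it appears in no other document, while "the" and "was" drop to zero because they appear in all of them. Each dictionary in `vectors` is a sparse feature vector that could then be passed to a classifier.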