N-grams · Suman Bhadra Notes

Putting a little order back

An n-gram is a contiguous sequence of n tokens. Instead of single words, you slide a window over the text and treat each short run of words as a feature.

This is the cheap fix for Bag of Words' biggest flaw: it forgets order. With unigrams alone, "not good" and "good" differ by a single feature, "not", and nothing ties that word to the one it negates. Add bigrams and "not good" becomes its own feature, so order is partly restored.

The names

n = 1 unigram (one word) · n = 2 bigram (two) · n = 3 trigram (three).

Slide the window

Take one sentence, for example "the movie was not good", and chop it into unigrams, then bigrams, then trigrams. At the bigram step, "not" and "good" fuse into a single, meaningful feature: "not good".
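The sliding window can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the helper name ngrams is hypothetical, the example sentence is made up, and splitting on whitespace stands in for real tokenization.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple, in order."""
    # A sentence with k tokens yields k - n + 1 windows (none if n > k).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Whitespace splitting is a stand-in for a real tokenizer.
tokens = "the movie was not good".split()

for n in (1, 2, 3):
    print(n, [" ".join(gram) for gram in ngrams(tokens, n)])
```

With n = 2 the last window is ("not", "good"), the bigram that keeps the negation attached to the word it modifies.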

Why and why not

Adding n-grams helps

Captures phrases: "new york", "machine learning"
Keeps negation: "not good" ≠ "good"
More context than lone words

But beware

Vocabulary explodes: V distinct words allow up to V² distinct bigrams, so there are far more features
Higher n → very sparse features, since most long combinations are rare
More risk of overfitting and slower training
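The costs listed above can be made concrete by counting n-grams in a small corpus. The sketch below uses four made-up sentences and hypothetical helper names (ngrams, ngram_counts). It reports, for each n, how many distinct n-grams exist, how many occur only once, and how many features the range 1 to n adds up to.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(docs, n):
    """Count how often each n-gram occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(ngrams(doc.split(), n))
    return counts


# Toy corpus, invented for illustration only.
corpus = [
    "the movie was good",
    "the movie was not good",
    "the plot was not bad",
    "the acting was good",
]

total_features = 0
for n in (1, 2, 3):
    counts = ngram_counts(corpus, n)
    seen_once = sum(1 for c in counts.values() if c == 1)
    total_features += len(counts)
    print(f"n={n}: {len(counts)} distinct, {seen_once} seen once, "
          f"{total_features} features for range 1-{n}")
```

Even in this tiny corpus, each step up in n adds features, and a growing share of them appear only once. Those one-off features are the sparse, rare combinations that invite overfitting; on real text the effect is typically much stronger.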

In practice

Unigrams + bigrams (an "n-gram range of 1–2") is a common sweet spot. Weight the n-gram counts with TF-IDF and feed the resulting vectors into a classifier. Trigrams and beyond are used sparingly.
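One way to wire this up is with scikit-learn, sketched below. The notes do not name a library, so the choice of scikit-learn, of LogisticRegression as the classifier, and the toy training sentences are all assumptions made for illustration. The ngram_range=(1, 2) argument is what asks TfidfVectorizer for unigrams plus bigrams.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data: 1 = positive, 0 = negative.
texts = [
    "the movie was good",
    "the acting was really good",
    "a good plot and good acting",
    "the movie was not good",
    "the plot was not good at all",
    "not good and far too long",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF over unigrams and bigrams, then a linear classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(texts, labels)

# The learned vocabulary holds single words and two-word phrases.
vocab = model.named_steps["tfidfvectorizer"].vocabulary_
print("good" in vocab, "not good" in vocab)

# Predict a label for an unseen sentence.
print(model.predict(["the film was not good"]))
```

Changing the range to (1, 3) would add trigrams as well. With a corpus this small, any prediction is only a demonstration that the pipeline runs, not evidence of accuracy.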