N-grams

Putting a little order back

An n-gram is a contiguous sequence of n tokens. Instead of single words, you slide a window over the text and treat each short run of words as a feature.

This is the cheap fix for Bag of Words' biggest flaw: it forgets word order. With unigrams alone, "good" and "not good" differ only by the separate token "not", and nothing ties that negation to the word it negates. Add bigrams and "not good" becomes its own feature, so order is partly restored.
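To make the negation point concrete, here is a minimal sketch that builds a set of features from two short reviews, first with unigrams only and then with unigrams plus bigrams. The helper name bag_of_features, the whitespace tokenization, and the two example sentences are assumptions made for illustration.

```python
def bag_of_features(text, max_n):
    """Return the set of all n-grams of length 1..max_n, joined with spaces."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    features = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.add(" ".join(tokens[i:i + n]))
    return features

positive = "the food was good"
negative = "the food was not good"

# Unigrams only: the negative review differs by the lone token "not".
uni_pos = bag_of_features(positive, 1)
uni_neg = bag_of_features(negative, 1)
print(uni_neg - uni_pos)          # {'not'}

# Unigrams + bigrams: "not good" now exists as a feature of its own.
bi_pos = bag_of_features(positive, 2)
bi_neg = bag_of_features(negative, 2)
print(sorted(bi_neg - bi_pos))    # ['not', 'not good', 'was not']
print(sorted(bi_pos - bi_neg))    # ['was good']
```

With bigrams, each review carries a feature the other lacks ("was good" versus "not good"), which gives a downstream model something order-aware to weight.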

The names

n = 1 unigram (one word) · n = 2 bigram (two) · n = 3 trigram (three).

Slide the window

Take one sentence and chop it into unigrams, then bigrams, then trigrams. For "this movie is not good", the bigram window yields "this movie", "movie is", "is not", and "not good": "not" + "good" become a single, meaningful bigram.
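The sliding window itself fits in a few lines. The sketch below assumes whitespace tokenization and uses a hypothetical helper named ngrams; real pipelines usually tokenize more carefully.

```python
def ngrams(tokens, n):
    """Slide a window of width n over tokens; return each window as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this movie is not good".split()

for n, name in [(1, "unigrams"), (2, "bigrams"), (3, "trigrams")]:
    print(f"{name}:", [" ".join(gram) for gram in ngrams(tokens, n)])

# unigrams: ['this', 'movie', 'is', 'not', 'good']
# bigrams: ['this movie', 'movie is', 'is not', 'not good']
# trigrams: ['this movie is', 'movie is not', 'is not good']
```

A sentence of k tokens yields k - n + 1 n-grams, so a window wider than the sentence returns an empty list.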

Why and why not

Adding n-grams helps
  • Captures phrases: "new york", "machine learning"
  • Keeps negation: "not good" ≠ "good"
  • More context than lone words
But beware
  • Vocabulary explodes — far more features
  • Higher n → very sparse, rare combinations
  • More risk of overfitting and slower training
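
The growth and sparsity warnings can be checked by counting. This sketch uses a made-up four-sentence corpus and a hypothetical helper count_ngrams; it reports how many distinct n-grams exist at each n and how many of them occur only once. On a toy corpus the growth is modest, but with a vocabulary of V words there are up to V^n possible n-grams, so real text grows much faster.

```python
from collections import Counter

def count_ngrams(docs, n):
    """Count every n-gram across docs (n-grams never span two documents)."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy corpus, invented for illustration.
docs = [
    "the movie was good",
    "the movie was not good",
    "the food was good",
    "the food was not bad",
]

stats = {}
for n in (1, 2, 3):
    counts = count_ngrams(docs, n)
    distinct = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    stats[n] = (distinct, singletons)
    print(f"n={n}: {distinct} distinct, {singletons} seen only once")

# n=1: 7 distinct, 1 seen only once
# n=2: 8 distinct, 2 seen only once
# n=3: 8 distinct, 6 seen only once
```

Using unigrams through trigrams together gives 7 + 8 + 8 = 23 features instead of 7, and most of the trigrams appear a single time, which is the kind of rare feature a classifier can overfit to.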

In practice

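Unigrams + bigrams (an "n-gram range of 1–2") is a common sweet spot. Pair the n-grams with TF-IDF weighting and feed the resulting vectors into a classifier. Trigrams and beyond are used sparingly, usually only when there is enough data to keep their counts from being too sparse.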