BoW vs TF-IDF · Suman Bhadra Notes

Counts vs weighted counts

Bag of Words and TF-IDF produce vectors over the same vocabulary. The difference is in the values: BoW stores raw counts; TF-IDF scales each count by how rare the term is across the corpus (its inverse document frequency).

That one change reshapes the vector — common words that BoW inflates get pushed down, and rare, telling words rise. Same document, very different emphasis.

Same document, two vectors

Take one document, build its BoW count vector, then its TF-IDF vector: the common word collapses while the distinctive word stays tall.
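A minimal sketch of that comparison in plain Python. The three-sentence corpus and the helper names (`bow`, `idf`, `tfidf`) are made up for illustration, and the IDF here is the unsmoothed textbook form log(N / df); libraries often use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: "the" is in every document, "mat" in only one.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
docs = [text.split() for text in corpus]
vocab = sorted({tok for doc in docs for tok in doc})

def bow(doc):
    """Raw term counts over the shared vocabulary."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

def idf(term):
    """Unsmoothed inverse document frequency: log(N / df)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tfidf(doc):
    """Each count scaled by the term's IDF."""
    return [count * idf(term) for count, term in zip(bow(doc), vocab)]

# Print only the terms that occur in the first document.
for term, count, weight in zip(vocab, bow(docs[0]), tfidf(docs[0])):
    if count:
        print(f"{term:>4}  bow={count}  tfidf={weight:.3f}")
```

In the first document "the" has the highest count (2) but a TF-IDF weight of 0, because it appears in every document. The words unique to that document ("mat", "on", "sat") end up with the largest weights.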

Side by side

Bag of Words

Stores raw counts
Common words dominate
Dead simple, fully interpretable
Integer values

TF-IDF

Counts weighted by rarity
Common words down-weighted
Highlights distinctive keywords
Continuous values, usually normalized
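"Usually normalized" most often means scaling each document vector to unit length (L2 norm), so long documents do not outweigh short ones. A small sketch, with a hypothetical `l2_normalize` helper and made-up weights:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length; leave all-zero vectors unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

weights = [0.0, 3.0, 4.0]      # illustrative TF-IDF weights, not from a real corpus
unit = l2_normalize(weights)
print(unit)                    # [0.0, 0.6, 0.8]
```

After normalization, the dot product of two document vectors equals their cosine similarity.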

Which should you use?

Use BoW: simple baseline

Quick experiments, or with Naive Bayes (which models counts naturally). Easiest to explain.

Use TF-IDF: usually better

Search, document similarity, and most linear classifiers. The stronger default for classic NLP.
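To make the search case concrete, here is a rough sketch of ranking documents against a query by cosine similarity over TF-IDF weights. The corpus, document names, and helper functions are hypothetical, and the IDF is again the unsmoothed log(N / df).

```python
import math
from collections import Counter

# Hypothetical mini-corpus keyed by document name.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "the bird sang in the tree".split(),
}
n_docs = len(docs)
# Document frequency: in how many documents each term occurs.
df = Counter(tok for toks in docs.values() for tok in set(toks))

def tfidf(tokens):
    """Sparse TF-IDF vector as a dict; terms unseen in the corpus are ignored."""
    counts = Counter(tokens)
    return {t: c * math.log(n_docs / df[t]) for t, c in counts.items() if t in df}

def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = tfidf("the dog".split())
scores = {name: cosine(query, tfidf(toks)) for name, toks in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])  # d2
```

Only d2 scores above zero, because "the" carries no weight and "dog" occurs only there. With raw counts, d1 and d3 would also match the query through the shared "the".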

Use neither: go dense

When you need meaning or semantic similarity (synonyms, paraphrases), move to embeddings or transformers.

Shared limitation

Both are sparse, order-blind (until you add n-grams), and meaning-blind — "good" and "great" remain unrelated dimensions in either scheme.
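Both blind spots are easy to see in a few lines. This sketch uses made-up sentences and a hypothetical `ngrams` helper:

```python
from collections import Counter

def ngrams(tokens, n):
    """Contiguous n-token windows, each joined into a single feature string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "dog bites man".split()
b = "man bites dog".split()

# Order-blind: the unigram counts of the two sentences are identical.
same_unigrams = Counter(a) == Counter(b)
# Bigrams recover some word order, so the two sentences now differ.
same_bigrams = Counter(ngrams(a, 2)) == Counter(ngrams(b, 2))

# Meaning-blind: "good" and "great" are separate features and never overlap.
shared = set("a good film".split()) & set("a great film".split())

print(same_unigrams, same_bigrams, sorted(shared))  # True False ['a', 'film']
```

The two reviews overlap only on "a" and "film"; nothing in either scheme links "good" to "great".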