BoW vs TF-IDF
Counts vs weighted counts
Bag of Words (BoW) and TF-IDF produce vectors over the same vocabulary. The only difference: BoW stores raw counts; TF-IDF stores those counts weighted by how rare each word is across the corpus.
That one change reshapes the vector — common words that BoW inflates get pushed down, and rare, telling words rise. Same document, very different emphasis.
Same document, two vectors
Take one document and build its BoW count vector, then its TF-IDF vector: the common word's weight collapses while the distinctive word's weight stays high.
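A minimal sketch of that comparison in plain Python. The toy corpus, the helper names `bow_vector` and `tfidf_vector`, and the unsmoothed IDF formula log(N / df) are assumptions chosen for illustration; libraries commonly use smoothed IDF variants, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three tokenized documents.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the ball".split(),
    "the bird sang".split(),
]

# Both schemes share one vocabulary, so their vectors have the same dimensions.
vocab = sorted({tok for doc in corpus for tok in doc})
n_docs = len(corpus)

# Document frequency: how many documents contain each term.
df = {term: sum(term in doc for doc in corpus) for term in vocab}

# Unsmoothed IDF, log(N / df). A term that appears in every document gets 0.
idf = {term: math.log(n_docs / df[term]) for term in vocab}

def bow_vector(doc):
    """Raw term counts, one entry per vocabulary term."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

def tfidf_vector(doc):
    """The same counts, each multiplied by the term's IDF."""
    counts = Counter(doc)
    return [counts[term] * idf[term] for term in vocab]

doc = corpus[0]
bow = bow_vector(doc)
tfidf = tfidf_vector(doc)

# Print only the terms that occur in this document.
for term, count, weight in zip(vocab, bow, tfidf):
    if count:
        print(f"{term:>4}  count={count}  tfidf={weight:.3f}")
```

In this corpus "the" has the largest BoW count in the first document but appears in every document, so its IDF is zero and it vanishes from the TF-IDF vector, while "cat", which occurs in only one document, keeps a positive weight.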
Side by side
Bag of Words
- Stores raw counts
- Common words dominate
- Dead simple, fully interpretable
- Integer values
TF-IDF
- Counts weighted by rarity
- Common words down-weighted
- Highlights distinctive keywords
- Continuous values, usually normalized
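The "usually normalized" point can be illustrated with L2 normalization, which is one common choice and is assumed here rather than taken from the text. The vectors and the helper `l2_normalize` are hypothetical. The sketch shows why normalization helps: a document and a copy that is twice as long end up with the same unit vector.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length; leave an all-zero vector as is."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

# Hypothetical weights for one document, and for the same text repeated twice.
short_doc = [2.0, 1.0, 0.0, 1.0]
long_doc = [4.0, 2.0, 0.0, 2.0]

short_unit = l2_normalize(short_doc)
long_unit = l2_normalize(long_doc)

# After normalization, document length no longer affects the vector.
same_direction = all(
    math.isclose(a, b) for a, b in zip(short_unit, long_unit)
)
print(same_direction)
```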
Which should you use?
Use Bag of Words for quick experiments, or with Naive Bayes (which models counts naturally). It is the easiest to explain.
Use TF-IDF for search, document similarity, and most linear classifiers. It is the stronger default for classic NLP.
Both are sparse, order-blind (until you add n-grams), and meaning-blind — "good" and "great" remain unrelated dimensions in either scheme.
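A small sketch of those two blind spots, using raw counts for brevity (TF-IDF reweights the same dimensions, so the same limits apply). The example sentences and the helpers `bow`, `bigrams`, and `dot` are hypothetical.

```python
from collections import Counter

def bow(tokens):
    """Count each token (or n-gram); order is discarded."""
    return Counter(tokens)

def bigrams(tokens):
    """Adjacent token pairs, which restore some local word order."""
    return list(zip(tokens, tokens[1:]))

def dot(u, v):
    """Dot product of two sparse count vectors stored as Counters."""
    return sum(u[k] * v[k] for k in u.keys() & v.keys())

a = "dog bites man".split()
b = "man bites dog".split()

# Order-blind: the two sentences have identical unigram counts.
same_unigrams = bow(a) == bow(b)

# Adding bigrams tells them apart.
same_bigrams = bow(bigrams(a)) == bow(bigrams(b))

# Meaning-blind: "good" and "great" occupy separate dimensions, so they share nothing.
overlap = dot(bow(["good"]), bow(["great"]))

print(same_unigrams, same_bigrams, overlap)
```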