TF-IDF
Not all words are equally useful
In Bag of Words, "the" might appear 20 times and "asteroid" once — yet "asteroid" tells you far more about the document. TF-IDF fixes this by weighting words by how distinctive they are.
It multiplies two ideas: how often a word appears in this document (TF), and how rare the word is across all documents (IDF). A word that's frequent here but rare everywhere else gets a high score — that's a keyword.
The two pieces
TF (term frequency): how many times the word appears in the document (often divided by document length).
IDF (inverse document frequency): N = total documents, dfₜ = documents containing the word. Rare word → big IDF; ubiquitous word → near zero.
TF-IDF: the product. High only when the word is frequent here and rare elsewhere.
Walk the formula
Across 4 documents, watch "the" get crushed by a near-zero IDF while a distinctive word keeps a high weight.
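A minimal sketch of that walk-through, assuming a made-up 4-document corpus (the sentences below are hypothetical, not the ones from the original demo) and the unsmoothed IDF, log(N / df). The helper names `tf`, `idf`, and `tfidf` are illustrative only.

```python
import math
from collections import Counter

# Hypothetical toy corpus of 4 documents (an assumption for illustration).
docs = [
    "the asteroid passed the earth",
    "the cat sat on the mat",
    "the dog chased the cat",
    "the market fell on the news",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: how many documents contain each word at least once.
df = Counter(word for tokens in tokenized for word in set(tokens))

def tf(word, tokens):
    """Raw count divided by document length."""
    return tokens.count(word) / len(tokens)

def idf(word):
    """Unsmoothed IDF, log(N / df). Safe here because every corpus word has df >= 1."""
    return math.log(N / df[word])

def tfidf(word, tokens):
    """Product of the two pieces."""
    return tf(word, tokens) * idf(word)

first = tokenized[0]
for word in ("the", "asteroid"):
    print(f"{word:9s} tf={tf(word, first):.2f} idf={idf(word):.2f} tfidf={tfidf(word, first):.2f}")
```

In the first document "the" has the higher TF (0.40 against 0.20), but it appears in all 4 documents, so its IDF is log(4/4) = 0 and its weight is 0. "asteroid" appears in only one document, so it keeps a weight of about 0.28.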
Why the log in IDF?
The log keeps rare-word weights from blowing up, and it makes each extra document matter less as df grows. Going from 1 document to 2 matters a lot; going from 1,000 to 1,001 barely matters. The log captures that curve, and the "+1" smoothing avoids dividing by zero for a word that appears in no document.
tfidf(t, d) = tf(t, d) × log( N / (1 + dfₜ) ) — then the document's vector is usually L2-normalized.
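The formula and the normalization step can be sketched as follows. This is one possible reading, assuming tf is the raw count and the vector is stored as a plain dict; the function names and the example numbers are hypothetical.

```python
import math

def tfidf_vector(counts, df, n_docs):
    """Weights per the formula above: tf * log(N / (1 + df)), with tf as a raw count."""
    return {t: c * math.log(n_docs / (1 + df[t])) for t, c in counts.items()}

def l2_normalize(vec):
    """Scale the vector to unit Euclidean length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return dict(vec) if norm == 0 else {t: w / norm for t, w in vec.items()}

# Hypothetical numbers: word counts in one document and document frequencies, N = 100.
counts = {"asteroid": 2, "orbit": 1, "the": 10}
df = {"asteroid": 2, "orbit": 9, "the": 99}

vec = l2_normalize(tfidf_vector(counts, df, n_docs=100))
print(vec)
```

After normalization the squared weights sum to 1, so long and short documents become comparable: only the direction of the vector matters, not how many words the document contains.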
Here is that curve, live, for a corpus of N = 100 documents. Slide df along it and feel the shape: brutal at the left (1 doc vs 10 docs is a huge drop), nearly flat at the right (60 vs 100 barely registers). The second slider sets the word's count in your document, and the bars multiply the two.
Put df = 100 ("the"): the weight collapses to essentially zero no matter how big tf gets (with the "+1" smoothing, log(100/101) even dips slightly below zero). Frequency in one document cannot rescue a word that is everywhere. At df = 2 ("asteroid"), even tf = 2 outscores "the" at tf = 10. That asymmetry is the whole point of the log.
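Without the sliders, the same numbers can be reproduced in a few lines. The sketch below assumes N = 100 as in the text and prints both the unsmoothed log(N / df) and the "+1" variant so the two can be compared; the function names are illustrative.

```python
import math

N = 100  # corpus size used in the text

def idf_plain(df):
    """Unsmoothed IDF: log(N / df)."""
    return math.log(N / df)

def idf_smoothed(df):
    """IDF with "+1" smoothing in the denominator: log(N / (1 + df))."""
    return math.log(N / (1 + df))

# Walk along the curve: steep on the left, nearly flat on the right.
for df in (1, 2, 10, 60, 100):
    print(f"df={df:3d}  plain={idf_plain(df):6.3f}  smoothed={idf_smoothed(df):6.3f}")

# The comparison from the text, using raw counts as tf.
the_weight = 10 * idf_plain(100)      # everywhere, frequent in this document
asteroid_weight = 2 * idf_plain(2)    # rare, appears only twice here
print(the_weight, asteroid_weight)
```

With the unsmoothed form, df = 100 gives exactly zero; with the "+1" form it is slightly negative, which is why some implementations add 1 to the result or clip at zero.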
Where it shines — and its limits
Where it shines:
- Search ranking & keyword extraction
- A strong text-classification baseline
- Document similarity with cosine similarity
Its limits:
- Ignores word order (mix with n-grams)
- No sense of meaning: synonyms stay separate
- Sparse, high-dimensional vectors
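To make the document-similarity use concrete, here is a rough sketch of cosine similarity over sparse TF-IDF vectors. It assumes whitespace tokenization, raw counts for tf, and the unsmoothed log(N / df); the three sentences and the helper names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """One sparse {word: weight} dict per document, using count * log(N / df)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    return [
        {w: c * math.log(n / df[w]) for w, c in Counter(toks).items()}
        for toks in tokenized
    ]

def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical mini-corpus.
docs = [
    "the asteroid crossed the orbit of mars",
    "the asteroid orbit was measured",
    "the cat sat on the mat",
]
v = tfidf_vectors(docs)
print(cosine(v[0], v[1]), cosine(v[0], v[2]))
```

The first and third sentences share only "the", which carries zero weight, so their similarity is 0. The first two share "asteroid" and "orbit" and score higher. The limits show up here as well: each vector has one dimension per vocabulary word, and a synonym such as "planetoid" would contribute nothing to the match.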
See the head-to-head in BoW vs TF-IDF, then jump to dense embeddings to add meaning.