TF-IDF · Suman Bhadra Notes

Not all words are equally useful

In Bag of Words, "the" might appear 20 times and "asteroid" once — yet "asteroid" tells you far more about the document. TF-IDF fixes this by weighting words by how distinctive they are.

It multiplies two ideas: how often a word appears in this document (TF), and how rare the word is across all documents (IDF). A word that's frequent here but rare everywhere else gets a high score — that's a keyword.

The two pieces

TF (term frequency) = count in this doc

How many times the word appears in the document (often divided by document length).

IDF (inverse document frequency) = log(N / dfₜ)

N = total documents, dfₜ = documents containing the word. Rare word → big IDF; ubiquitous word → near zero.

TF-IDF = TF × IDF

The product. High only when the word is frequent here and rare elsewhere.

Walk the formula

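Across 4 documents, a word such as "the" that appears in all of them gets an IDF of log(4 / 4) = 0, so its weight is crushed no matter how often it occurs. A distinctive word that appears in only one of the four keeps an IDF of log(4 / 1) ≈ 1.39 (natural log) and so keeps a high weight.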
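To make the walk-through concrete, here is a minimal sketch in plain Python. The four sentences are a made-up corpus, the helper names `tf`, `idf` and `tfidf` are chosen for this example only, and it assumes raw counts for TF with the unsmoothed natural-log IDF from the section above.

```python
import math
from collections import Counter

# Hypothetical 4-document corpus; "the" appears in every document.
docs = [
    "the asteroid passed the earth",
    "the cat sat on the mat",
    "the dog chased the cat",
    "the rain fell on the roof",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: in how many documents does each word occur?
df = Counter(word for tokens in tokenized for word in set(tokens))

def tf(word, tokens):
    """Raw count of the word in one document."""
    return tokens.count(word)

def idf(word):
    """Unsmoothed inverse document frequency, log(N / df)."""
    return math.log(N / df[word])

def tfidf(word, tokens):
    """Product of the two pieces."""
    return tf(word, tokens) * idf(word)

first = tokenized[0]
for word in ["the", "asteroid"]:
    print(f"{word:>8}  tf={tf(word, first)}  df={df[word]}  "
          f"idf={idf(word):.3f}  tfidf={tfidf(word, first):.3f}")
```

In the first document "the" has tf = 2 but df = 4, so its weight is 2 × log(1) = 0. "asteroid" has tf = 1 and df = 1, so its weight is log(4) ≈ 1.386.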

Why the log in IDF?

Diminishing returns

The log keeps rare-word weights from blowing up. Going from a word in 1 doc to 2 docs matters a lot; going from 1,000 docs to 1,001 barely moves the needle. The log captures that curve. The "+1" smoothing in the common formula below guards against dividing by zero when a word appears in no document at all.

Common formula

tfidf(t, d) = tf(t, d) × log( N / (1 + dfₜ) ) — then the document's vector is usually L2-normalized.
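As a rough sketch of that formula, the function below computes tf × log(N / (1 + df)) for one document and then L2-normalizes the result. The function name `tfidf_vector` and the document-frequency numbers are assumptions made up for illustration. Libraries differ in the exact smoothing they apply, so treat this as one variant rather than a reference implementation.

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """Weights tf * log(N / (1 + df)) for one document, L2-normalized."""
    counts = Counter(tokens)
    weights = {
        word: count * math.log(n_docs / (1 + df.get(word, 0)))
        for word, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return weights  # all-zero vector: nothing to normalize
    return {word: w / norm for word, w in weights.items()}

# Hypothetical document frequencies in a corpus of 100 documents.
df = {"the": 100, "asteroid": 2, "orbit": 9}
vec = tfidf_vector(["the", "asteroid", "the", "orbit"], df, n_docs=100)

for word, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{word:>8}  {weight:+.4f}")
```

One thing this variant makes visible: with the "+1" in the denominator, a word found in every document gets log(100 / 101), which is slightly below zero rather than exactly zero.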

Here is that curve for a corpus of N = 100 documents, using the unsmoothed IDF log(N / dfₜ) with the natural log. It is brutal at the left: going from 1 doc to 10 docs halves the IDF, from about 4.6 to 2.3. It is much flatter at the right: going from 60 docs to 100 moves it only from about 0.5 to 0. The final weight multiplies this IDF by tf, the word's count in your document.

Put df = 100 ("the"): with the unsmoothed log(N / dfₜ), the weight collapses to zero no matter how big tf gets, because frequency in one document cannot rescue a word that is everywhere. At df = 2 ("asteroid"), even tf = 2 outscores "the" at tf = 10. That asymmetry is the whole point of the log.
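The numbers behind that comparison can be checked in a few lines. This sketch assumes the unsmoothed IDF with the natural log and N = 100, and the variable names are chosen only for this example.

```python
import math

N = 100  # corpus size used in the text

def idf(df):
    """Unsmoothed IDF with natural log: log(N / df)."""
    return math.log(N / df)

# Steep on the left, flat on the right.
for df in (1, 2, 10, 60, 100):
    print(f"df={df:>3}  idf={idf(df):.3f}")

asteroid = 2 * idf(2)    # tf = 2,  df = 2   -> about 7.824
the = 10 * idf(100)      # tf = 10, df = 100 -> 0.0
print(asteroid > the)    # True
```

The loop prints IDF values of roughly 4.605, 3.912, 2.303, 0.511 and 0.000, which is the shape described above.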

Where it shines — and its limits

Great for

Search ranking & keyword extraction
A strong text-classification baseline
Document similarity with cosine similarity
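For the document-similarity use, a minimal sketch of cosine similarity over sparse TF-IDF vectors might look like this. The vectors are stored as plain dicts and the weights are invented for illustration, not computed from a real corpus.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF weights for three documents.
doc_a = {"asteroid": 3.9, "orbit": 2.3}
doc_b = {"asteroid": 3.9, "impact": 2.9}
doc_c = {"cat": 2.1, "mat": 3.0}

print(round(cosine(doc_a, doc_b), 3))  # shares "asteroid", so clearly above 0
print(cosine(doc_a, doc_c))            # no shared terms, so 0.0
```

If the vectors are already L2-normalized, the two norms are 1 and the cosine reduces to the dot product.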

Still limited

Ignores word order (mix with n-grams)
No sense of meaning — synonyms stay separate
Sparse, high-dimensional vectors
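The word-order limit, and how n-grams soften it, can be seen with two toy sentences. The `ngrams` helper below is a hypothetical name written for this sketch.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "dog bites man".split()
b = "man bites dog".split()

# As bags of single words the two sentences are identical.
same_unigrams = sorted(a) == sorted(b)

# As bags of bigrams they share nothing.
shared_bigrams = set(ngrams(a, 2)) & set(ngrams(b, 2))

print(same_unigrams)    # True
print(shared_bigrams)   # set()
```

Feeding bigrams into TF-IDF alongside single words recovers some local word order, at the cost of an even larger and sparser vocabulary.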