Why Word Embeddings? · Suman Bhadra Notes

The meaning-blind problem

To Bag of Words and TF-IDF, every word is a separate dimension, so "good" and "great" are as unrelated as "good" and "refrigerator". The representation captures which words appear, never what they mean.
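
To make the problem concrete, here is a minimal sketch using one-hot vectors, the simplest one-dimension-per-word representation. The three-word vocabulary and the `one_hot` helper are assumptions for illustration only.

```python
import numpy as np

# Hypothetical toy vocabulary; a real one would have tens of thousands of entries.
vocab = ["good", "great", "refrigerator"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 in the slot owned by `word`."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

# The dot product between two different one-hot vectors is always zero,
# so no pair of distinct words is any "closer" than any other pair.
good_great = float(one_hot("good") @ one_hot("great"))
good_fridge = float(one_hot("good") @ one_hot("refrigerator"))
print(good_great, good_fridge)
```

Both scores come out identical, which is exactly the meaning-blindness described above.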

Word embeddings fix this. Each word becomes a short dense vector of real numbers — typically 100–300 of them — learned so that words used in similar contexts get similar vectors. Meaning becomes geometry.

The distributional hypothesis

"You shall know a word by the company it keeps" (J.R. Firth, 1957). Words appearing in similar contexts (king/queen near "throne", "royal") end up near each other in vector space.

From sparse one-hots to a meaning map

The animation contrasts sparse one-hot vectors with a dense embedding space — where related words cluster and even analogies become vector arithmetic.

What you gain

Similarity: close = related

"happy" and "joyful" land near each other; cosine similarity measures it.

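Cosine similarity is the cosine of the angle between two vectors: their dot product divided by the product of their lengths. The sketch below uses made-up 4-dimensional vectors, not values from any trained model, just to show the computation.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1 means same direction, 0 means orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors for illustration; real embeddings have 100-300 learned dimensions.
happy = np.array([0.9, 0.8, 0.1, 0.0])
joyful = np.array([0.8, 0.9, 0.2, 0.1])
table = np.array([0.0, 0.1, 0.9, 0.8])

sim_related = cosine_similarity(happy, joyful)
sim_unrelated = cosine_similarity(happy, table)
print(f"happy vs joyful: {sim_related:.2f}")
print(f"happy vs table:  {sim_unrelated:.2f}")
```

With these toy values the related pair scores close to 1 and the unrelated pair scores close to 0.
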
Compactness: 300 vs 50,000

A dense 300-dimensional vector replaces a sparse one-hot vector with 50,000 entries, one per vocabulary word.
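
One way to see the link between the two representations: an embedding table is a matrix with one row per word, and multiplying a one-hot vector by that matrix simply selects a row. The sketch below assumes the sizes quoted above and fills the matrix with random numbers as a stand-in for learned values; `word_id` is a hypothetical index.

```python
import numpy as np

vocab_size, embed_dim = 50_000, 300  # sizes quoted in the text

rng = np.random.default_rng(0)
# Random stand-in for a learned embedding matrix: one 300-dim row per word.
embeddings = rng.standard_normal(size=(vocab_size, embed_dim), dtype=np.float32)

word_id = 1234  # hypothetical index of some word in the vocabulary
one_hot = np.zeros(vocab_size, dtype=np.float32)
one_hot[word_id] = 1.0

via_matmul = one_hot @ embeddings   # 50,000 multiplications per output value
via_lookup = embeddings[word_id]    # direct row lookup, same result
same = bool(np.allclose(via_matmul, via_lookup))
print(same, via_lookup.shape)
```

In practice the lookup form is the one to use, since the one-hot vector never needs to be built.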

Analogies: vector arithmetic

king − man + woman ≈ queen. Relationships are directions in the space.
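
The analogy can be sketched with hand-built vectors. The values below are invented so that the first dimension loosely tracks "royalty", the second "male", and the third "female"; trained embeddings do not have such clean axes, and the match is only approximate there.

```python
import numpy as np

# Invented toy vectors; dimensions loosely mean (royal, male, female, fruit).
vectors = {
    "king":  np.array([0.9, 0.9, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.9, 0.0]),
    "man":   np.array([0.1, 0.9, 0.1, 0.0]),
    "woman": np.array([0.1, 0.1, 0.9, 0.0]),
    "apple": np.array([0.0, 0.1, 0.1, 0.9]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(target, exclude):
    """Return the word whose vector is most similar to `target`, skipping `exclude`."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

target = vectors["king"] - vectors["man"] + vectors["woman"]
answer = nearest(target, exclude={"king", "man", "woman"})
print(answer)  # queen
```

Excluding the three input words from the search is a common convention, since the target often stays closest to one of them.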

Transfer: pretrained

Reuse embeddings trained on billions of words to improve models trained on small datasets.
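
A common and simple way to reuse pretrained vectors is to average the vectors of a sentence's words and feed the result to a small classifier. The sketch below assumes a tiny hand-written dictionary named `pretrained` as a stand-in for a real pretrained table; loading an actual one is not shown.

```python
import numpy as np

# Hypothetical stand-in for a pretrained table; real tables cover a large
# vocabulary with 100-300 dimensions per word.
pretrained = {
    "great": np.array([0.9, 0.1, 0.0]),
    "movie": np.array([0.1, 0.8, 0.1]),
    "awful": np.array([-0.9, 0.1, 0.0]),
}
dim = 3

def sentence_features(text):
    """Average the vectors of known words; out-of-vocabulary words are skipped."""
    found = [pretrained[w] for w in text.lower().split() if w in pretrained]
    if not found:
        return np.zeros(dim)  # nothing recognized: fall back to a zero vector
    return np.mean(found, axis=0)

features = sentence_features("Great movie")
print(features)
```

The resulting fixed-length vector can serve as input features for any standard classifier, even when only a few labeled examples are available.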

The methods

There are several ways to learn these vectors, each covered in this track:

Word2Vec: predict context

Train a model to predict words from their neighbors inside a sliding window over the text; see Word2Vec.
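
The sliding window can be sketched as a generator of (center, context) training pairs, in the style of Word2Vec's skip-gram variant. Only the pair extraction is shown; the prediction model itself is omitted, and the function name and example sentence are illustrative.

```python
def skipgram_pairs(tokens, window=2):
    """Pair every token with each neighbor at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the king sat on the throne".split()
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs[:4]:
    print(center, "->", context)
```

Words that keep producing the same context words end up being pushed toward similar vectors during training.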

GloVe: co-occurrence

Factorize a global word co-occurrence matrix — see GloVe.
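
The sketch below builds a small co-occurrence matrix from a two-sentence toy corpus and then factorizes it. Note the assumption: the factorization step uses a plain truncated SVD of the log counts as a stand-in, which is not GloVe's actual weighted least-squares objective. It only illustrates the idea that factorizing co-occurrence counts yields word vectors.

```python
import numpy as np

# Toy corpus, invented for illustration.
corpus = [
    "the king sat on the throne".split(),
    "the queen sat on the throne".split(),
]
vocab = sorted({word for sentence in corpus for word in sentence})
index = {word: i for i, word in enumerate(vocab)}

def cooccurrence(corpus, window=2):
    """Count how often each pair of words appears within `window` positions."""
    counts = np.zeros((len(vocab), len(vocab)))
    for sentence in corpus:
        for i, word in enumerate(sentence):
            start = max(0, i - window)
            stop = min(len(sentence), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[index[word], index[sentence[j]]] += 1
    return counts

X = cooccurrence(corpus)

# Stand-in factorization (NOT the GloVe objective): truncated SVD of log(1 + X).
U, S, _ = np.linalg.svd(np.log1p(X))
vectors = U[:, :2] * S[:2]  # keep 2 dimensions per word

king, queen = vectors[index["king"]], vectors[index["queen"]]
print(np.allclose(king, queen))
```

Because "king" and "queen" have identical neighbors in this corpus, their count rows match and they receive the same vector, a small-scale version of the distributional hypothesis at work.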

Contextual: BERT-style

One vector per occurrence, so "bank" differs by context — the modern frontier.

One caveat

Classic embeddings give a word one static vector regardless of context, so "river bank" and "money bank" collide. Contextual models (BERT, GPT) fix this by producing a different vector for the same word in each context.
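
The collision can be shown with a toy lookup table. The vectors are invented, and `toy_contextual` is a deliberately crude, hypothetical stand-in that just averages a word with its neighbors; it is not how BERT or GPT compute their representations, and serves only to show what "a different vector per context" means.

```python
import numpy as np

# Invented static table: one fixed vector per word.
static = {
    "river": np.array([0.1, 0.9]),
    "money": np.array([0.9, 0.1]),
    "bank":  np.array([0.5, 0.5]),
}

def embed_static(tokens):
    """Static lookup: the vector depends only on the word itself."""
    return [static[t] for t in tokens]

def toy_contextual(tokens):
    """Crude stand-in for a contextual model: average all word vectors in the phrase."""
    phrase_mean = np.mean([static[t] for t in tokens], axis=0)
    return [phrase_mean for _ in tokens]

static_river = embed_static(["river", "bank"])[1]
static_money = embed_static(["money", "bank"])[1]
collide = bool(np.array_equal(static_river, static_money))

ctx_river = toy_contextual(["river", "bank"])[1]
ctx_money = toy_contextual(["money", "bank"])[1]
differ = not np.allclose(ctx_river, ctx_money)
print(collide, differ)
```

The static lookup returns the same vector for "bank" in both phrases, while anything that mixes in the surrounding words separates the two senses.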