Why Word Embeddings?
The meaning-blind problem
To Bag of Words and TF-IDF, every word is a separate dimension, so "good" and "great" are as unrelated as "good" and "refrigerator". The representation captures which words appear, never what they mean.
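To see the problem in numbers, here is a minimal sketch using one-hot vectors, the building block behind Bag of Words. The three-word vocabulary and the helper names `one_hot` and `cosine` are made up for this illustration.

```python
import numpy as np

# Toy vocabulary, assumed for illustration only.
vocab = ["good", "great", "refrigerator"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a single 1 in the word's own dimension."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

def cosine(a, b):
    """Cosine similarity: 1 means same direction, 0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_good_great = cosine(one_hot("good"), one_hot("great"))
sim_good_fridge = cosine(one_hot("good"), one_hot("refrigerator"))
print(sim_good_great, sim_good_fridge)
```

Both similarities come out as zero: each word sits on its own axis, so no pair of distinct words can ever look related.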
Word embeddings fix this. Each word becomes a short dense vector of real numbers — typically 100–300 of them — learned so that words used in similar contexts get similar vectors. Meaning becomes geometry.
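The idea that similar contexts yield similar vectors can be sketched with plain co-occurrence counts. This is only the underlying intuition, not how any particular embedding method is trained; the corpus and the helpers `context_counts` and `cosine` are invented for the example.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
corpus = [
    "the king sat on the throne",
    "the queen sat on the throne",
    "the royal king ruled",
    "the royal queen ruled",
    "the fridge kept milk cold",
]

def context_counts(target):
    """Count every other word that shares a sentence with `target`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == target:
                counts.update(tokens[:i] + tokens[i + 1:])
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)  # a Counter returns 0 for missing words
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

king, queen, fridge = (context_counts(w) for w in ("king", "queen", "fridge"))
sim_king_queen = cosine(king, queen)
sim_king_fridge = cosine(king, fridge)
print(sim_king_queen, sim_king_fridge)
```

In this toy corpus "king" and "queen" share all of their context words, so their count vectors point the same way, while "fridge" overlaps with them only through "the". Learned embeddings compress this kind of context signal into a few hundred dense dimensions.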
"You shall know a word by the company it keeps" (J.R. Firth, 1957). Words appearing in similar contexts (king/queen near "throne", "royal") end up near each other in vector space.
From sparse one-hots to a meaning map
The animation contrasts sparse one-hot vectors with a dense embedding space — where related words cluster and even analogies become vector arithmetic.
What you gain
"happy" and "joyful" land near each other; cosine similarity measures it.
A dense 300-dimensional vector replaces a sparse one-hot vector with one entry per vocabulary word, for example 50,000 entries.
king − man + woman ≈ queen. Relationships are directions in the space.
Reuse embeddings pretrained on billions of words to improve models trained on small datasets.
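The similarity and analogy claims above can be checked on a small scale. The sketch below uses hand-picked 3-dimensional vectors whose axes loosely stand for royalty, gender, and sentiment. These numbers are assumptions chosen to make the geometry visible; real embeddings are learned, and their axes are not individually interpretable.

```python
import numpy as np

# Hand-picked toy vectors with axes [royalty, gender, sentiment].
# Values are illustrative assumptions, not learned embeddings.
vectors = {
    "king":   np.array([0.9,  1.0, 0.0]),
    "queen":  np.array([0.9, -1.0, 0.0]),
    "man":    np.array([0.1,  1.0, 0.0]),
    "woman":  np.array([0.1, -1.0, 0.0]),
    "happy":  np.array([0.0,  0.0, 0.9]),
    "joyful": np.array([0.0,  0.1, 0.8]),
}

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similar meaning, similar direction.
sim_happy_joyful = cosine(vectors["happy"], vectors["joyful"])
sim_happy_king = cosine(vectors["happy"], vectors["king"])

# Analogy as vector arithmetic: king - man + woman.
target = vectors["king"] - vectors["man"] + vectors["woman"]

# Search for the closest word, excluding the three query words.
candidates = [w for w in vectors if w not in ("king", "man", "woman")]
nearest = max(candidates, key=lambda w: cosine(target, vectors[w]))

print(sim_happy_joyful, sim_happy_king, nearest)
```

Excluding the query words from the search is a common convention for analogy lookups; without it, the nearest neighbour of the result is often one of the inputs.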
The methods
There are several ways to learn these vectors, each covered in this track:
Contextual embeddings: one vector per occurrence, so "bank" differs by context. This is the modern frontier.
Classic embeddings give a word one static vector regardless of context, so "river bank" and "money bank" collide. Contextual models (BERT, GPT) fix this by producing a different vector for the same word in each context.
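The collision is easy to reproduce with a static lookup table. In the sketch below, the 2-dimensional vectors (axes loosely meaning nature and finance) are invented, and `toy_contextual_vector` is a deliberately crude stand-in that blends a word with its neighbours. Models such as BERT and GPT compute context with trained neural network layers, not with this averaging; the point here is only that the output depends on the surrounding words.

```python
import numpy as np

# Static table: one fixed vector per word (toy values, assumed).
static = {
    "river": np.array([1.0, 0.0]),
    "money": np.array([0.0, 1.0]),
    "bank":  np.array([0.5, 0.5]),
}

def static_vector(tokens, position):
    """Classic lookup: the context is ignored entirely."""
    return static[tokens[position]]

def toy_contextual_vector(tokens, position):
    """Crude stand-in for a contextual model: blend the word's own
    vector with the mean vector of the whole phrase."""
    phrase_mean = np.mean([static[t] for t in tokens], axis=0)
    return 0.5 * static[tokens[position]] + 0.5 * phrase_mean

river_phrase = ["river", "bank"]
money_phrase = ["money", "bank"]

# Static embeddings: "bank" gets the identical vector in both phrases.
static_river = static_vector(river_phrase, 1)
static_money = static_vector(money_phrase, 1)

# Context-dependent vectors: "bank" shifts toward its neighbours.
ctx_river = toy_contextual_vector(river_phrase, 1)
ctx_money = toy_contextual_vector(money_phrase, 1)

print(static_river, static_money)
print(ctx_river, ctx_money)
```

With the static table, the two senses of "bank" are indistinguishable. Once the vector depends on the neighbours, "bank" next to "river" leans toward the nature axis and "bank" next to "money" leans toward the finance axis.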