Word2Vec (2013) — Words Become Vectors
The world before this paper
In 2012, computers could count words but not understand them. A word was an ID in a table: a single 1 in a vector of zeros. There was no notion of closeness, no way to say that "cat" sits near "dog" and far from "carburetor". Every approach to fixing this hit the same wall. Either it stayed dumb, or it was too slow to train on realistic amounts of text.
Words were atomic IDs or one-hot vectors. "Cat" and "dog" were exactly as unrelated as "cat" and "carburetor" — the representation simply had no room for meaning.
Language models counted word sequences. Seeing "strong tea" a million times taught them nothing about "powerful tea" — a phrase they had never counted.
Earlier neural language models (Bengio et al., 2003) did learn embeddings, but the full network was so expensive that web-scale training was out of reach.
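To make the one-hot problem concrete, here is a minimal sketch. The three-word vocabulary and the helper names `one_hot` and `cosine` are made up for illustration. Any two different one-hot vectors are orthogonal, so their similarity is zero no matter which words they stand for.

```python
import math

# Toy vocabulary, chosen only for illustration.
vocab = ["cat", "dog", "carburetor"]

def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's index."""
    return [1.0 if w == word else 0.0 for w in vocab]

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

cat, dog, carb = (one_hot(w, vocab) for w in vocab)

sim_cat_dog = cosine(cat, dog)
sim_cat_carb = cosine(cat, carb)
print(sim_cat_dog, sim_cat_carb)  # 0.0 0.0
```

"Cat" is exactly as far from "dog" as from "carburetor", which is the wall the paragraph above describes.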
The key idea
At Google in 2013, Tomas Mikolov and his colleagues made an unfashionable bet: the deep part of the neural language model was the problem, not the solution. Their paper (Mikolov, Chen, Corrado & Dean, "Efficient Estimation of Word Representations in Vector Space", 2013) threw away the nonlinear hidden layer entirely. What survived was almost embarrassingly simple: a word looks at its neighbors, and a single shared projection layer learns to predict one from the other. Two flavors: skip-gram predicts the surrounding words from the center word; CBOW predicts the center word from its surroundings.
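The two flavors differ only in how the sliding window is turned into training examples. A rough sketch follows, with hypothetical helper names `skipgram_pairs` and `cbow_examples`, and a fixed window size as a simplifying assumption (the paper samples the window size at random per position).

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: predict each neighbor from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context_words, center) examples: predict the center from its neighbors."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

sentence = "the cat sat on the mat".split()
sg = skipgram_pairs(sentence, window=1)
cb = cbow_examples(sentence, window=1)

print(sg[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
print(cb[1])   # (['the', 'sat'], 'cat')
```

Skip-gram produces one example per (center, neighbor) pair, while CBOW produces one example per position with all neighbors bundled together.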
Simplicity was the superpower. With tricks to dodge the expensive output layer (hierarchical softmax in this paper, negative sampling in a follow-up later in 2013), the model could chew through billions of words on ordinary CPUs. And here is the magic: to get good at this trivial fill-in-the-blank game, the model is forced to place words used in similar contexts at nearby points. Nobody told it what a synonym was. The structure just appeared.
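To see why negative sampling is cheap, here is a minimal sketch of one skip-gram update in the form given in the follow-up paper (Mikolov et al., "Distributed Representations of Words and Phrases and their Compositionality", 2013). The function names `sgns_loss` and `sgns_step`, the learning rate, and the random stand-in vectors are assumptions for illustration. A real trainer would draw negatives from a word-frequency distribution. Each step touches only one true context word and a handful of negatives, not the whole vocabulary.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(v_center, u_context, u_negatives):
    """-log s(u_o . v_c) - sum_k log s(-u_k . v_c), with s the sigmoid."""
    loss = -math.log(sigmoid(dot(u_context, v_center)))
    for u_neg in u_negatives:
        loss -= math.log(sigmoid(-dot(u_neg, v_center)))
    return loss

def sgns_step(v_center, u_context, u_negatives, lr=0.05):
    """One in-place SGD step on the loss above."""
    grad_center = [0.0] * len(v_center)
    # Label 1 for the true context word, 0 for each sampled negative.
    targets = [(u_context, 1.0)] + [(u, 0.0) for u in u_negatives]
    for u, label in targets:
        g = sigmoid(dot(u, v_center)) - label  # prediction error
        for k in range(len(u)):
            grad_center[k] += g * u[k]     # accumulate using the old u
            u[k] -= lr * g * v_center[k]   # then update the output vector
    for k in range(len(v_center)):
        v_center[k] -= lr * grad_center[k]

# Demo on random stand-in vectors (not trained embeddings).
rng = random.Random(0)
dim = 8

def rand_vec():
    return [rng.uniform(-0.5, 0.5) for _ in range(dim)]

v_center = rand_vec()
u_context = rand_vec()
u_negatives = [rand_vec() for _ in range(5)]

before = sgns_loss(v_center, u_context, u_negatives)
for _ in range(100):
    sgns_step(v_center, u_context, u_negatives)
after = sgns_loss(v_center, u_context, u_negatives)
print(before, after)  # the loss should go down
```

The cost per update scales with the number of negatives (5 here), which is what lets training scale to billions of words.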
Strip the neural language model down to a single projection layer so it can train on billions of words — predicting context from word (skip-gram) or word from context (CBOW) — and meaning falls out as geometry.
Want the full mechanics? See Word2Vec mechanics.
Watch meaning become geometry
The paper's whole arc runs like this: words start as a meaningless scatter, a sliding context window provides the only training signal, clusters form on their own, and then the famous parallelogram appears.
The results that mattered
The paper's evidence came in two forms: raw speed, and a party trick nobody could stop talking about. On new analogy test sets — semantic ("Athens is to Greece as Oslo is to ?") and syntactic ("quick is to quickly …") — the vectors answered by pure arithmetic, because relations were encoded as consistent offsets.
Trained on a 1.6-billion-word dataset in under a day on CPUs, where earlier neural language models needed weeks for far less text. Web-scale was suddenly affordable.
Relations become vector offsets: the same direction that turns king into queen turns man into woman. Add and subtract meanings like numbers.
A word's entire meaning compressed into one dense vector. Its nearest neighbors by cosine similarity read like a synonym list.
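The arithmetic behind the analogy trick fits in a few lines. The sketch below uses hand-made 2-D toy vectors, not trained embeddings, and a hypothetical `analogy` helper. It answers "a is to b as c is to ?" by computing b - a + c and picking the closest remaining word by cosine similarity, leaving out the three query words.

```python
import math

# Hand-made toy vectors (roughly: royalty, gender). NOT real word2vec output.
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.7],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.8],
    "apple": [-0.8, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' via b - a + c, excluding the query words."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

answer = analogy("man", "king", "woman", vectors)
print(answer)  # queen
```

With real embeddings the offset is only approximately consistent, so the answer is the nearest neighbor of the computed point, not an exact match.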
Legacy — and the catch
Legacy:
- Made dense embeddings the universal first layer of NLP
- Pretraining on unlabeled text became standard practice
- Inspired GloVe, fastText, and eventually contextual embeddings
The catch:
- One vector per word: "bank" (river) and "bank" (money) share it
- Context-free: meaning doesn't change with the sentence
- Superseded by contextual embeddings from ELMo, BERT and friends
Read the original: arXiv:1301.3781. For the concepts behind it, see Word2Vec mechanics, Why word embeddings, Cosine similarity, and GloVe. Next paper: Seq2Seq + Attention (2014).