Word2Vec (2013) — Words Become Vectors

The world before this paper

In 2012, computers could count words but not understand them. A word was an ID in a table, a single 1 in a vector of zeros. There was no notion of closeness: no way to say that "cat" sits near "dog" and far from "carburetor". Every attempt to fix this hit one of two walls: it either ignored meaning altogether, or it was too slow to train on realistic amounts of text.

One-hot words: no similarity

Words were atomic IDs or one-hot vectors. "Cat" and "dog" were exactly as unrelated as "cat" and "carburetor" — the representation simply had no room for meaning.
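To make that concrete, here is a minimal sketch using a three-word toy vocabulary (the words are the ones from the text; the helper names `one_hot` and `cosine` are made up for illustration). It shows that any two distinct one-hot vectors have a cosine similarity of exactly zero, so "dog" is no closer to "cat" than "carburetor" is.

```python
import math

# Toy vocabulary; a real one would hold hundreds of thousands of words.
vocab = ["cat", "dog", "carburetor"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0.0] * len(vocab)
    vec[index[word]] = 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_cat_dog = cosine(one_hot("cat"), one_hot("dog"))
sim_cat_carburetor = cosine(one_hot("cat"), one_hot("carburetor"))
print(sim_cat_dog, sim_cat_carburetor)  # 0.0 0.0
```

Every pair of different words is orthogonal, whatever the vocabulary size, which is why no amount of one-hot data can express "similar".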

N-gram models: no generalizing

Language models counted word sequences. Seeing "strong tea" a million times taught them nothing about "powerful tea" — a phrase they had never counted.
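A small sketch of the counting approach shows the gap. The three sentences are an invented corpus, and real n-gram models add smoothing and back-off on top of raw counts, but the core limitation is the same: a count for one phrase says nothing about a phrase built from a similar word.

```python
from collections import Counter

# Invented toy corpus, for illustration only.
sentences = [
    "she drank strong tea",
    "he likes strong tea",
    "the car has a powerful engine",
]

# Count adjacent word pairs (bigrams) within each sentence.
bigram_counts = Counter()
for sentence in sentences:
    tokens = sentence.split()
    bigram_counts.update(zip(tokens, tokens[1:]))

seen = bigram_counts[("strong", "tea")]
unseen = bigram_counts[("powerful", "tea")]
print(seen, unseen)  # 2 0
```

The model has seen both "strong" and "powerful", yet nothing in the table links them, so "powerful tea" stays at zero.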

Neural LMs: too slow

Earlier neural language models (Bengio et al., 2003) did learn embeddings, but the full network was so expensive to train that web-scale corpora were out of reach.

The key idea

At Google in 2013, Tomas Mikolov and his colleagues made an unfashionable bet: the deep part of the neural language model was the problem, not the solution. Their paper (Mikolov, Chen, Corrado & Dean, "Efficient Estimation of Word Representations in Vector Space", 2013) removed the non-linear hidden layer entirely. What survived was almost embarrassingly simple: a word looks at its neighbors, and a single projection layer learns to predict one from the other. The model comes in two flavors: skip-gram predicts the surrounding words from the center word; CBOW (continuous bag-of-words) predicts the center word from its surroundings.
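The difference between the two flavors is easiest to see in the training examples each one draws from the same sentence. The sketch below is a simplification under stated assumptions: the function names are made up, the window is a fixed size, and the sentence is a toy. The released word2vec tool also does things like varying the window and subsampling frequent words, which are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: predict each neighbor from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context_words, center) examples: predict the center from its neighbors."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = "the cat sat on the mat".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)

print(sg[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
print(cb[2])   # (['cat', 'on'], 'sat')
```

Skip-gram turns one window into several single-word predictions, while CBOW turns it into one prediction from the pooled context.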

Simplicity was the superpower. With one clever trick, hierarchical softmax in place of a full softmax over the whole vocabulary, the model could chew through billions of words on ordinary CPUs. (Negative sampling, the speed-up most people now associate with word2vec, arrived months later in the team's follow-up paper, "Distributed Representations of Words and Phrases and their Compositionality".) And here is the magic: to get good at this trivial fill-in-the-blank game, the model is forced to place words used in similar contexts at nearby points. Nobody told it what a synonym was. The structure just appeared.
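Why does hierarchical softmax help? Instead of scoring every word in the vocabulary, the model makes a short chain of left/right decisions down a binary tree whose leaves are the words, so a prediction costs roughly log2(V) sigmoid evaluations rather than V output scores. The sketch below is a toy under loud assumptions: a hand-built balanced tree over four words, made-up node vectors, and a made-up hidden vector. The paper itself uses a Huffman tree, which gives frequent words shorter paths.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical balanced tree over 4 words. Each word is reached by a path of
# (inner_node, direction) steps from the root: +1 means left, -1 means right.
paths = {
    "cat": [("root", +1), ("n_left", +1)],
    "dog": [("root", +1), ("n_left", -1)],
    "tea": [("root", -1), ("n_right", +1)],
    "car": [("root", -1), ("n_right", -1)],
}

# Made-up vectors for the inner nodes; in training these are learned.
node_vectors = {
    "root": [0.5, -0.2],
    "n_left": [0.1, 0.9],
    "n_right": [-0.7, 0.3],
}

def word_probability(word, hidden):
    """Multiply the left/right decision probabilities along the word's path."""
    p = 1.0
    for node, direction in paths[word]:
        p *= sigmoid(direction * dot(node_vectors[node], hidden))
    return p

hidden = [0.3, -1.2]  # stand-in for the projection-layer output
probs = {word: word_probability(word, hidden) for word in paths}
total = sum(probs.values())
print(round(total, 6))  # 1.0
```

Because sigmoid(x) + sigmoid(-x) = 1 at every inner node, the leaf probabilities sum to one without ever normalizing over the whole vocabulary. Here each word needs two decisions for four words; for a million-word vocabulary a balanced tree would need about twenty.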

The paper in one sentence

Strip the neural language model down to a single projection layer so it can train on billions of words — predicting context from word (skip-gram) or word from context (CBOW) — and meaning falls out as geometry.

Want the full mechanics? See Word2Vec mechanics.

How meaning becomes geometry

The paper's whole arc in four steps: words start as a meaningless scatter, a sliding context window provides the only training signal, clusters form on their own, and then the famous parallelogram appears.

The results that mattered

The paper's evidence came in two forms: raw speed, and a party trick nobody could stop talking about. On new analogy test sets — semantic ("Athens is to Greece as Oslo is to ?") and syntactic ("quick is to quickly …") — the vectors answered by pure arithmetic, because relations were encoded as consistent offsets.

Scale: 1.6B words

Trained in under a day on CPUs — earlier neural language models needed weeks for far less text. Web-scale was suddenly affordable.

The famous analogy: king − man + woman ≈ queen

Relations become vector offsets: the same direction that turns king into queen turns man into woman. Add and subtract meanings like numbers.
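The arithmetic can be sketched in a few lines. The vectors below are hand-made toys, not learned embeddings: three invented dimensions (roughly "royalty", "gender", "food") chosen so the offset structure is visible. The lookup follows the usual recipe for this test: compute king − man + woman, then pick the nearest remaining word by cosine similarity, leaving out the three input words.

```python
import math

# Hand-made toy vectors (royalty, gender, food); real vectors are learned.
vectors = {
    "king":  [0.9,  0.8, 0.1],
    "queen": [0.9, -0.8, 0.1],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0,  0.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' via vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the question words so the answer is a new word.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = analogy("man", "king", "woman")
print(answer)  # queen
```

With learned vectors the match is approximate rather than exact, which is why the result is read off as the nearest neighbor instead of an equality.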

Density: 300 dims

A word's entire meaning compressed into one dense vector. Its nearest neighbors by cosine similarity read like a synonym list.

Legacy — and the catch

What it unlocked

Made dense embeddings the universal first layer of NLP
Pretraining on unlabeled text became standard practice
Inspired GloVe, fastText, and eventually contextual embeddings

The limits

One vector per word — "bank" (river) and "bank" (money) share it
Context-free: meaning doesn't change with the sentence
Superseded by contextual embeddings from ELMo, BERT and friends
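The first two limits come down to the fact that a trained word2vec model is a lookup table. The sketch below uses a made-up three-word table with invented 2-D vectors; the `embed` helper is hypothetical. Whatever sentence "bank" appears in, the lookup returns the same vector.

```python
# Made-up 2-D vectors, for illustration only; real vectors are learned.
static_embeddings = {
    "river": [0.9, 0.1],
    "bank":  [0.5, 0.5],
    "loan":  [0.1, 0.9],
}

def embed(sentence):
    """Look up one fixed vector per token, ignoring the rest of the sentence."""
    return [static_embeddings[token] for token in sentence.split()]

bank_by_river = embed("river bank")[1]
bank_with_loan = embed("bank loan")[0]
same_vector = bank_by_river == bank_with_loan
print(same_vector)  # True
```

A contextual model would instead compute the vector for "bank" from the whole sentence, so the two occurrences could differ.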

Go deeper

Read the original: arXiv:1301.3781. For the concepts behind it, see Word2Vec mechanics, Why word embeddings, Cosine similarity, and GloVe. Next paper: Seq2Seq + Attention (2014).