Word2Vec — CBOW & Skip-gram
Learn meaning by predicting neighbours
Word2Vec (Google, 2013) learns word embeddings with a beautifully simple trick: train a shallow network to predict words from their neighbours. The hidden weights it learns are the word vectors.
Because words that share contexts get pushed to similar vectors, the side-effect of this prediction game is a space where meaning lives in geometry — no labels required, just raw text.
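The claim that the hidden weights are the word vectors can be checked with a minimal NumPy sketch. The toy vocabulary, the dimension of 3, and the names `W_in`, `hidden_activation` and `embedding` are all assumptions made for illustration. Since the hidden layer has no non-linearity, multiplying a one-hot input by the weight matrix simply selects one row, and that row is the word's vector.

```python
import numpy as np

# Toy vocabulary and sizes (hypothetical, for illustration only).
vocab = ["king", "queen", "man", "woman"]
word_to_id = {w: i for i, w in enumerate(vocab)}
embedding_dim = 3

rng = np.random.default_rng(0)
# Input-to-hidden weights: one row per vocabulary word.
W_in = rng.normal(size=(len(vocab), embedding_dim))

def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def hidden_activation(word):
    """Hidden-layer output for a one-hot input (linear layer, no activation)."""
    return one_hot(word_to_id[word], len(vocab)) @ W_in

def embedding(word):
    """The same values, read directly as a row of the weight matrix."""
    return W_in[word_to_id[word]]

same = np.allclose(hidden_activation("queen"), embedding("queen"))
print(same)
```

In practice implementations skip the matrix product and index the row directly, which is why the weight matrix is usually just called the embedding table.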
Two architectures
Word2Vec comes in two flavours that swap input and output. Both slide a window over the text (shown here with one word on each side for clarity; real models typically use 5–10) and differ only in the direction of the prediction.
CBOW vs Skip-gram
CBOW (Continuous Bag of Words)
- Input: context words → predict the centre word
- Faster to train
- Better for frequent words
Skip-gram
- Input: centre word → predict the context words
- Slower, but stronger on small data
- Better for rare words
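The difference between the two variants shows up most clearly in the training examples they extract from the same window. The following is a minimal sketch in plain Python; the function names `skipgram_pairs` and `cbow_examples` and the sample sentence are hypothetical, and real implementations add details such as subsampling of frequent words.

```python
def skipgram_pairs(tokens, window=1):
    """Skip-gram: one (centre, context) pair per neighbour in the window."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

def cbow_examples(tokens, window=1):
    """CBOW: one (context_words, centre) example per position."""
    examples = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, centre))
    return examples

tokens = "the cat sat on the mat".split()
print(skipgram_pairs(tokens)[:4])
print(cbow_examples(tokens)[:2])
```

Skip-gram produces several examples per position, one for each neighbour, while CBOW produces a single example per position with the neighbours pooled together. That is one way to see why CBOW trains faster and why Skip-gram gives rare words more individual updates.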
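Computing a full softmax over a 50,000-word vocabulary at every training step is too slow, so Word2Vec uses negative sampling: for each training pair, push the score of the true context word up and the scores of a handful of randomly sampled words down. Only those few output vectors are updated per step. Cheap and effective.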
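A single negative-sampling update for Skip-gram can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the reference implementation: the names `sgns_loss` and `sgns_step`, the tiny vocabulary of 10 words, the learning rate, and the fixed list of negative ids are all hypothetical. The original Word2Vec draws negatives according to smoothed word frequency; here they are fixed by hand so the demo is reproducible.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(W_in, W_out, centre, context, negatives):
    """Loss for one pair: true context scored high, sampled negatives scored low."""
    v = W_in[centre]
    pos = -np.log(sigmoid(W_out[context] @ v))
    neg = -np.sum(np.log(sigmoid(-(W_out[negatives] @ v))))
    return pos + neg

def sgns_step(W_in, W_out, centre, context, negatives, lr=0.1):
    """One in-place SGD step touching only 1 + len(negatives) output rows."""
    v = W_in[centre].copy()
    ids = np.array([context, *negatives])
    labels = np.zeros(len(ids))
    labels[0] = 1.0                       # 1 for the true word, 0 for negatives
    errors = sigmoid(W_out[ids] @ v) - labels
    grad_v = errors @ W_out[ids]          # gradient w.r.t. the centre vector
    # subtract.at applies every update even if an id appears more than once
    np.subtract.at(W_out, ids, lr * np.outer(errors, v))
    W_in[centre] -= lr * grad_v

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4                   # toy sizes (assumption)
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))

centre, context = 2, 3
negatives = [5, 7, 9]                     # fixed here; normally sampled at random

before = sgns_loss(W_in, W_out, centre, context, negatives)
for _ in range(100):
    sgns_step(W_in, W_out, centre, context, negatives)
after = sgns_loss(W_in, W_out, centre, context, negatives)
print(before, after)
```

Each step costs work proportional to the number of negatives rather than to the vocabulary size, which is where the speed-up over a full softmax comes from.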
What you get
"Paris" sits near "London"; "happy" near "joyful". Measure it with cosine similarity.
king − man + woman ≈ queen; Paris − France + Italy ≈ Rome.
Download vectors trained on billions of words and plug them in.
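Cosine similarity, mentioned above as the way to measure closeness, is the dot product of two vectors divided by the product of their lengths. A minimal sketch follows; the three 3-D vectors are made up for illustration and are not taken from any trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1 = same direction, 0 = orthogonal."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors (real embeddings usually have 100-300 dimensions).
vectors = {
    "paris":  [0.9, 0.8, 0.1],
    "london": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

sim_cities = cosine_similarity(vectors["paris"], vectors["london"])
sim_fruit = cosine_similarity(vectors["paris"], vectors["banana"])
print(sim_cities > sim_fruit)
```

Because the measure ignores vector length, two words can count as similar even when one has a much larger norm, as frequent words often do.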
Do the famous vector math yourself on a toy 2-D embedding. Pick an analogy: the grey arrow is the learned relationship (for king − man + woman, the offset from "man" to "king"), the orange dashed arrow re-applies that offset starting from the third word, and the star is where the arithmetic lands. The ringed point is the nearest real word.
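The same arithmetic can be sketched in a few lines of NumPy. The 2-D coordinates below are hand-picked for illustration, not learned, and the function name `analogy` is hypothetical. The sketch excludes the three query words from the candidates, a common convention, since otherwise one of them is often the nearest point.

```python
import numpy as np

# Hand-picked 2-D vectors (hypothetical): x roughly "royalty", y roughly "gender".
emb = {
    "man":   np.array([1.0, 1.0]),
    "woman": np.array([1.0, 3.0]),
    "king":  np.array([4.0, 1.2]),
    "queen": np.array([4.0, 3.1]),
    "apple": np.array([-2.0, -1.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, emb):
    """Return the word nearest to emb[a] - emb[b] + emb[c], excluding a, b, c."""
    target = emb[a] - emb[b] + emb[c]     # where the arithmetic lands (the star)
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

result = analogy("king", "man", "woman", emb)
print(result)
```

The computed target rarely coincides exactly with a real word's vector, so the answer is always "the nearest remaining word", which is why the result is written with ≈ rather than =.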
Notice the three grey arrows are (almost) parallel within each family: gender, capital-of, and -ing all became consistent directions in the space. Nobody programmed that — it fell out of predicting neighbouring words.
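One static vector per word: "bank" gets a single vector for both its river and money senses. Contextual models (BERT/GPT) later addressed this by producing a different vector for each occurrence of a word, depending on the surrounding text. Next: GloVe, a co-occurrence-based alternative.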