Word2Vec — CBOW & Skip-gram · Suman Bhadra Notes

Learn meaning by predicting neighbours

Word2Vec (Google, 2013) learns word embeddings with a beautifully simple trick: train a shallow network to predict a word from its neighbours, or the neighbours from the word. The input-to-hidden weights it learns are the word vectors.
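To see why the hidden weights double as word vectors, here is a minimal sketch, assuming a hypothetical five-word vocabulary and a randomly initialised weight matrix named W_in (both invented for illustration). Feeding a one-hot input through the first layer simply selects one row of the matrix, and that row is the word's embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]  # hypothetical toy vocabulary
dim = 3                                     # toy size; real models use far more dimensions

# Input-to-hidden weights: one row per vocabulary word.
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

idx = vocab.index("cat")
hidden = one_hot(idx, len(vocab)) @ W_in    # what the first layer computes

# Multiplying by a one-hot vector selects a row, so the hidden
# activation for "cat" is exactly its embedding.
print(np.allclose(hidden, W_in[idx]))
```

In practice implementations skip the matrix product and index the row directly, which is why the first layer is often described as a lookup table.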

Because words that share contexts get pushed to similar vectors, the side-effect of this prediction game is a space where meaning lives in geometry — no labels required, just raw text.

Two architectures

Word2Vec comes in two flavours that swap input and output. Both slide a window over the text (shown here with one word on each side for clarity; real models typically use 5–10), and they differ only in the direction of the prediction.

CBOW vs Skip-gram

CBOW (Continuous Bag of Words)

Input: context words → predict the centre word
Faster to train
Better for frequent words

Skip-gram

Input: centre word → predict the context words
Slower, but stronger on small data
Better for rare words
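The two prediction directions can be sketched as training-example generators. This is an illustrative sketch only: the function names and the sample sentence are made up, and the window is one word each side to match the description above.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=1):
    """(context words, centre word): many inputs, one target."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window=1):
    """(centre word, context word): one input, one pair per neighbour."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]

tokens = "the cat sat on the mat".split()
cbow = cbow_examples(tokens)
skip = skipgram_examples(tokens)

print(cbow[1])   # (['the', 'sat'], 'cat')
print(skip[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

Skip-gram produces one training pair per neighbour, so the same text yields more updates than CBOW, which is consistent with it being slower to train.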

The training trick

Computing a softmax over a 50,000-word vocabulary at every step is too slow, so Word2Vec uses negative sampling: each update raises the score of the true context word and lowers the scores of a handful of randomly sampled words. Cheap and effective.
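A single negative-sampling update for skip-gram might look like the sketch below. It rests on several assumptions that are not from the text: the names sgns_step, W_in and W_out are hypothetical, the sizes are toy values, and negatives are drawn uniformly and without duplicates for simplicity (the original implementation samples from a smoothed word-frequency distribution).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W_in, W_out, centre, context, negatives, lr=0.05):
    """One in-place update; returns the loss measured before the update.

    Assumes `negatives` are distinct word ids that differ from `context`.
    """
    v = W_in[centre].copy()                    # centre-word vector
    targets = np.array([context] + list(negatives))
    labels = np.zeros(len(targets))
    labels[0] = 1.0                            # only the true context is positive
    u = W_out[targets]                         # (k + 1, dim) output vectors (a copy)
    scores = sigmoid(u @ v)
    loss = -np.log(scores[0]) - np.sum(np.log(1.0 - scores[1:]))
    err = scores - labels                      # d(loss) / d(dot product)
    W_in[centre] -= lr * (err @ u)             # moves v toward context, away from negatives
    W_out[targets] -= lr * np.outer(err, v)    # touches only k + 1 rows, not the whole vocabulary
    return loss

rng = np.random.default_rng(0)
vocab_size, dim, k = 50, 8, 5                  # toy sizes
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))

centre, context = 3, 7                         # hypothetical word ids
candidates = np.array([w for w in range(vocab_size) if w != context])

losses = []
for _ in range(200):
    negatives = rng.choice(candidates, size=k, replace=False)
    losses.append(sgns_step(W_in, W_out, centre, context, negatives))

print(np.mean(losses[:10]) > np.mean(losses[-10:]))  # loss falls as the pair is learned
```

The saving is that each step updates k + 1 output rows rather than all 50,000, so the cost per step no longer depends on vocabulary size.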

What you get

Similar words cluster (geometry = meaning)

"Paris" sits near "London"; "happy" near "joyful". Measure it with cosine similarity.

Analogies (vector math)

king − man + woman ≈ queen; Paris − France + Italy ≈ Rome.
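Both the similarity measure and the analogy arithmetic fit in a few lines. The sketch below uses a hand-made 2-D embedding whose coordinates are invented for illustration, not learned, and follows the common convention of excluding the three query words from the nearest-neighbour search.

```python
import numpy as np

# Hypothetical toy vectors, chosen by hand so that related pairs share an offset.
emb = {
    "man":    np.array([1.0, 0.0]),
    "woman":  np.array([-1.0, 0.0]),
    "king":   np.array([1.0, 2.0]),
    "queen":  np.array([-1.0, 2.1]),
    "france": np.array([5.0, -3.0]),
    "paris":  np.array([5.0, -1.0]),
    "italy":  np.array([7.0, -3.0]),
    "rome":   np.array([7.0, -1.1]),
}

def cosine(a, b):
    """Cosine of the angle between a and b: 1 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(emb, a, b, c):
    """Word nearest to a - b + c by cosine similarity, excluding a, b and c."""
    target = emb[a] - emb[b] + emb[c]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Similar words point in similar directions.
print(cosine(emb["paris"], emb["rome"]) > cosine(emb["paris"], emb["queen"]))

print(analogy(emb, "king", "man", "woman"))      # queen
print(analogy(emb, "paris", "france", "italy"))  # rome
```

With real pretrained vectors the arithmetic lands near the answer rather than exactly on it, which is why the result is taken as the nearest word instead of an exact match.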

Reusable (pretrained)

Download vectors trained on billions of words and plug them in.
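As a rough sketch of what "plugging in" involves, the snippet below parses the plain-text word2vec format (a header line with the word count and dimension, then one word and its values per line) and builds an embedding matrix for a task vocabulary. The file contents, the function name load_word2vec_text and the zero-vector fallback for unknown words are all assumptions made for illustration; libraries such as gensim ship their own loaders.

```python
import io
import numpy as np

def load_word2vec_text(handle):
    """Parse the text format: 'count dim' header, then 'word v1 ... vdim' per line."""
    count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        if not line.strip():
            continue                              # tolerate blank lines
        parts = line.rstrip().split(" ")
        vec = np.array([float(x) for x in parts[1:]], dtype=np.float32)
        if vec.shape != (dim,):
            raise ValueError(f"bad vector length for {parts[0]!r}")
        vectors[parts[0]] = vec
    if len(vectors) != count:
        raise ValueError("header count does not match number of vectors")
    return vectors, dim

# Stand-in for a downloaded file (made-up values).
sample = io.StringIO(
    "3 4\n"
    "paris 0.1 0.2 0.3 0.4\n"
    "london 0.1 0.2 0.3 0.5\n"
    "happy -0.4 0.0 0.9 0.1\n"
)
vectors, dim = load_word2vec_text(sample)

# Build the embedding matrix for the words a downstream task uses.
task_vocab = ["paris", "happy", "unseenword"]
matrix = np.stack([vectors.get(w, np.zeros(dim, dtype=np.float32))
                   for w in task_vocab])
print(matrix.shape)  # (3, 4)
```

Words missing from the pretrained file need some fallback; a zero vector is one simple choice, and a small random vector is another.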

Do the famous vector math yourself on a toy 2-D embedding. For an analogy such as king − man + woman, the grey arrow is the learned relationship (the offset from man to king), the orange dashed arrow re-applies that offset from the third word (woman), and the star marks where the arithmetic lands; the ringed point is the nearest real word.

Notice the star always lands almost exactly on the answer: within each family the pair offsets (man→king vs woman→queen, france→paris vs italy→rome) are nearly parallel — gender, capital-of, and -ing all became consistent directions in the space. Nobody programmed that — it fell out of predicting neighbouring words.

Limitation

One static vector per word — "bank" gets a single vector for both river and money senses. Contextual models (BERT/GPT) fixed this later. Next: GloVe, a co-occurrence-based alternative.