The Attention Mechanism & Self-Attention
Look at everything, weigh what matters
RNNs squeeze a whole sequence into one hidden state — a bottleneck that loses distant detail. Attention removes the bottleneck: when processing a word, the model looks directly at every other word and decides how much each one matters.
To understand "it" in "The animal didn't cross the street because it was tired", attention lets "it" look back and put most of its weight on "animal" — resolving the reference directly.
Queries, keys, and values
Self-attention gives every word three roles. Watch one word's query compare against all keys, turn into weights, and pull a weighted blend of values.
The recipe
Each word emits a query — the question "which other words are relevant to me?"
Each word also emits a key — an advertisement of what it can answer.
Finally, each word emits a value: the actual content to be mixed in, weighted by how well query and key match.
Attention(Q,K,V) = softmax(QKᵀ / √d) · V, where d is the length of each query and key vector. In words: score every query-key pair, divide by √d, softmax each query's scores into weights, and take a weighted sum of values.
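The formula can be sketched in a few lines of plain Python. This is a minimal illustration, not an optimized implementation: the function names `softmax` and `attention` and the toy Q, K, V values are made up for the example, and real models compute Q, K, and V from learned projections of the word embeddings.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of vectors.

    Q, K: one vector of length d per word. V: one value vector per word.
    Returns (outputs, weights); weights[i][j] is how much word i attends to word j.
    """
    d = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Score this query against every key, scaled by 1/sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        # Blend the value vectors using the attention weights.
        out = [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy inputs (hypothetical numbers): 3 words, d = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(x, 3) for x in row])
```

Each printed row is one word's attention distribution over all three words, so every row sums to 1.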
The animation above followed one fixed query — but every word runs this at once. Click any token below to make it the query and see where its attention goes. The scale slider plays the role of 1/√d: it controls how peaky the softmax gets.
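Try each word: "it" and "tired" both look hard at "animal"; "because" links the two clauses. Then push scale to 3: attention collapses onto one word. This is why dot products get divided by √d: without the scaling, their typical magnitude grows with d and the softmax saturates. At 0.2 the weights are nearly uniform and the query can barely focus at all.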
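The effect of the scale can be reproduced numerically. The sketch below is only an illustration under stated assumptions: the five raw scores are made up, and the second half assumes query and key components are independent draws with mean 0 and variance 1, in which case the dot product has a standard deviation of about √d.

```python
import math
import random

def softmax(scores, scale=1.0):
    """Softmax of scale * scores; a larger scale gives a peakier result."""
    scaled = [s * scale for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores of one query against five keys.
scores = [4.0, 1.0, 0.5, 2.0, 0.0]
peaks = {}
for scale in (0.2, 1.0, 3.0):
    w = softmax(scores, scale)
    peaks[scale] = max(w)
    print(f"scale={scale}: weights={[round(x, 3) for x in w]}")

def dot_std(d, rng, samples=1000):
    """Empirical standard deviation of q.k for random unit-variance q and k."""
    dots = []
    for _ in range(samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / samples
    return math.sqrt(sum((x - mean) ** 2 for x in dots) / samples)

rng = random.Random(0)  # fixed seed so the run is repeatable
spread = {d: dot_std(d, rng) for d in (4, 64, 256)}
for d, s in spread.items():
    print(f"d={d}: std of q.k = {s:.2f}, after dividing by sqrt(d) = {s / math.sqrt(d):.2f}")
```

At scale 0.2 the weights are close to uniform, and at scale 3 almost all of the weight lands on the top-scoring word. In the second loop the raw spread grows with d, while the scaled spread stays near 1, which keeps the softmax out of the collapsed regime.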
Why it changed everything
Strengths
- Direct long-range connections: any word reaches any other in one step, with no fading memory
- Parallel: all positions are computed at once (no time loop)
- Interpretable attention weights
Costs
- O(n²) time and memory in sequence length n: every word attends to every word
- Needs positional encoding (no built-in order)
- Hungry for data and compute
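Two of the points above, the quadratic cost and the lack of built-in order, can be checked on a toy example. This is a simplified sketch: `self_attention` is a hypothetical helper that uses the input vectors directly as queries, keys, and values (no learned projections), and the input numbers are made up.

```python
import math

def self_attention(X):
    """Self-attention with Q = K = V = X (a simplification, no learned projections).

    Returns the output vectors and the number of query-key scores computed.
    """
    d = len(X[0])
    outputs, n_scores = [], 0
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        n_scores += len(scores)
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        outputs.append([sum(wj * x[c] for wj, x in zip(w, X)) for c in range(d)])
    return outputs, n_scores

# Four toy word vectors (hypothetical values).
X = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.5, 0.5, 0.0], [1.0, 1.0, 1.0]]

out, n_scores = self_attention(X)
out_reversed, _ = self_attention(X[::-1])

# n words produce n * n scores.
print(len(X), n_scores)

# Reversing the word order only reverses the outputs: each word gets the same
# vector as before, so attention alone cannot tell the two orderings apart.
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_a, row_b in zip(out, out_reversed[::-1])
    for a, b in zip(row_a, row_b)
)
print(same)
```

Because shuffling the input only shuffles the output, order information has to be supplied separately, which is the job of positional encoding.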
Run several attentions in parallel for multi-head attention, and you have the core of the transformer.
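A rough sketch of multi-head attention, under loud assumptions: the random matrices stand in for learned projection weights, the helper names (`matmul`, `random_matrix`, `multi_head_attention`) and the sizes are made up for illustration, and real implementations use tensor libraries rather than nested lists. Each head projects the input into its own smaller queries, keys, and values, runs attention, and the head outputs are concatenated and mixed by an output projection.

```python
import math
import random

def matmul(A, B):
    """Multiply matrix A (n x m) by matrix B (m x p), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for matrices given as lists of rows."""
    d = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix (an assumption of this sketch)."""
    return [[rng.gauss(0.0, 1.0 / math.sqrt(rows)) for _ in range(cols)]
            for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    """Run one attention per head, concatenate per word, apply the output projection."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_o)

rng = random.Random(0)
n_words, d_model, n_heads = 5, 8, 2
d_head = d_model // n_heads

X = random_matrix(n_words, d_model, rng)  # toy word vectors
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
W_o = random_matrix(n_heads * d_head, d_model, rng)

Y = multi_head_attention(X, heads, W_o)
print(len(Y), len(Y[0]))  # 5 8
```

The output has one vector of size `d_model` per word, the same shape as the input, so layers like this can be stacked.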