The Attention Mechanism & Self-Attention


Look at everything, weigh what matters

A recurrent network (RNN) carries a whole sequence forward in a single fixed-size hidden state, a bottleneck that tends to lose detail from distant words. Attention removes the bottleneck: when processing a word, the model looks directly at every word in the sequence and decides how much each one matters.

The "looking back" idea

To understand "it" in "The animal didn't cross the street because it was tired", attention lets "it" look back and put most of its weight on "animal" — resolving the reference directly.

Queries, keys, and values

Self-attention gives every word three roles: a query, a key, and a value. One word's query is compared against every word's key, the resulting scores are turned into weights, and those weights pull in a weighted blend of the values.

The recipe

Query (Q): what I'm looking for

Each word emits a query, the question "which other words are relevant to me?"

Key (K): what I offer

Each word also emits a key, an advertisement of what it can answer.

Value (V): what I pass on

Each word also emits a value: the actual content to be mixed in, weighted by how well the query matches the key.
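The recipe does not say where a word's query, key, and value come from. In a standard transformer, each is the word's embedding multiplied by a learned matrix. The sketch below assumes NumPy and uses random numbers as stand-ins for the embeddings and the learned matrices; the sizes and the names `n_tokens`, `d_model`, `d_k`, `W_q`, `W_k`, and `W_v` are illustrative choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): 5 words, 8-dim embeddings, 4-dim Q/K/V.
n_tokens, d_model, d_k = 5, 8, 4

# One embedding vector per word; random stand-ins for learned embeddings.
X = rng.normal(size=(n_tokens, d_model))

# Three projection matrices. A real model learns these; here they are random.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q  # row i: what word i is looking for
K = X @ W_k  # row i: what word i offers
V = X @ W_v  # row i: what word i passes on

print(Q.shape, K.shape, V.shape)  # (5, 4) (5, 4) (5, 4)
```

Every word gets all three vectors from the same embedding, which is what lets each word act as a query and as a key/value source at once.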

The formula

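Attention(Q, K, V) = softmax(QKᵀ / √d) · V, where d is the dimension of each query and key vector. In words: score every query-key pair, scale the scores by 1/√d, softmax each row into weights, then take a weighted sum of values.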
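The formula maps almost line for line onto code. The following is a minimal NumPy sketch, not an optimized implementation; the helper names `softmax` and `attention` and the random inputs are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]                     # query/key dimension
    scores = Q @ K.T / np.sqrt(d)       # one score per query-key pair
    weights = softmax(scores, axis=-1)  # each row becomes weights summing to 1
    return weights @ V, weights         # weighted sum of values

# Random stand-ins for the Q, K, V of a 5-word sentence (illustrative only).
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))

out, weights = attention(Q, K, V)
print(out.shape)      # (5, 4): one blended vector per word
print(weights.shape)  # (5, 5): row i holds word i's attention over all words
```

Row i of `weights` is the attention distribution for word i acting as the query, so all queries are handled in one matrix product.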

The description above followed one fixed query, but every word runs this computation at once, each acting as the query in turn. In the interactive demo, clicking a token makes it the query and shows where its attention goes. The scale slider plays the role of 1/√d: it multiplies the scores before the softmax and so controls how peaked the weights get.

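In the example sentence, "it" and "tired" both put most of their weight on "animal", while "because" links the two clauses. Scale matters as well: at a scale of 3, attention collapses onto one word, and at 0.2 the weights are close to uniform, so the model can barely focus at all. Dot products tend to grow in magnitude with the vector dimension d, so without the division by √d the softmax would saturate in the same way as at high scale.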
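The effect of scale can be checked numerically. The sketch below assumes NumPy and a made-up score vector (`scores` is hypothetical, not taken from the demo). It first applies the three scales to the same scores, then samples random unit-variance vectors to show that raw dot products spread out roughly like √d.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical raw scores of one query against four keys.
scores = np.array([2.0, 1.0, 0.5, 0.0])

peak = {}
for scale in (0.2, 1.0, 3.0):
    weights = softmax(scale * scores)
    peak[scale] = weights.max()  # largest weight: higher means more peaked
    print(f"scale={scale}: weights={np.round(weights, 3)}")

# Why 1/sqrt(d): for random vectors with unit-variance components,
# the standard deviation of the dot product grows like sqrt(d).
rng = np.random.default_rng(0)
d = 64
q = rng.normal(size=(10_000, d))
k = rng.normal(size=(10_000, d))
dots = (q * k).sum(axis=1)
raw_std = dots.std()
scaled_std = (dots / np.sqrt(d)).std()
print(f"std of q.k: {raw_std:.2f}; after dividing by sqrt(d): {scaled_std:.2f}")
```

At the low scale the four weights stay close to each other, and at the high scale nearly all the weight lands on the top-scoring key. Dividing by √d brings the spread of the scores back to about 1 regardless of d.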

Why it changed everything

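Strengths
  • Direct long-range connections: any two words are one step apart, so there is no fading memory
  • Parallel: all positions are computed at once, with no loop over time steps
  • Attention weights can be inspected, which gives some (imperfect) interpretability
Cost
  • O(n²) time and memory in the sequence length n, because every word attends to every word
  • Needs positional encoding, since attention has no built-in notion of word order
  • Hungry for data and compute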
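Two of the costs above are easy to see in a few lines. The sketch below assumes NumPy, random stand-ins for embeddings and projection matrices, and a hypothetical helper `self_attention`. It shows that the weight matrix has n × n entries, and that shuffling the input words merely shuffles the outputs in the same way, so plain self-attention cannot tell word order apart without positional information.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 6, 4                       # illustrative sizes
X = rng.normal(size=(n, d))       # stand-in embeddings, no positional encoding
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(weights.shape)              # (6, 6): n * n weights to compute and store

# Reorder the words with a fixed permutation and run attention again.
perm = np.array([3, 0, 5, 1, 4, 2])
out_shuffled, _ = self_attention(X[perm], W_q, W_k, W_v)

# Each word gets the same output vector as before, just in the new position.
same_up_to_order = np.allclose(out_shuffled, out[perm])
print(same_up_to_order)
```

Because the outputs only follow the shuffle, a sentence and any reordering of it look alike to this layer. Positional encodings added to `X` are the usual way to break that symmetry.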
Next

Run several attention operations in parallel, each with its own query, key, and value projections, and you get multi-head attention: the core of the transformer.
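As a rough preview, multi-head attention can be sketched by splitting the projected vectors into several smaller heads, running attention in each head, and concatenating the results. The code below assumes NumPy and random stand-ins for the learned matrices; `multi_head_attention`, `W_o`, and the sizes are illustrative names and values, and details of real transformer layers (biases, masking, dropout) are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads  # assumes d_model is divisible by n_heads

    def split_heads(M):
        # (n, d_model) -> (n_heads, n, d_head)
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    # Scaled dot-product attention, run independently in every head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                  # (n_heads, n, d_head)

    # Concatenate the heads and mix them with an output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 8, 2
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)      # (5, 8)
print(weights.shape)  # (2, 5, 5): one attention map per head
```

Each head has its own attention map, so different heads are free to focus on different relationships between the same words.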