Multi-Head Attention
Many attentions, in parallel
A single attention pattern can only focus one way at a time. But words relate in many ways at once — grammar, meaning, reference. Multi-head attention runs several attention mechanisms in parallel, each with its own learned Q/K/V projections, then combines them.
Each "head" is free to specialize: one might track subject-verb agreement, another might resolve pronouns, another might attend to nearby words. Together they capture a far richer picture than any single head.
See the heads specialize
Watch the same sentence attended by several heads, each with a different focus pattern — then concatenated into one rich representation.
How it's built
Project the input into h separate, smaller Q/K/V sets — one per head.
Run attention independently in each head. The heads are computed together as one batched operation, so they add essentially no extra time.
Stitch the heads' outputs back together and pass through a final linear layer.
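The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name multi_head_attention, the random weights, and the toy sizes (5 tokens, model dim 16, 4 heads) are all assumptions made for the example, and masking, biases, and dropout are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); every weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Step 1: project the input, then slice the result into one Q/K/V set per head.
    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Step 2: scaled dot-product attention, batched over the head axis.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    heads = weights @ v                                  # (heads, seq, d_head)

    # Step 3: concatenate the heads and apply the final linear layer.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)  # (5, 16) (4, 5, 5)
```

The output has the same shape as the input, and weights holds one attention matrix per head. Projecting with a single full-width matrix and then reshaping is equivalent to giving each head its own smaller projection; it is simply the batched way to write it.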
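The original transformer used 8 heads. Each head works in a smaller subspace (model dim ÷ heads, which is 512 ÷ 8 = 64 dimensions per head in the original base model), so total compute is similar to one big attention, yet the model can learn several different attention patterns at once.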
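A quick back-of-the-envelope check makes the "similar compute" claim concrete. The sketch below assumes the original base transformer's model dim of 512 and an arbitrary illustrative sequence length of 128; it counts only the Q/K/V projection weights and the multiply-adds for the query-key scores, ignoring biases and the final linear layer.

```python
d_model = 512    # model dim of the original base transformer
num_heads = 8
d_head = d_model // num_heads

# Q/K/V projection weights: one full-width head vs. eight narrow heads.
single_head_params = 3 * d_model * d_model
multi_head_params = num_heads * 3 * d_model * d_head

# Multiply-adds for the query-key score matrices.
seq_len = 128    # illustrative sequence length (an assumption)
single_head_scores = seq_len * seq_len * d_model
multi_head_scores = num_heads * seq_len * seq_len * d_head

print(d_head)                                   # 64
print(single_head_params, multi_head_params)    # 786432 786432
print(single_head_scores == multi_head_scores)  # True
```

Splitting into heads redistributes the same budget across several narrower attention computations instead of adding to it.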
This is the view researchers actually use: the attention matrix. Each row is a query, each column a key, and each cell shows how much attention weight that query places on that key, so every row sums to 1. Switch heads to see three completely different patterns over the same sentence, and click any row to inspect that word's full distribution.
Head 1 hugs the diagonal (local context), Head 2 pours everything into "cat" (the subject), Head 3 lights up the verbs. The Average blurs them together — exactly why the model keeps the heads separate until the end.
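To make the three patterns and their average tangible, here is a small hand-built sketch. The matrices are constructed by hand instead of being learned, and the sentence is a hypothetical stand-in, so treat the numbers as illustrative only.

```python
import numpy as np

tokens = ["the", "cat", "sat", "and", "purred"]  # hypothetical sentence
n = len(tokens)

def normalize_rows(m):
    # Turn each row into a probability distribution over keys.
    return m / m.sum(axis=1, keepdims=True)

# Head 1: local context, weight on each word and its immediate neighbors.
idx = np.arange(n)
local = normalize_rows((np.abs(idx[:, None] - idx[None, :]) <= 1).astype(float))

# Head 2: almost all weight on the subject, "cat".
subject = np.full((n, n), 0.01)
subject[:, tokens.index("cat")] = 1.0
subject = normalize_rows(subject)

# Head 3: weight on the verbs.
verbs = np.full((n, n), 0.01)
verbs[:, [tokens.index("sat"), tokens.index("purred")]] = 1.0
verbs = normalize_rows(verbs)

heads = np.stack([local, subject, verbs])  # (3, n, n)
average = heads.mean(axis=0)               # still a valid distribution per row

# Inspect one query row ("sat") in each head and in the average.
row = tokens.index("sat")
for name, m in [("local", local), ("subject", subject), ("verbs", verbs), ("average", average)]:
    print(f"{name:8s}", np.round(m[row], 2))
```

Each head's row for "sat" has a clear shape, while the averaged row spreads weight across neighbors, subject, and verbs at once. That loss of structure is what keeping the heads separate avoids.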
Why heads help
Like an ensemble within a single layer, multiple heads let the model attend to different positions and representation subspaces at once. Probing studies show real heads do specialize: some track syntax, some track coreference. This is a core ingredient of the transformer behind nearly every modern LLM.