How Transformers Work


What it is

A transformer is the neural-network architecture behind every modern large language model — GPT, Claude, Gemini, Llama, BERT, T5. It reads a whole sequence of tokens at once and lets every token look at every other token through a mechanism called attention.

Older sequence models (RNNs, LSTMs) processed text strictly left-to-right, one token at a time, carrying a single hidden state forward. Transformers throw that constraint out: the entire input is in memory simultaneously, and each token decides for itself which other tokens are worth listening to.

In one sentence

Every token asks every other token "how relevant are you to me?" — and rebuilds itself as a weighted blend of the answers.

Why it replaced RNNs

The 2017 paper that introduced transformers was titled "Attention Is All You Need." The point was that recurrence wasn't necessary — attention alone, applied at scale, was enough.

RNNs / LSTMs
  • Sequential — token n can only be processed after token n-1. No parallelism.
  • Long-range decay — information from token 1 has to survive 1000 hidden-state updates to reach token 1000. Most of it doesn't.
  • Slow to train — a fixed compute budget means smaller models and shorter sequences.
Transformers
  • Parallel — every token's attention is computed at the same time on the GPU.
  • Direct connections — token 1000 can read token 1 in a single attention step. No telephone game.
  • Scales — bigger models, longer contexts, more data. The thing that made GPT-scale possible.

Attention, in one paragraph

Every token is projected into three vectors: a query (what am I looking for?), a key (what do I represent?), and a value (what would I contribute if you picked me?). To compute a token's new representation, take its query, dot-product it with every other token's key to get raw attention scores, divide by √d to keep gradients stable, run a softmax to turn scores into a probability distribution, then take a weighted sum of all the value vectors using those probabilities. The whole thing is just softmax(QKᵀ / √d) · V.
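
As a minimal NumPy sketch of that paragraph (the function name and shapes are my own, not from any particular library):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Q, K, V: (seq_len, d) arrays for one sequence.
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                    # raw attention scores
        scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax: scores -> probability distribution
        return weights @ V                               # weighted sum of value vectors

Each row of weights is one token's probability distribution over every token in the sequence.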

Query (Q) "what I'm looking for"

The current token's question — projected from its embedding by a learned weight matrix W_Q.

Key (K) "what I'm about"

Every token's advertisement of itself. Compatibility with a query is measured as Q · K.

Value (V) "what I'd pass on"

The actual content a token contributes if attended to. Outputs are weighted sums of value vectors.
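
Where Q, K, and V come from, in a toy sketch (the matrices here are random placeholders; in a real model W_Q, W_K, W_V are learned):

    import numpy as np

    d_model = 8                                          # toy embedding width
    X = np.random.randn(4, d_model)                      # embeddings of a 4-token sequence
    W_Q, W_K, W_V = (np.random.randn(d_model, d_model) for _ in range(3))

    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                  # each token gets a query, a key, and a value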

Self-attention

In self-attention, the queries, keys, and values all come from the same sequence — every token is simultaneously asking and answering. The output for each position is a context-aware version of that token, blended from the rest of the sentence by similarity weight.

Take the four-token sentence "the cat sat down". When the model computes the new representation for "sat", its query gets dot-producted against the keys of "the", "cat", "sat", and "down". The softmax might land on something like [0.05, 0.55, 0.10, 0.30] — most of the weight on "cat" (the subject), some on "down" (the modifier), almost none on "the". The new "sat" vector is that weighted average of the four value vectors. After one self-attention layer, "sat" no longer just means sat in the abstract — it carries a trace of cat sat down.
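
The same blend, spelled out with the illustrative weights above (the value vectors here are random stand-ins, not from a trained model):

    import numpy as np

    weights = np.array([0.05, 0.55, 0.10, 0.30])   # attention from "sat" to: the, cat, sat, down
    values  = np.random.randn(4, 8)                # one toy 8-dim value vector per token

    new_sat = weights @ values                     # the new "sat": a weighted average of all four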

Multi-head attention

One attention computation can only learn one kind of relationship at a time. Real language has many — subject-verb agreement, coreference, syntactic dependency, semantic similarity, position. The fix is to run several attention computations in parallel, each on its own learned projection of Q, K, V, and then combine the results.

Heads 8, 16, 32, …

Independent attention computations running side-by-side. Each head can specialise — one tracks syntax, another long-range references, another local punctuation.

Per-head subspace d / h dimensions

Each head gets its own W_Q^h, W_K^h, W_V^h matrices that project the embedding into a smaller subspace, so total compute stays comparable to single-head attention.

Concatenate stitch heads together

The h per-head outputs are concatenated back into one vector of the original width.

Output project W_O

A final linear layer mixes information across heads so the next layer can use it as a unified representation.
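
A compact NumPy sketch of the split, attend, concatenate, project cycle (shapes and names are illustrative; real implementations also add masking, dropout, and batching):

    import numpy as np

    def multi_head_attention(x, W_Q, W_K, W_V, W_O, h):
        # x: (seq_len, d_model); each W: (d_model, d_model); h: number of heads.
        seq_len, d_model = x.shape
        d_head = d_model // h
        # Project, then split the last dimension into h per-head subspaces.
        Q = (x @ W_Q).reshape(seq_len, h, d_head).transpose(1, 0, 2)   # (h, seq, d_head)
        K = (x @ W_K).reshape(seq_len, h, d_head).transpose(1, 0, 2)
        V = (x @ W_V).reshape(seq_len, h, d_head).transpose(1, 0, 2)
        scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)            # per-head attention scores
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)                      # softmax within each head
        heads = weights @ V                                            # (h, seq, d_head)
        concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)    # stitch heads back together
        return concat @ W_O                                            # final output projection

Reshaping one (d_model, d_model) projection into h chunks is equivalent to giving each head its own smaller W_Q^h, W_K^h, W_V^h, so total compute stays comparable to a single head.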

Positional encoding

Attention has a quirk: it's permutation-invariant. If you shuffle the input tokens, the attention computation produces exactly the same set of outputs (just in a different order). That's a problem — word order matters.

The core insight

Without positional encoding, "dog bites man" looks identical to "man bites dog" to a transformer.

The fix is to add a position-dependent vector to each token's embedding before it enters the first attention layer, so identical tokens at different positions arrive with different vectors.

Sinusoidal original (2017)

A fixed pattern of sines and cosines at geometrically-spaced frequencies. Not learned — chosen so relative positions correspond to predictable rotations in the encoding space.

Learned BERT, GPT-2

A trainable lookup table: one vector per absolute position. Simple, but limited to whatever max length was seen during training.

RoPE Llama, modern LLMs

Rotary Position Embedding: rotates Q and K vectors by an angle that depends on position. Encodes relative positions and extrapolates better to long contexts.
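
A sketch of the original sinusoidal scheme (assuming an even d_model; 10000 is the base frequency used in the 2017 paper):

    import numpy as np

    def sinusoidal_positions(seq_len, d_model):
        # Returns a (seq_len, d_model) array that is added to the token embeddings.
        pos = np.arange(seq_len)[:, None]                # positions 0 .. seq_len-1
        i = np.arange(0, d_model, 2)[None, :]            # even dimension indices
        angles = pos / np.power(10000.0, i / d_model)    # geometrically spaced frequencies
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)                     # even dimensions: sine
        pe[:, 1::2] = np.cos(angles)                     # odd dimensions: cosine
        return pe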

The encoder block

The encoder is a stack of identical blocks (the original paper used 6, modern models use anywhere from 12 to 96). Each block has the same four-step recipe, with a residual connection wrapping each sub-layer so gradients can flow straight through.

1 — Multi-head self-attention mix tokens

Every token gathers information from every other token in the sequence.

2 — Add & Norm residual + layer norm

Add the input back to the attention output (skip connection), then normalise. Keeps deep stacks trainable.

3 — Feed-forward per-token MLP

A small two-layer MLP applied independently to each position. This is where most of the model's parameters live and where "thinking" per token happens.

4 — Add & Norm residual + layer norm

Same trick again. The block's output has the same shape as its input, so blocks can be stacked indefinitely.
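
Putting the four steps together (a minimal sketch following the order above; self_attn and ffn stand for the sub-layers and are assumed to map (seq_len, d_model) to the same shape, and layer_norm here omits the learned scale and shift):

    import numpy as np

    def layer_norm(x, eps=1e-5):
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mean) / np.sqrt(var + eps)

    def encoder_block(x, self_attn, ffn):
        x = layer_norm(x + self_attn(x))   # steps 1-2: self-attention, then Add & Norm
        x = layer_norm(x + ffn(x))         # steps 3-4: feed-forward, then Add & Norm
        return x                           # same shape as the input, so blocks stack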

The decoder block

A decoder block looks like an encoder block plus one extra sub-layer — and with one critical change to the self-attention.

1 — Masked self-attention causal mask

Each token can only attend to itself and earlier positions. Future positions are masked out (set to −∞) before the softmax, so their attention weights come out as zero. This is what lets the decoder predict the next token without cheating during training.

2 — Cross-attention Q from decoder, K/V from encoder

The bridge between the two halves. The decoder asks questions; the encoder's final output supplies the keys and values it can pull from.

3 — Feed-forward per-token MLP

Same per-position MLP as in the encoder.

+ Add & Norm × 3 after each sub-layer

Residual connection and layer norm wrap every sub-layer above, just like the encoder.

Why mask?

During training the model sees the full target sentence, but at inference time it has to generate one token at a time. The causal mask forces training to behave like inference — token t never gets to peek at tokens t+1, t+2, …
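
A sketch of the masking step (a large negative number stands in for −∞ so the softmax sends those weights to zero; single-head, no batching):

    import numpy as np

    def causal_self_attention(Q, K, V):
        # Q, K, V: (seq_len, d) arrays from the same sequence.
        seq_len, d = Q.shape
        scores = Q @ K.T / np.sqrt(d)
        mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)   # True above the diagonal = future
        scores = np.where(mask, -1e9, scores)                          # block future positions
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V                                             # token t only blends tokens 0..t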

The full flow

End to end: tokens get embedded, positional encodings are added, self-attention scores are computed between every pair of tokens, multi-head attention splits and merges, the encoder stack produces a contextualised representation, and the decoder generates an output token using masked self-attention plus cross-attention into the encoder.

Three flavours of transformer

The original 2017 transformer had both an encoder and a decoder, designed for translation. Most modern models keep only one half — and the choice changes what the model is good at.

Encoder-only BERT, RoBERTa

Bidirectional self-attention — every token sees every other. Great for classification, embeddings, sentence understanding. Cannot generate fluently because it has no causal mask.

Decoder-only GPT, Claude, Llama

Masked self-attention only — every token sees the past, never the future. The dominant design for chatbots and code generation. Trained to predict the next token, used to write everything.

Encoder–decoder T5, original transformer

Both halves connected by cross-attention. Used when input and output are clearly different sequences — translation, summarisation, structured-to-text.

Gotchas worth knowing

Common pitfalls
  • Quadratic cost — attention compares every token to every other, so compute and memory both grow as O(n²) in sequence length. This is why context windows are bounded and why sparse / linear / sliding-window attention variants exist.
  • Mask leakage — a single off-by-one in the causal mask lets future tokens leak into past predictions during training. The model looks great in training and falls apart at inference. Always verify the mask shape.
  • Training vs. inference asymmetry — training is fully parallel: all positions computed at once. Inference is autoregressive: one token, then re-run, then one more. KV caching is what stops inference from re-computing every earlier token's keys and values at each step (see the sketch after this list).
  • Position extrapolation — a model trained on 4k-token contexts may answer fine on 4k inputs but degrade sharply on 8k. The position scheme determines how gracefully you can extend.
  • Numerical stability — without the √d divisor before softmax, attention scores blow up at high dimensions and gradients vanish. Skipping the scaling factor is a classic "why is my model not learning" bug.
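
As promised above, a minimal single-head sketch of KV caching at inference (names and shapes are mine; real implementations batch this and keep one cache per layer and per head):

    import numpy as np

    def decode_step(x_new, W_Q, W_K, W_V, cache):
        # x_new: (d_model,) embedding of the newest token; cache holds lists of past keys/values.
        q = x_new @ W_Q
        cache["K"].append(x_new @ W_K)      # only the new token's key and value are computed;
        cache["V"].append(x_new @ W_V)      # everything earlier is reused from the cache
        K = np.stack(cache["K"])            # (t, d_model)
        V = np.stack(cache["V"])
        scores = K @ q / np.sqrt(q.shape[0])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                  # attention output for the newest position only

Each step appends one new row to the cache, so attention at step t costs O(t) work instead of recomputing all earlier keys and values from scratch.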