Positional Encoding
Attention forgot about order
Self-attention treats its input as a set, not a sequence: shuffle the words and each token's output is unchanged, just shuffled along with it. But "dog bites man" and "man bites dog" mean very different things. The transformer needs a way to know where each token sits.
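The order-blindness described above can be checked directly. The following is a minimal sketch, not a real transformer layer: it assumes bare self-attention with Q = K = V = X (no learned projections), and the 2-d embeddings in `emb` are made-up values chosen only for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    # Bare self-attention: queries, keys and values are all X itself.
    # No positional information is used anywhere.
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Hypothetical embeddings, for illustration only.
emb = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]
out_a = dict(zip(a, self_attention([emb[w] for w in a])))
out_b = dict(zip(b, self_attention([emb[w] for w in b])))

# Each word gets the same output vector in both sentences
# (compared with a tolerance, since summation order differs).
same = all(
    math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-12)
    for w in emb
    for x, y in zip(out_a[w], out_b[w])
)
print(same)
```

Under these assumptions, the output for "dog" is the same whether it is the subject or the object, which is exactly the information the model is missing.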
The fix: positional encoding. Add a position-dependent vector to each token's embedding before attention. Now every token carries both what it is and where it is.
The sinusoidal signature
Watch the order problem first, then see the position vectors (a stack of sine/cosine waves at different frequencies) get added to the word embeddings.
How the classic version works
Each dimension of the encoding is a sine or cosine wave; low-index dimensions oscillate quickly, high-index dimensions slowly.
The combination of wave values gives every position a distinct vector the model can read.
Nearby positions get similar encodings, so the model can reason about how far apart tokens are.
input = token_embedding + positional_encoding. No new parameters for the sinusoidal version — it's a fixed formula.
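The steps above can be sketched in a few lines. This is an illustrative version under stated assumptions: the wavelength base of 10000 and the sin/cos interleaving follow the original Transformer paper rather than anything stated on this page, and the helper names `sinusoidal_pe` and `add_positions` are hypothetical.

```python
import math

def sinusoidal_pe(position, d_model, base=10000.0):
    # PE[2k]   = sin(position / base^(2k / d_model))
    # PE[2k+1] = cos(position / base^(2k / d_model))
    # base=10000 is an assumption taken from the original Transformer paper.
    pe = []
    for i in range(0, d_model, 2):  # i = 2k
        angle = position / (base ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

def add_positions(token_embeddings):
    # input = token_embedding + positional_encoding, one row per position.
    d_model = len(token_embeddings[0])
    return [
        [e + p for e, p in zip(tok, sinusoidal_pe(pos, d_model))]
        for pos, tok in enumerate(token_embeddings)
    ]

# The same (made-up) token embedding repeated three times.
tokens = [[0.5, -0.2, 0.1, 0.9]] * 3
inputs = add_positions(tokens)

for row in inputs:
    print([round(x, 3) for x in row])
```

Although the three token embeddings are identical, the three input rows differ, because each one carries a different position vector. Nothing here is trained: the encoding is computed from the position alone.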
Slide a position along the sequence and watch its fingerprint form. The dots where the slider line crosses each wave are the entries of that position's vector (strip on the right). Below, the dot product PE(p)·PE(q) against every other position q shows why this works: it depends only on the offset between p and q, peaks at q = p, and generally fades as q moves away.
Drag slowly: the fast wave (blue) changes at every step, so it distinguishes neighbors, while the slow wave (green) barely moves, so it encodes the coarse region. Together they pin down the exact position, and the similarity bump below follows you everywhere.
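The similarity curve can be reproduced numerically. The sketch below assumes the sinusoidal formula from the original Transformer paper (base 10000) with a hypothetical width of 64 dimensions; the position p = 20 and the range of 50 positions are arbitrary choices for illustration.

```python
import math

def sinusoidal_pe(position, d_model=64, base=10000.0):
    # Same assumed formula as the classic sinusoidal encoding:
    # sin/cos pairs sharing one frequency per pair.
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (base ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

p = 20
sims = [dot(sinusoidal_pe(p), sinusoidal_pe(q)) for q in range(50)]

# The similarity is largest at q == p ...
peak = max(range(50), key=lambda q: sims[q])
print(peak == p)

# ... and a near neighbor scores higher than a distant position.
print(sims[p + 1] > sims[p + 10])

# The dot product depends only on the offset, not on the absolute positions,
# because sin(a)sin(b) + cos(a)cos(b) = cos(a - b) for each frequency.
shift_a = dot(sinusoidal_pe(5), sinusoidal_pe(8))
shift_b = dot(sinusoidal_pe(25), sinusoidal_pe(28))
print(math.isclose(shift_a, shift_b, abs_tol=1e-9))
```

The last check is why the bump follows the slider: moving p moves the whole curve with it without changing its shape.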
Variants
- Sinusoidal (original): fixed formula, defined for any position, so it can in principle be applied to sequences longer than those seen in training
- Learned: a trainable vector per position (BERT, GPT-2)
- Rotary (RoPE): rotates Q/K by position; used in many modern LLMs
- Without any positional encoding, transformers are order-blind
- The choice of encoding affects how well models handle long contexts
- Positional encoding remains an active area of research
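To make the rotary variant in the list above more concrete, here is a rough sketch of the idea of rotating Q/K by position. It is an assumption-laden illustration, not any library's implementation: it pairs consecutive dimensions and uses base 10000 (other implementations pair dimensions differently), the names `rope` and `rotate_pair` are hypothetical, and the query and key values are made up.

```python
import math

def rotate_pair(x, y, angle):
    # Standard 2-D rotation of the point (x, y) by `angle` radians.
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def rope(vec, position, base=10000.0):
    # Rotate each consecutive pair (vec[i], vec[i+1]) by position * theta_i.
    # Assumes len(vec) is even; base=10000 is an assumed default.
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)
        out.extend(rotate_pair(vec[i], vec[i + 1], position * theta))
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Made-up query and key vectors.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same offset (4) at different absolute positions: same attention score.
score_a = dot(rope(q, 3), rope(k, 7))
score_b = dot(rope(q, 13), rope(k, 17))
print(math.isclose(score_a, score_b, abs_tol=1e-9))

# Different offset (6): the score changes.
score_c = dot(rope(q, 3), rope(k, 9))
print(abs(score_a - score_c) > 0.1)
```

In this sketch the position is never added to the embedding. It enters only through the rotation, so the query-key score depends on how far apart the two tokens are rather than on where they sit in absolute terms.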