Positional Encoding · Suman Bhadra Notes

Attention forgot about order

Self-attention treats its input as a set, not a sequence: shuffle the words and each token gets exactly the same output as before, just in the shuffled order. But "dog bites man" and "man bites dog" mean very different things. The transformer needs a way to know where each token sits.
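To see the order-blindness concretely, here is a minimal sketch in plain Python. It assumes a single attention head with identity query/key/value projections, and the 2-d embeddings in `emb` are made up for illustration; a real model would use learned projections, but the same property holds.

```python
import math

def self_attention(xs):
    """Single-head self-attention with identity Q/K/V projections and no positions."""
    d = len(xs[0])
    out = []
    for q in xs:
        # Scaled dot-product score of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]  # numerically stable softmax
        total = sum(weights)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, xs)) / total for i in range(d)])
    return out

# Hypothetical embeddings, for illustration only.
emb = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

a = self_attention([emb[w] for w in ["dog", "bites", "man"]])
b = self_attention([emb[w] for w in ["man", "bites", "dog"]])

# "dog" is first in one sentence and last in the other, yet its output matches.
same = all(math.isclose(x, y) for x, y in zip(a[0], b[2]))
print(same)
```

Nothing in the computation refers to a token's index, so the two sentences are indistinguishable to this layer.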

The fix: positional encoding. Add a position-dependent vector to each token's embedding before attention. Now every token carries both what it is and where it is.

The sinusoidal signature

Watch the order problem first; then watch the position vectors, a stack of sine/cosine waves at different frequencies, get added to the word embeddings.

How the classic version works

Sines & cosines: many frequencies

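Each dimension of the encoding is a sine or cosine wave. Low-index dimensions oscillate quickly and high-index dimensions slowly; in the original transformer the wavelengths grow geometrically from 2π to 10000·2π.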
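As a rough sketch, the encoding for one position can be computed as below. This assumes the formula from the original transformer paper, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)); the function name `sinusoidal_pe` and the small `d_model` are choices made here for illustration.

```python
import math

def sinusoidal_pe(pos, d_model, base=10000.0):
    """Sinusoidal encoding for a single position (d_model is assumed to be even)."""
    pe = []
    for i in range(d_model // 2):
        # The frequency shrinks geometrically as the pair index i grows,
        # so early dimensions change fast and late dimensions change slowly.
        freq = 1.0 / (base ** (2 * i / d_model))
        pe.append(math.sin(pos * freq))  # even dimension 2i
        pe.append(math.cos(pos * freq))  # odd dimension 2i + 1
    return pe

d_model = 8  # tiny size, chosen only to keep the printout readable
for pos in range(3):
    print(pos, [round(v, 3) for v in sinusoidal_pe(pos, d_model)])
```

Reading the printed rows left to right, the first pair of entries changes noticeably from one position to the next, while the last pair barely moves.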

Unique per position: a fingerprint

The combination of wave values gives every position a distinct vector the model can read.

Relative distances: smooth

Nearby positions get similar encodings, so the model can reason about how far apart tokens are.
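One way to make "similar encodings" precise is to work out the dot product of two position vectors. The derivation below assumes the standard sine/cosine pairing from the original transformer paper; the symbols ω_i (frequency of pair i) and d (encoding size) are notation introduced here, not taken from the page.

```latex
\omega_i = 10000^{-2i/d}, \qquad
PE(p)_{2i} = \sin(\omega_i p), \quad PE(p)_{2i+1} = \cos(\omega_i p)

PE(p) \cdot PE(q)
  = \sum_{i=0}^{d/2-1} \bigl[\sin(\omega_i p)\sin(\omega_i q) + \cos(\omega_i p)\cos(\omega_i q)\bigr]
  = \sum_{i=0}^{d/2-1} \cos\bigl(\omega_i (p - q)\bigr)
```

The result depends only on the offset p - q, not on where the pair sits in the sequence, and it reaches its maximum of d/2 exactly when p = q.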

Just add it

input = token_embedding + positional_encoding. No new parameters for the sinusoidal version — it's a fixed formula.
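A small sketch of the addition step, assuming the sinusoidal formula from the original transformer paper. The 4-d values in `token_embedding` and the helper `encode` are hypothetical and exist only for this example.

```python
import math

def sinusoidal_pe(pos, d_model, base=10000.0):
    """Sinusoidal encoding for a single position (d_model is assumed to be even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (base ** (2 * i / d_model))
        pe += [math.sin(angle), math.cos(angle)]
    return pe

# Hypothetical 4-d token embeddings, for illustration only.
token_embedding = {
    "dog": [0.2, -0.1, 0.4, 0.3],
    "bites": [-0.3, 0.5, 0.1, 0.0],
    "man": [0.1, 0.2, -0.4, 0.6],
}

def encode(words, d_model=4):
    # input = token_embedding + positional_encoding, element by element.
    return [
        [e + p for e, p in zip(token_embedding[w], sinusoidal_pe(pos, d_model))]
        for pos, w in enumerate(words)
    ]

x1 = encode(["dog", "bites", "man"])
x2 = encode(["man", "bites", "dog"])

# "bites" stays at position 1, so its input vector is unchanged;
# "dog" moves from position 0 to position 2, so its input vector differs.
print(x1[1] == x2[1], x1[0] != x2[2])  # prints: True True
```

After the addition, the same word at two different positions is no longer the same input, which is what lets attention tell the two sentences apart.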

Slide a position along the sequence and watch its fingerprint form. The dots where the slider line crosses each wave are the entries of that position's vector (strip on the right). Below, the dot product PE(p)·PE(q) against every other position q shows why this works: it peaks at p and falls off with distance near the peak — the ripples further out come from the individual frequencies, but no other position ever matches the peak.
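The similarity curve can be reproduced offline with a few lines of plain Python. This is a sketch under the assumption that the page uses the standard formula from the original transformer paper; `d_model`, `seq_len` and `p` are illustrative values, not the ones used by the interactive figure.

```python
import math

def sinusoidal_pe(pos, d_model, base=10000.0):
    """Sinusoidal encoding for a single position (d_model is assumed to be even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (base ** (2 * i / d_model))
        pe += [math.sin(angle), math.cos(angle)]
    return pe

d_model, seq_len, p = 32, 50, 20  # illustrative sizes
pe = [sinusoidal_pe(q, d_model) for q in range(seq_len)]

# Dot product of position p's vector against every position q.
sim = [sum(a * b for a, b in zip(pe[p], pe[q])) for q in range(seq_len)]

peak = max(range(seq_len), key=lambda q: sim[q])
print("peak at", peak)
for q in range(p - 3, p + 4):
    print(q, round(sim[q], 2))
```

The value at q = p equals d_model / 2 (up to floating-point rounding), because each sine/cosine pair contributes sin² + cos² = 1. Every other position scores strictly lower.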


Drag slowly: the fast wave (blue) changes every step — it distinguishes neighbors — while the slow wave (green) barely moves — it encodes the coarse region. Together they pin down the exact position, and the similarity bump below follows you everywhere.

Variants

Common approaches

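Sinusoidal (original): fixed formula; defined for any position, so it can in principle be applied to sequences longer than those seen in training
Learned: a trainable vector per position, up to a fixed maximum length (BERT, GPT-2)
Rotary (RoPE): rotates the query and key vectors by a position-dependent angle, so attention scores depend on relative offsets; used in many modern LLMs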
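To give a feel for the rotary idea, here is a minimal sketch of one common formulation, in which adjacent pairs of dimensions are rotated by an angle proportional to the position. The helper names `rope` and `dot` and the example vectors are hypothetical; production implementations differ in details such as how dimensions are paired.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to pos."""
    d = len(vec)  # assumed to be even
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))  # slower rotation for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]  # standard 2-d rotation
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors, for illustration only.
q = [0.3, -0.2, 0.5, 0.1]
k = [0.4, 0.1, -0.3, 0.2]

# The same offset (3) at two different absolute positions gives the same score.
s1 = dot(rope(q, 5), rope(k, 2))
s2 = dot(rope(q, 105), rope(k, 102))
print(math.isclose(s1, s2, abs_tol=1e-9))
```

Because a rotation preserves vector length and composes by adding angles, the attention score ends up depending only on how far apart the query and key are.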

Why it matters

Without it, transformers are order-blind
The choice of encoding affects how well a model handles long contexts
An active area of research