Positional Encoding

Attention forgot about order

Self-attention treats its input as a set, not a sequence: shuffle the words and each token gets exactly the same output, just in the shuffled order. But "dog bites man" and "man bites dog" mean very different things. The transformer needs a way to know where each token sits.

The fix: positional encoding. Add a position-dependent vector to each token's embedding before attention. Now every token carries both what it is and where it is.

The sinusoidal signature

Watch the order problem first, then watch the position vectors (a stack of sine and cosine waves at different frequencies) get added to the word embeddings.

How the classic version works

Sines & cosines: many frequencies

Each dimension of the encoding is a sine or cosine wave over position; low-index dimensions oscillate quickly, high-index dimensions slowly.
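
For reference, the fixed formula from the original Transformer paper can be written as below. The symbols pos (position), i (index of a sine/cosine pair), and d_model (embedding width) are the paper's notation, not names used elsewhere on this page.

```latex
\begin{aligned}
PE(pos,\, 2i)   &= \sin\!\left(\frac{pos}{10000^{\,2i/d_{\text{model}}}}\right) \\
PE(pos,\, 2i+1) &= \cos\!\left(\frac{pos}{10000^{\,2i/d_{\text{model}}}}\right)
\end{aligned}
```

Small i gives a small denominator and therefore a fast wave; as i grows the denominator approaches 10000 and the wave slows down.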

Unique per position: a fingerprint

The combination of wave values gives every position a distinct vector the model can read.

Relative distances: smooth

Nearby positions get similar encodings, so the model can reason about how far apart tokens are.
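
One way to see why distance is readable, sketched under the assumption that dimensions come in sine/cosine pairs sharing a frequency ω_i (a symbol introduced here for illustration): the dot product of two encodings collapses to a function of the offset alone.

```latex
PE(p) \cdot PE(q)
  = \sum_i \bigl[ \sin(\omega_i p)\sin(\omega_i q) + \cos(\omega_i p)\cos(\omega_i q) \bigr]
  = \sum_i \cos\bigl(\omega_i\,(p - q)\bigr)
```

Every term equals 1 when p = q, so the sum is largest there, and the absolute positions drop out entirely.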

Just add it

input = token_embedding + positional_encoding. No new parameters for the sinusoidal version — it's a fixed formula.
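
A minimal sketch of that addition in plain Python, assuming the usual base of 10000 and an even embedding width. The function names and the toy embedding values are made up for illustration; a real model would do this with tensors.

```python
import math

def sinusoidal_encoding(position, d_model, base=10000.0):
    """Return the encoding vector for one position (d_model assumed even)."""
    vec = []
    for i in range(0, d_model, 2):
        # i is the even dimension index, so this is 1 / base^(2k / d_model).
        freq = 1.0 / (base ** (i / d_model))
        vec.append(math.sin(position * freq))  # even dimension: sine
        vec.append(math.cos(position * freq))  # odd dimension: cosine
    return vec

def add_positions(token_embeddings):
    """Add the encoding for index p to the p-th token embedding."""
    d_model = len(token_embeddings[0])
    return [
        [e + pe for e, pe in zip(emb, sinusoidal_encoding(p, d_model))]
        for p, emb in enumerate(token_embeddings)
    ]

# Toy input: three tokens with hypothetical 4-dimensional embeddings.
embeddings = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.1, 0.2, 0.3, 0.4],  # same token as position 0
]
inputs = add_positions(embeddings)
```

The first and third tokens share an embedding, but their entries in inputs differ because each was shifted by a different position vector. Nothing here is trained.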

Slide a position along the sequence and watch its fingerprint form. The dots where the slider line crosses each wave are the entries of that position's vector (shown in the strip on the right). Below, the dot product PE(p)·PE(q) against every other position q shows why this works: it depends only on the offset p - q, peaks at q = p, and generally fades as q moves away, with small ripples.

Drag slowly: the fast wave (blue) changes at every step, so it distinguishes neighbors, while the slow wave (green) barely moves, so it encodes the coarse region. Together they pin down the exact position, and the similarity bump below follows you everywhere.
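
The similarity bump can be reproduced numerically. This is a sketch in plain Python, assuming a 64-dimensional encoding with base 10000; the helper names and the choice of p = 20 are illustrative only.

```python
import math

def sinusoidal_encoding(position, d_model=64, base=10000.0):
    """Sine/cosine pairs at geometrically spaced frequencies."""
    vec = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (base ** (i / d_model))
        vec.extend((math.sin(position * freq), math.cos(position * freq)))
    return vec

def similarity(p, q, d_model=64):
    """Dot product PE(p)·PE(q)."""
    a = sinusoidal_encoding(p, d_model)
    b = sinusoidal_encoding(q, d_model)
    return sum(x * y for x, y in zip(a, b))

p = 20
scores = [similarity(p, q) for q in range(50)]

# Position whose encoding is most similar to PE(p).
best_q = max(range(50), key=lambda q: scores[q])

# Two pairs with the same offset (3) but different absolute positions.
same_offset_gap = abs(similarity(5, 8) - similarity(25, 28))
```

Here best_q comes out equal to p, and same_offset_gap is zero up to floating-point error, which is the "follows you everywhere" behavior: the curve keeps its shape and simply recenters on whichever position is selected.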

Variants

Common approaches
  • Sinusoidal (original): fixed; defined for any position, so it may extrapolate to longer sequences
  • Learned: a trainable vector per position (BERT, GPT-2)
  • Rotary (RoPE): rotates Q/K by position; used in many modern LLMs
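
To make the rotary idea concrete, here is a rough sketch in plain Python. It assumes adjacent dimensions are paired and rotated by an angle proportional to position, with the same 10000 base as the sinusoidal version. Real implementations differ in how they pair dimensions, and the vectors and function names below are invented for illustration.

```python
import math

def rotate(vec, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend((x * c - y * s, x * s + y * c))  # 2-D rotation
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]  # hypothetical query vector
k = [1.0, 0.4, -0.6, 0.9]  # hypothetical key vector

# Same offset (2) at two different absolute locations.
score_a = dot(rotate(q, 3), rotate(k, 1))
score_b = dot(rotate(q, 12), rotate(k, 10))
```

The two scores agree up to rounding, so the attention score between a rotated query and key depends on how far apart the tokens are rather than on where they sit.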

Why it matters
  • Without it, transformers are order-blind
  • Choice affects how well models handle long contexts
  • An active area of research