Layer Normalization · Suman Bhadra Notes

Same recipe as BatchNorm — different axis

Batch normalization normalizes each feature across the batch: it slices the activation matrix column by column. Layer normalization flips the axis: it normalizes each sample (in a transformer, each token) across its own features, row by row. The recipe is identical: subtract the mean, divide by the standard deviation, then re-scale with a learnable γ and shift with a learnable β. The only thing that changes is which numbers the mean and standard deviation are computed over, and that one flip is why transformers are built on LayerNorm and its variants rather than BatchNorm.

The formula (one token at a time)

For a token's features x₁ … x_d: compute μ = mean and σ² = variance of just those d numbers (dividing by d, not d − 1), then x̂ᵢ = (xᵢ − μ) / √(σ² + ε), where ε is a small constant that keeps the division stable, and output γᵢ·x̂ᵢ + βᵢ. γ and β are learnable, one per feature, so the network can re-stretch or re-center if that helps. (Ba, Kiros & Hinton, 2016.)
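The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production kernel: the function name `layer_norm`, the default `eps=1e-5`, and the example vector are assumptions made for the sketch.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's feature vector x (a list of d floats)."""
    d = len(x)
    mu = sum(x) / d                              # mean over this token's d features
    var = sum((xi - mu) ** 2 for xi in x) / d    # population variance (divide by d)
    inv_std = 1.0 / math.sqrt(var + eps)
    x_hat = [(xi - mu) * inv_std for xi in x]    # zero mean, roughly unit variance
    # Per-feature learnable scale and shift.
    return [g * xh + b for g, xh, b in zip(gamma, x_hat, beta)]

# Illustrative input: one token with d = 4 features, gamma = 1 and beta = 0.
token = [1.0, 2.0, 3.0, 4.0]
out = layer_norm(token, gamma=[1.0] * 4, beta=[0.0] * 4)

out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
```

With γ = 1 and β = 0 the output has mean close to 0 and variance just under 1 (the small gap comes from ε).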

Watch the axis flip

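The same small activation matrix can be sliced two ways: BatchNorm grabs a column (one feature, every sample), LayerNorm grabs a row (one token, every feature). When the batch shrinks to a single sequence, each BatchNorm column holds only one value, so there is nothing left to compute statistics from, while each LayerNorm row is unaffected.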
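The two slicings can be tried on a toy matrix. This is a rough sketch that leaves out γ and β to keep the axis difference visible; the helper names and the numbers in `acts` are made up for illustration.

```python
import math

def normalize(values, eps=1e-5):
    """Subtract the mean and divide by the std of the given list."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return [(v - mu) / math.sqrt(var + eps) for v in values]

def batch_norm_columns(matrix):
    # BatchNorm-style: normalize each column (one feature, every sample).
    cols = [normalize(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*cols)]

def layer_norm_rows(matrix):
    # LayerNorm-style: normalize each row (one sample, every feature).
    return [normalize(row) for row in matrix]

# Rows are samples (tokens), columns are features.
acts = [[1.0, 10.0, 100.0],
        [2.0, 20.0, 200.0],
        [3.0, 30.0, 300.0]]

bn = batch_norm_columns(acts)
ln = layer_norm_rows(acts)

# Shrink the batch to a single row.
single = [acts[0]]
bn_single = batch_norm_columns(single)  # each column has one value, so x - mean is 0 everywhere
ln_single = layer_norm_rows(single)     # identical to the first row of ln
```

With one row, the column-wise version collapses to all zeros, while the row-wise version returns exactly what it returned for that row in the full batch.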

Why transformers picked LayerNorm

A transformer processes sequences of wildly different lengths, often one at a time at inference. BatchNorm's "statistics across the batch" assumption falls apart there; LayerNorm never needed a batch in the first place.

Any batch size (even 1)

Each token computes its own mean and std from its own features — no other sample required.

No train/test mismatch (no running averages)

BatchNorm must save running statistics for inference. LayerNorm does the exact same computation in training and inference.
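To see where the mismatch comes from, here is a toy single-feature BatchNorm with a training flag. The class `ToyBatchNorm1Feature`, its momentum value, and the exponential-average update rule are simplifying assumptions for this sketch (real libraries differ in details such as the variance estimator), and γ and β are left out.

```python
import math

class ToyBatchNorm1Feature:
    """Hypothetical BatchNorm for a single feature (no gamma/beta)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0

    def __call__(self, batch, training):
        if training:
            # Training: use this batch's statistics and update the running ones.
            n = len(batch)
            mean = sum(batch) / n
            var = sum((v - mean) ** 2 for v in batch) / n
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference: fall back to the saved running statistics.
            mean, var = self.running_mean, self.running_var
        return [(v - mean) / math.sqrt(var + self.eps) for v in batch]

bn = ToyBatchNorm1Feature()
batch = [4.0, 6.0, 8.0]
train_out = bn(batch, training=True)
eval_out = bn(batch, training=False)   # same input, different output
```

The same input produces different outputs in the two modes until the running averages have converged. A LayerNorm function needs no `training` argument at all, because it has no state beyond γ and β.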

Variable-length sequences (per-token)

Every token normalizes itself independently, so padding and sequence length have no effect on any real token's normalization statistics.
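A quick way to check this independence is to normalize a short sequence with and without padding and compare the real tokens. The all-zero `PAD` vector and the feature values are made up for this sketch, and γ and β are omitted.

```python
import math

def layer_norm_token(x, eps=1e-5):
    """LayerNorm for one token, without gamma/beta."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def layer_norm_sequence(tokens):
    # Each token is handled on its own; no statistic crosses token boundaries.
    return [layer_norm_token(t) for t in tokens]

PAD = [0.0, 0.0, 0.0]                       # hypothetical padding vector
short = [[0.5, -1.0, 2.0], [3.0, 1.0, -2.0]]
padded = short + [PAD, PAD]                 # same sequence padded to length 4

out_short = layer_norm_sequence(short)
out_padded = layer_norm_sequence(padded)

# The real tokens come out identical no matter how much padding follows them.
same_real_tokens = out_padded[:2] == out_short
```

The padded positions still produce outputs of their own; ignoring them is the job of the attention mask and the loss, not of the normalization.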

Where it sits: pre-LN vs post-LN

Inside a transformer block, the norm can go before the attention/FFN sub-layer (inside the residual branch) or after the residual add. The original 2017 Transformer used post-LN; almost every modern LLM uses pre-LN.

Pre-LN (modern default)

Norm before attention / FFN, inside the residual branch
The skip path stays untouched → clean gradient flow through deep stacks
Stable from step one, much less warm-up fuss

Post-LN (the original)

Norm after the residual add
Gradients must pass through every norm → fragile in deep stacks
Typically needs careful learning-rate warm-up to train stably
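The two placements differ only in where the norm sits relative to the residual add. The sketch below uses a made-up `sublayer` as a stand-in for attention or the FFN, and a norm without γ and β; it shows the wiring only, not training behavior.

```python
import math

def norm(x, eps=1e-5):
    """LayerNorm without gamma/beta."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    # Hypothetical stand-in for attention or the FFN.
    return [0.5 * v + 1.0 for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def pre_ln_block(x):
    # Pre-LN: x + f(LN(x)). The skip path carries x through untouched.
    return add(x, sublayer(norm(x)))

def post_ln_block(x):
    # Post-LN: LN(x + f(x)). The norm sits on the main path after the add.
    return norm(add(x, sublayer(x)))

x = [1.0, 2.0, 4.0]
pre, post = x, x
for _ in range(4):          # stack four blocks of each kind
    pre = pre_ln_block(pre)
    post = post_ln_block(post)
```

In the pre-LN stack the input is still present in the output as an additive term, so the residual stream accumulates. In the post-LN stack every block's output is re-normalized, so the signal and its gradient pass through a norm at each layer.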

RMSNorm: the lightweight modern variant

RMSNorm (Zhang & Sennrich, 2019) starts from the hypothesis that most of LayerNorm's benefit comes from the re-scaling, not the re-centering. So it skips the mean entirely: divide each token's features by their RMS, √(mean of xᵢ²), and scale by a learnable γ. No mean, no β, fewer operations, and comparable quality in the paper's experiments.
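In code, RMSNorm is a shorter function than LayerNorm. This is a minimal sketch: the name `rms_norm` is illustrative, and the small `eps` inside the square root is an assumption borrowed from common implementations, since the description above does not include one.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """Divide one token's features by their root mean square, then scale by gamma."""
    mean_sq = sum(v * v for v in x) / len(x)   # mean of x_i squared
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)   # eps is an assumed stabilizer
    return [g * v * inv_rms for g, v in zip(gamma, x)]

# Illustrative input: one token with 4 features and gamma = 1.
token = [1.0, 2.0, 3.0, 4.0]
out = rms_norm(token, gamma=[1.0] * 4)

out_rms = math.sqrt(sum(v * v for v in out) / len(out))   # close to 1
out_mean = sum(out) / len(out)                            # not zero: no centering
```

The output has RMS close to 1, but its mean is left wherever the input put it, which is exactly the step RMSNorm chooses not to pay for.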

What the big open models use

Llama, Qwen, DeepSeek and most other open-source LLM architectures use pre-norm RMSNorm in every block. When you read a modern model card, "RMSNorm" is just LayerNorm with the mean-centering (and β) dropped.
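One way to check the "LayerNorm minus centering" reading is to feed both functions an input whose mean is already zero, where they should agree, and then a shifted copy, where they should not. Both functions below are illustrative sketches with an assumed shared `eps`, not any particular model's implementation.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-6):
    d = len(x)
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    return [g * (v - mu) / math.sqrt(var + eps) + b
            for g, v, b in zip(gamma, x, beta)]

def rms_norm(x, gamma, eps=1e-6):
    mean_sq = sum(v * v for v in x) / len(x)
    return [g * v / math.sqrt(mean_sq + eps) for g, v in zip(gamma, x)]

gamma = [1.0, 0.5, 2.0, 1.5]      # arbitrary example scales
zeros = [0.0] * 4                 # beta = 0, matching RMSNorm's missing shift

centered = [-3.0, -1.0, 1.0, 3.0]           # mean is exactly 0
shifted = [v + 10.0 for v in centered]      # same shape, mean moved to 10

# Zero-mean input: the variance equals the mean square, so the two agree.
agree_when_centered = all(
    math.isclose(a, b, abs_tol=1e-12)
    for a, b in zip(layer_norm(centered, gamma, zeros), rms_norm(centered, gamma))
)

# Shifted input: LayerNorm removes the offset, RMSNorm does not.
differ_when_shifted = any(
    abs(a - b) > 0.1
    for a, b in zip(layer_norm(shifted, gamma, zeros), rms_norm(shifted, gamma))
)
```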