Layer Normalization
Same recipe as BatchNorm — different axis
Batch normalization normalizes each feature across the batch: it slices the activation matrix column by column. Layer normalization flips the axis: it normalizes each sample (each token) across its own features, row by row. The recipe is identical: subtract the mean, divide by the standard deviation, then re-scale with a learnable γ and shift with a learnable β. The only thing that changes is which numbers you compute the mean and standard deviation over, and that one flip is why transformers are built on LayerNorm (or its close relative RMSNorm) rather than BatchNorm.
For a token's features x₁ … x_d: compute the mean μ and variance σ² of just those d numbers, then x̂ᵢ = (xᵢ − μ) / √(σ² + ε), where ε is a small constant that keeps the division stable, and output γᵢ·x̂ᵢ + βᵢ. γ and β are learnable, one per feature, so the network can re-stretch or re-center if that helps. (Ba, Kiros & Hinton, 2016.)
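The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation: the function name `layer_norm`, the default `eps=1e-5`, and the input numbers are all assumptions made for the example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row (token) of x across its last axis (features).

    eps=1e-5 is an assumed default; the text only says "ε".
    """
    mu = x.mean(axis=-1, keepdims=True)     # one mean per token
    var = x.var(axis=-1, keepdims=True)     # one variance per token (divides by d)
    x_hat = (x - mu) / np.sqrt(var + eps)   # x̂ᵢ = (xᵢ − μ) / √(σ² + ε)
    return gamma * x_hat + beta             # per-feature scale and shift

# Two tokens with four features each (made-up numbers).
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 0.0, -10.0, 0.0]])
gamma = np.ones(4)    # scale starts as the identity
beta = np.zeros(4)    # shift starts at zero

y = layer_norm(x, gamma, beta)
row_means = y.mean(axis=-1)   # close to 0 for every token
row_stds = y.std(axis=-1)     # close to 1 for every token
```

With γ = 1 and β = 0, every row comes out with mean near 0 and standard deviation near 1, no matter how differently scaled the two tokens were going in.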
Watch the axis flip
The same little activation matrix, sliced two ways: BatchNorm grabs a column (one feature, every sample), LayerNorm grabs a row (one token, every feature). Then see what happens when the batch shrinks to a single sequence.
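Here is a rough NumPy version of the same comparison. It is a sketch under simplifying assumptions: the helper `normalize` is hypothetical, γ and β are left out, and the "BatchNorm" side uses plain batch statistics with no running averages.

```python
import numpy as np

def normalize(x, axis, eps=1e-5):
    """Subtract the mean and divide by the std along the given axis."""
    mu = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# 4 samples (rows) x 3 features (columns), made-up activations.
acts = np.array([[1.0, 20.0, 300.0],
                 [2.0, 40.0, 100.0],
                 [3.0, 60.0, 200.0],
                 [6.0, 80.0, 400.0]])

batch_normed = normalize(acts, axis=0)   # BatchNorm-style: statistics per column
layer_normed = normalize(acts, axis=1)   # LayerNorm-style: statistics per row

# Shrink the batch to a single sample.
single = acts[:1]                        # shape (1, 3)
bn_single = normalize(single, axis=0)    # each column holds one number: variance is 0
ln_single = normalize(single, axis=1)    # the row still has 3 features to work with
```

With a single sample, the column-wise version subtracts each value from itself and returns all zeros, while the row-wise version gives exactly what that sample got inside the larger batch.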
Why transformers picked LayerNorm
A transformer processes sequences of wildly different lengths, often one at a time at inference. BatchNorm's "statistics across the batch" assumption falls apart there; LayerNorm never needed a batch in the first place.
Each token computes its own mean and std from its own features — no other sample required.
BatchNorm must save running statistics for inference. LayerNorm does the exact same computation in training and inference.
Every token normalizes itself independently, so padding and sequence length have no effect on any token's statistics.
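A quick way to check the independence claim is to normalize a short sequence with and without padding. This is only a sketch: the sizes, the random inputs, and the all-zero padding rows are assumptions, and γ and β are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-token normalization over the feature axis (no γ or β, for brevity)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
seq = rng.normal(size=(3, 8))                  # 3 real tokens, 8 features each
pad = np.zeros((2, 8))                         # 2 padding tokens (assumed all-zero)
padded = np.concatenate([seq, pad], axis=0)    # sequence padded to length 5

out_alone = layer_norm(seq)
out_padded = layer_norm(padded)

# The real tokens get the same output whether or not padding is present.
same_tokens = np.allclose(out_alone, out_padded[:3])

# Column-wise (batch-style) means, by contrast, shift once padding rows are added.
col_mean_alone = seq.mean(axis=0)
col_mean_padded = padded.mean(axis=0)
```

The per-token outputs match because no row ever looks at another row. The column means differ, which is the kind of contamination a batch-axis statistic would pick up from padding.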
Where it sits: pre-LN vs post-LN
Inside a transformer block, the norm can go before the attention/FFN sub-layer (inside the residual branch) or after the residual add. The original 2017 Transformer used post-LN; almost every modern LLM uses pre-LN.
Pre-LN:
- Norm before attention / FFN, inside the residual branch
- The skip path stays untouched → clean gradient flow through deep stacks
- Stable from step one, much less warm-up fuss
Post-LN:
- Norm after the residual add
- Gradients must pass through every norm → fragile in deep stacks
- Needs careful learning-rate warm-up to train at all
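The two placements differ only in where the norm wraps the residual add. The sketch below makes that concrete with a toy stand-in for the sub-layer: `sublayer`, `pre_ln_step`, and `post_ln_step` are hypothetical names, and the tanh-of-a-linear-map is a placeholder for attention or the FFN, not a real one.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-token normalization over the feature axis (no γ or β, for brevity)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer(x, w):
    """Toy stand-in for attention or the FFN (an assumption for illustration)."""
    return np.tanh(x @ w)

def pre_ln_step(x, w):
    # Norm sits inside the residual branch; the skip path carries x unchanged.
    return x + sublayer(layer_norm(x), w)

def post_ln_step(x, w):
    # Norm sits after the residual add, so it is on the main path.
    return layer_norm(x + sublayer(x, w))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))           # 4 tokens, 8 features
w = rng.normal(size=(8, 8)) * 0.1     # made-up sub-layer weights

pre_out = pre_ln_step(x, w)
post_out = post_ln_step(x, w)

# In pre-LN, subtracting the branch output recovers the input exactly as it came in.
skip_preserved = np.allclose(pre_out - sublayer(layer_norm(x), w), x)
```

In the pre-LN version, `x` reaches the output through a plain addition, so a gradient can follow that path through every block without meeting a norm. In the post-LN version, everything, including the skip path, passes through `layer_norm` at each block.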
RMSNorm: the lightweight modern variant
RMSNorm (Zhang & Sennrich, 2019) starts from the hypothesis that most of LayerNorm's benefit comes from the re-scaling, not the re-centering. So it skips the mean entirely: just divide each token's features by their RMS, √(mean of xᵢ²), and scale by a learnable γ. No mean, no β, fewer operations, and in the paper's experiments comparable quality.
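As a minimal sketch, RMSNorm is a two-line function in NumPy. The name `rms_norm` and the inputs are made up for the example, and the small `eps` inside the square root is an assumption added for numerical safety; it does not appear in the description above.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """Divide each token's features by their root mean square, then scale by γ.

    eps is an assumed stabilizer; there is no mean subtraction and no β.
    """
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

# Two tokens, four features each (made-up numbers).
x = np.array([[3.0, 4.0, 0.0, 0.0],    # mean of squares = 25/4, so RMS = 2.5
              [1.0, 1.0, 1.0, 1.0]])   # RMS = 1
gamma = np.ones(4)

y = rms_norm(x, gamma)
out_rms = np.sqrt(np.mean(y ** 2, axis=-1))   # close to 1 for every token
```

The second token shows the difference from LayerNorm: its features are all equal, so LayerNorm would subtract the mean and return zeros, while RMSNorm leaves the values at roughly 1 because nothing is re-centered.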
Llama, Qwen, DeepSeek and most other open-source LLM architectures use pre-norm RMSNorm in every block. When you read a modern model card, "RMSNorm" is just LayerNorm with the mean-centering (and β) dropped.