Batch Normalization
Keep activations well-behaved
As a network trains, the distribution of each layer's inputs keeps shifting — every weight update changes what the next layer sees. That moving target slows training. Batch normalization steadies it by normalizing each layer's activations on the fly.
1. Subtract the batch mean, divide by the batch std → mean 0, variance 1.
2. Scale and shift by learnable γ and β so the network can undo it if needed.
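The two steps can be sketched in a few lines of NumPy. This is a minimal illustration for a (batch, features) array, not a reference implementation: the function name `batch_norm_train` is hypothetical, and the small `eps` added under the square root is an assumption made here for numerical stability (the text above does not mention it).

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize a (batch, features) array per feature, then scale and shift."""
    mean = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # step 1: mean 0, variance ~1
    return gamma * x_hat + beta               # step 2: learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # an off-center, wide batch

# gamma = 1 and beta = 0 leave the normalized values unchanged.
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0))  # each entry close to 0
print(y.var(axis=0))   # each entry close to 1

# Choosing gamma = batch std and beta = batch mean undoes the normalization.
undone = batch_norm_train(x, gamma=np.sqrt(x.var(axis=0) + 1e-5), beta=x.mean(axis=0))
```

The last call shows what "undo it if needed" means: if the best setting for a layer is the original, unnormalized distribution, γ and β can in principle recover it.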
Watch the distribution snap into shape
A messy batch of activations gets centered and scaled to a clean distribution, then re-scaled by the learnable parameters.
Why it helps
Keeping activations in a stable range mitigates vanishing/exploding gradients and makes training less sensitive to weight initialization.
The per-batch statistics add a little noise — a mild regularizing effect.
Practical notes
- Usually placed before the activation
- At test time, uses the running mean/var accumulated during training, not the current batch's statistics
- Workhorse of CNNs (ResNet, etc.)
- Weak with tiny batches (noisy stats)
- Awkward in RNNs → use LayerNorm instead
- Transformers use LayerNorm, not BatchNorm
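The train/test distinction in the notes above can be sketched as a tiny layer that tracks running statistics. Everything here is illustrative: `SimpleBatchNorm` is a hypothetical class, not a library API, and the exponential moving average with `momentum=0.1` and the initial values of the running statistics are assumptions chosen for the example.

```python
import numpy as np

class SimpleBatchNorm:
    """Hypothetical minimal batch-norm layer for (batch, features) inputs."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum  # assumed update rate for the running stats
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Training: normalize with this batch's statistics...
            mean, var = x.mean(axis=0), x.var(axis=0)
            # ...and fold them into the running averages.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Test time: use the stored running statistics, not the batch.
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = SimpleBatchNorm(num_features=3)
for _ in range(200):
    bn(rng.normal(loc=2.0, scale=0.5, size=(32, 3)), training=True)

single = np.array([[2.0, 2.0, 2.0]])
out = bn(single, training=False)  # a batch of one works at test time
print(bn.running_mean)            # drifts toward the data mean of 2.0
```

In training mode a batch of one would have zero variance and every output would collapse to β, which is one way to see both why test time relies on running statistics and why tiny batches give noisy results.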
LayerNorm (normalize across features per example), GroupNorm, InstanceNorm — same idea, different axis. LayerNorm dominates NLP and transformers.
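"Same idea, different axis" can be made concrete with one helper applied over different axes. This is a sketch under stated assumptions: the `normalize` helper is hypothetical, the input uses an image-style (N, C, H, W) layout, the learnable scale and shift are omitted, and the group count of 2 is arbitrary. In transformers, LayerNorm is typically applied over the feature dimension of each token rather than over (C, H, W).

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Zero-mean, unit-variance normalization over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 6, 4, 4))  # (N, C, H, W)

batch_norm = normalize(x, axes=(0, 2, 3))   # per channel, across the batch
layer_norm = normalize(x, axes=(1, 2, 3))   # per example, across all features
instance_norm = normalize(x, axes=(2, 3))   # per example and per channel

# GroupNorm: split the channels into groups, normalize within each group.
groups = 2
xg = x.reshape(8, groups, 6 // groups, 4, 4)
group_norm = normalize(xg, axes=(2, 3, 4)).reshape(x.shape)
```

Only BatchNorm reduces over axis 0, the batch axis. The other three compute statistics within a single example, which is why they do not depend on batch size.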