Batch Normalization · Suman Bhadra Notes

Keep activations well-behaved

As a network trains, the distribution of each layer's inputs keeps shifting, because every weight update changes what the next layer sees (often called internal covariate shift). That moving target slows training. Batch normalization steadies it by normalizing each layer's activations on the fly, using the statistics of the current mini-batch.

The recipe (per feature, per mini-batch)

1. Subtract the batch mean, divide by the batch std (in practice √(variance + ε), with a small ε for numerical stability) → mean 0, variance 1.
2. Scale and shift by learnable γ and β so the network can undo the normalization if needed.
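
The two steps can be sketched in NumPy as follows. This is a minimal illustration, not a framework implementation: the function name batch_norm_forward, the (batch, features) layout, and the eps default are assumptions made for the example.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch x of shape (batch, features), per feature."""
    mean = x.mean(axis=0)                    # batch mean of each feature
    var = x.var(axis=0)                      # batch variance of each feature
    x_hat = (x - mean) / np.sqrt(var + eps)  # step 1: mean 0, variance close to 1
    return gamma * x_hat + beta              # step 2: learnable scale and shift

eps = 1e-5
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # a "messy" batch (illustrative data)

# With gamma = 1 and beta = 0 the output is just the normalized batch.
y = batch_norm_forward(x, np.ones(4), np.zeros(4), eps)

# Setting gamma to the batch std and beta to the batch mean recovers x,
# which is the sense in which the network can "undo" the normalization.
undone = batch_norm_forward(x, np.sqrt(x.var(axis=0) + eps), x.mean(axis=0), eps)
```

In a real network γ and β are trained by gradient descent along with the weights; here they are fixed by hand only to show the two extremes.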

Watch the distribution snap into shape

A messy batch of activations gets centered and scaled to a clean distribution, then re-scaled by the learnable parameters.

Why it helps

Faster training (higher LR)

Stable activations let you use a bigger learning rate without diverging.

Smoother gradients (less sensitivity)

Keeping activations in a consistent range mitigates vanishing and exploding gradients and makes training less sensitive to weight initialization.

Slight regularization (batch noise)

The per-batch statistics add a little noise — a mild regularizing effect.

Practical notes

Good to know

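Usually placed before the activation (between the linear or conv layer and the nonlinearity)
At test time, uses the running mean/variance accumulated during training, not the current batch's statistics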
Workhorse of CNNs (ResNet, etc.)
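
The train/test difference can be sketched like this. The class BatchNorm1D below is a hypothetical minimal layer written for these notes; the exponential-moving-average update and the momentum value are assumptions for illustration, and real frameworks differ in details.

```python
import numpy as np

class BatchNorm1D:
    """Hypothetical minimal layer; names and defaults are illustrative."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Training: normalize with the statistics of this mini-batch...
            mean = x.mean(axis=0)
            var = x.var(axis=0)
            # ...and fold them into an exponential moving average.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Test time: use the accumulated running statistics instead.
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(num_features=3)
for _ in range(200):
    bn(rng.normal(loc=2.0, scale=4.0, size=(32, 3)), training=True)

# A single example can be normalized at test time, because no batch
# statistics are needed.
single = np.array([[2.0, 2.0, 2.0]])
out = bn(single, training=False)
```

Using running statistics makes the test-time output of one example independent of whatever else happens to be in the batch.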

Watch out

Weak with tiny batches (noisy stats)
Awkward in RNNs → use LayerNorm instead
Transformers use LayerNorm, not BatchNorm
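
A quick way to see why tiny batches give noisy statistics is to measure how much the batch mean varies from one mini-batch to the next. The sketch below uses synthetic standard-normal data and a hypothetical helper, spread_of_batch_means; the batch sizes are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def spread_of_batch_means(batch_size, num_batches=2000):
    """Std. dev. of the batch mean across many mini-batches drawn from N(0, 1)."""
    batches = rng.normal(size=(num_batches, batch_size))
    return batches.mean(axis=1).std()

small = spread_of_batch_means(batch_size=2)    # tiny batches
large = spread_of_batch_means(batch_size=256)  # large batches
print(f"batch size 2: {small:.3f}   batch size 256: {large:.3f}")
```

The spread of the batch mean shrinks roughly as 1/√(batch size), so with only a couple of examples per batch the normalization statistics jump around a lot from step to step.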

Cousins

LayerNorm (normalize across features per example), GroupNorm, InstanceNorm — same idea, different axis. LayerNorm dominates NLP and transformers.
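
The "different axis" point can be made concrete with one helper applied over different axes of an image-style tensor. This is a sketch under assumptions: the (N, C, H, W) layout, the helper name normalize, and the group count are illustrative, the learnable γ and β are omitted, and LayerNorm is shown over all of (C, H, W), whereas transformers typically normalize over the last feature dimension only.

```python
import numpy as np

def normalize(x, axis, eps=1e-5):
    """Zero-mean, unit-variance normalization over the given axes."""
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 16, 4, 4))  # (N, C, H, W)

batch_norm = normalize(x, axis=(0, 2, 3))   # per channel, across batch and space
layer_norm = normalize(x, axis=(1, 2, 3))   # per example, across all its features
instance_norm = normalize(x, axis=(2, 3))   # per example and channel, across space

# GroupNorm: split the channels into groups and normalize within each
# group, separately for every example.
groups = 4
xg = x.reshape(8, groups, 16 // groups, 4, 4)
group_norm = normalize(xg, axis=(2, 3, 4)).reshape(x.shape)
```

Only the BatchNorm line averages over the batch axis (axis 0); the other three depend on a single example at a time, which is why they are unaffected by batch size.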