Weight Initialization — Xavier vs He

Where you start matters

Before training, every weight needs a starting value. That choice sounds trivial, but a bad one can leave the network training very slowly or not at all.

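All zeros: symmetry

Every neuron in a layer computes the same output and receives the same gradient, so the neurons stay identical throughout training. However wide the layer is, it behaves like a single neuron, and the network can't learn distinct features.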
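The symmetry problem can be checked numerically. The sketch below is an illustration only: the tiny one-hidden-layer tanh network, the squared-error loss, and the helper name hidden_grad are all assumptions made for the demo, not part of the original text. It compares the gradient of the hidden-layer weights under a constant init (all zeros is the special case where the constant is 0) and under a random init.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))   # 8 samples, 4 input features
y = rng.normal(size=(8, 1))   # arbitrary regression targets

def hidden_grad(W1, W2):
    """Gradient of 0.5 * mean squared error w.r.t. W1 (hypothetical helper)."""
    h = np.tanh(x @ W1)                  # hidden activations, shape (8, 3)
    err = h @ W2 - y                     # prediction error, shape (8, 1)
    dh = (err @ W2.T) * (1.0 - h ** 2)   # backprop through tanh
    return x.T @ dh / len(x)             # one column per hidden neuron

# Constant init: every hidden neuron starts out identical.
g_const = hidden_grad(np.full((4, 3), 0.5), np.full((3, 1), 0.5))
# Random init: neurons start out different.
g_rand = hidden_grad(rng.normal(size=(4, 3)), rng.normal(size=(3, 1)))

# True when every column (neuron) receives exactly the same gradient.
same_const = bool(np.allclose(g_const, g_const[:, :1]))
same_rand = bool(np.allclose(g_rand, g_rand[:, :1]))
print(same_const, same_rand)
```

With the constant init every column of the gradient is the same, so a gradient step moves all neurons together and they remain copies of each other. With the random init the columns differ, which is what lets the neurons specialize.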

Too large: explode

Activations grow layer by layer until they saturate (tanh, sigmoid) or blow up (ReLU). Gradients explode, or vanish wherever units have saturated.

Too small: vanish

Activations shrink toward zero through the layers. Gradients vanish and learning stalls.

See variance across layers

Track the spread of activations through a deep stack: with weights that are too small it fades out, with weights that are too large it blows up, and with a well-scaled init it stays steady.
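The same experiment can be approximated in a few lines. This is a minimal sketch under simplifying assumptions: the layers are purely linear (no activation), the width, depth, and batch size are arbitrary, and final_std is a hypothetical helper name. Weights are drawn with standard deviation scale / sqrt(width), so scale = 1 corresponds to var = 1/n.

```python
import numpy as np

def final_std(scale, depth=20, width=256, batch=512, seed=0):
    """Std of activations after `depth` linear layers (hypothetical helper).

    Weights are drawn from N(0, (scale / sqrt(width))**2).
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(batch, width))   # unit-variance inputs
    for _ in range(depth):
        W = rng.normal(0.0, scale / np.sqrt(width), size=(width, width))
        a = a @ W                         # linear layer, no activation (assumption)
    return float(a.std())

std_small = final_std(0.5)   # variance shrinks about 4x per layer
std_good = final_std(1.0)    # var(W) = 1/n keeps the spread roughly steady
std_large = final_std(2.0)   # variance grows about 4x per layer

print(f"too small: {std_small:.2e}")
print(f"well scaled: {std_good:.2e}")
print(f"too large: {std_large:.2e}")
```

After 20 layers the too-small and too-large runs should differ from the well-scaled one by many orders of magnitude, while the well-scaled run stays within a small factor of the input spread. Exact numbers depend on the seed and the sizes chosen.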

The principled fixes

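Xavier / Glorot: var = 1/n

Draw weights with variance 1/n, where n is the layer's fan-in (its number of inputs). Glorot's original form averages fan-in and fan-out, giving var = 2/(fan_in + fan_out), which reduces to 1/n when the two are equal. Designed for tanh/sigmoid, it keeps variance ≈ constant across layers.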

He / Kaiming: var = 2/n

Twice the Xavier variance, with n again the fan-in. The factor of 2 compensates for ReLU zeroing out about half the activations. The default choice for ReLU nets.
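Both rules amount to picking a standard deviation from the layer's shape. The sketch below is one way to write them with NumPy; the function names xavier_normal, he_normal, and relu_rms are hypothetical, and the normal (rather than uniform) distribution, the layer sizes, and the depth are assumptions for illustration. It then pushes random inputs through a stack of ReLU layers to compare the two scales.

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng):
    """Glorot/Xavier: Var(W) = 2 / (fan_in + fan_out), i.e. 1/n for square layers."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    """He/Kaiming: Var(W) = 2 / fan_in."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def relu_rms(init_fn, depth=20, width=256, batch=512, seed=0):
    """Root-mean-square activation after `depth` ReLU layers (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(batch, width))
    for _ in range(depth):
        a = np.maximum(a @ init_fn(width, width, rng), 0.0)   # linear layer + ReLU
    return float(np.sqrt(np.mean(a ** 2)))

rms_he = relu_rms(he_normal)
rms_xavier = relu_rms(xavier_normal)
print(f"He + ReLU:     {rms_he:.3e}")
print(f"Xavier + ReLU: {rms_xavier:.3e}")
```

In a ReLU stack the He scale should keep the activation magnitude roughly where it started, while the Xavier scale loses about half the signal energy per layer and ends up far smaller. That gap is the reason for the "He for ReLU" rule of thumb.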

Small random weights: break symmetry

Weights are random (not zero) to break symmetry; biases often start at 0.

The core idea

Choose the random scale so each layer's output variance ≈ its input variance. Then activations neither blow up nor fade as they pass through many layers — and gradients stay healthy.
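The 1/n and 2/n scales follow from a short variance calculation. The derivation below is a sketch under standard simplifying assumptions that the original text does not spell out: weights are zero-mean and independent of the inputs, the n inputs to a neuron are identically distributed, and pre-activations are symmetric about zero.

```latex
\begin{align}
y &= \sum_{i=1}^{n} w_i x_i
  &&\Rightarrow\quad
  \operatorname{Var}(y) = n \,\operatorname{Var}(w)\, \mathbb{E}[x^2] \\[4pt]
\text{linear / tanh near zero:}\quad
  \mathbb{E}[x^2] &= \operatorname{Var}(x)
  &&\Rightarrow\quad
  \operatorname{Var}(y) = \operatorname{Var}(x)
  \iff \operatorname{Var}(w) = \frac{1}{n} \\[4pt]
\text{ReLU, } x = \max(0, y_{\text{prev}}):\quad
  \mathbb{E}[x^2] &= \tfrac{1}{2}\operatorname{Var}(y_{\text{prev}})
  &&\Rightarrow\quad
  \operatorname{Var}(y) = \operatorname{Var}(y_{\text{prev}})
  \iff \operatorname{Var}(w) = \frac{2}{n}
\end{align}
```

The first line is the general rule, the second recovers the Xavier scale, and the third shows where He's extra factor of 2 comes from: ReLU discards the negative half of a symmetric distribution, which halves the second moment.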

In practice

Just use the defaults

Framework defaults are already sensible: Keras initializes Dense layers with Glorot (Xavier) uniform, and PyTorch's nn.Linear uses a Kaiming (He) uniform variant. The usual rule of thumb is He for ReLU layers and Xavier for tanh. In networks that use BatchNorm and ReLU, init is far less fragile than it once was, but it still matters for very deep nets.
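When the defaults need overriding, the rule of thumb can be applied explicitly. The following is a minimal sketch in PyTorch, assuming a small fully connected model; the helper name init_weights, its activation argument, and the layer sizes are hypothetical. It relies on torch.nn.init.kaiming_normal_, torch.nn.init.xavier_uniform_, and torch.nn.init.zeros_.

```python
import torch
from torch import nn

def init_weights(module, activation="relu"):
    """Apply He init for ReLU layers, Xavier otherwise (hypothetical helper)."""
    if isinstance(module, nn.Linear):
        if activation == "relu":
            # std = sqrt(2 / fan_in)
            nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        else:
            # uniform with variance 2 / (fan_in + fan_out)
            nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)   # biases start at 0

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model.apply(init_weights)   # visits every submodule, including each Linear

first = model[0]
print(first.weight.std().item(), (2.0 / 256) ** 0.5)
```

For simplicity this sketch gives every Linear layer the same treatment, including the output layer that is not followed by a ReLU. A real model might choose the scheme per layer.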