Weight Initialization — Xavier vs He

Where you start matters

Before training, every weight needs a starting value. That choice sounds trivial, but a bad one can keep the network from training at all.

All zeros symmetry

Every neuron in a layer computes the same output and receives the same gradient, so the neurons stay identical after every update. However wide the layer is, it can learn no more than a single neuron could.
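The symmetry problem can be reproduced in a few lines. The following is a minimal sketch, not a reference implementation: the tiny tanh network, the two toy samples, and the helper name `train_tiny_net` are all made up for illustration, and biases are left out to keep it short.

```python
import math
import random

def train_tiny_net(w1, w2, steps=20, lr=0.1):
    """Plain SGD on a 2-input, 3-hidden-unit tanh net with one linear output.

    w1[j] holds the two input weights of hidden unit j; w2[j] is its output weight.
    """
    data = [((1.0, 2.0), 1.0), ((-1.0, 0.5), -1.0)]  # hypothetical toy samples
    for _ in range(steps):
        for x, target in data:
            h = [math.tanh(w1[j][0] * x[0] + w1[j][1] * x[1]) for j in range(len(w1))]
            y = sum(w2[j] * h[j] for j in range(len(w1)))
            err = y - target  # d(loss)/dy for loss = 0.5 * (y - target)^2
            for j in range(len(w1)):
                grad_pre = err * w2[j] * (1.0 - h[j] ** 2)  # backprop through tanh
                w2[j] -= lr * err * h[j]
                w1[j][0] -= lr * grad_pre * x[0]
                w1[j][1] -= lr * grad_pre * x[1]
    return w1, w2

# Constant init: every hidden unit starts with the same weights.
const_w1, const_w2 = train_tiny_net([[0.5, 0.5] for _ in range(3)], [0.5] * 3)

# All-zero init: in this bias-free tanh net every gradient is exactly zero.
zero_w1, zero_w2 = train_tiny_net([[0.0, 0.0] for _ in range(3)], [0.0] * 3)

# Small random init: the units start different and stay different.
rng = random.Random(0)
rand_w1, rand_w2 = train_tiny_net(
    [[rng.gauss(0.0, 0.5) for _ in range(2)] for _ in range(3)],
    [rng.gauss(0.0, 0.5) for _ in range(3)],
)

print("constant:", const_w1)
print("zero:    ", zero_w1)
print("random:  ", rand_w1)
```

With the constant start, the three rows of `const_w1` move during training but remain equal to each other. With the zero start nothing moves at all, and only the random start lets the hidden units specialize.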

Too large explode

Activations grow layer by layer. With tanh or sigmoid they saturate, which flattens the gradient; with ReLU they keep growing, and the gradients explode with them.

Too small vanish

Activations shrink toward zero through the layers. Gradients vanish and learning stalls.

See variance across layers

Track the variance of the activations through a deep stack: a too-small init fades out, a too-large one blows up or saturates, and a well-scaled one stays roughly steady.
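The same experiment can be run offline. This is a rough sketch under stated assumptions: a stack of 10 tanh layers, 128 units wide, with no biases, fed one random input. The function name `activation_variances` and the three standard deviations are illustrative choices, and plain Python is used so that it runs without dependencies.

```python
import math
import random

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def activation_variances(weight_std, width=128, depth=10, seed=0):
    """Push one random input through `depth` tanh layers.

    Returns the variance of the activations after each layer. Weights are
    drawn on the fly from N(0, weight_std^2); there are no biases.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    history = []
    for _ in range(depth):
        x = [
            math.tanh(sum(rng.gauss(0.0, weight_std) * xi for xi in x))
            for _ in range(width)
        ]
        history.append(variance(x))
    return history

too_small = activation_variances(weight_std=0.01)
too_large = activation_variances(weight_std=1.0)
well_scaled = activation_variances(weight_std=math.sqrt(1.0 / 128))  # var = 1/n

for layer in range(10):
    print(
        f"layer {layer + 1:2d}  small={too_small[layer]:.2e}  "
        f"large={too_large[layer]:.2e}  1/n={well_scaled[layer]:.2e}"
    )
```

Because tanh caps every activation at 1, the too-large run does not grow without bound here; its variance pins near 1 as the units saturate. The 1/n run is not perfectly flat either, since tanh compresses its input a little at every layer, but it stays many orders of magnitude above the too-small run.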

The principled fixes

Xavier / Glorot var = 1/n

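Scale the weight variance by the layer size. Here n is the fan-in, the number of inputs to each neuron; Glorot's original form averages fan-in and fan-out, var = 2/(n_in + n_out), which equals 1/n when the two match. Designed for tanh/sigmoid, it keeps variance ≈ constant across layers.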
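As a concrete illustration, the two common Xavier variants can be sketched as below. The function names and the 300-by-200 layer size are assumptions made for this example, not any framework's API. Both variants target the same variance, 2/(fan_in + fan_out); the uniform limit follows from the fact that U(-a, a) has variance a²/3.

```python
import math
import random

def xavier_normal(fan_in, fan_out, rng):
    """fan_out x fan_in weights drawn from N(0, 2 / (fan_in + fan_out))."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def xavier_uniform(fan_in, fan_out, rng):
    """Same variance, drawn from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)] for _ in range(fan_out)]

def sample_variance(matrix):
    flat = [w for row in matrix for w in row]
    mean = sum(flat) / len(flat)
    return sum((w - mean) ** 2 for w in flat) / len(flat)

rng = random.Random(0)
fan_in, fan_out = 300, 200  # hypothetical layer shape
target = 2.0 / (fan_in + fan_out)
normal_var = sample_variance(xavier_normal(fan_in, fan_out, rng))
uniform_var = sample_variance(xavier_uniform(fan_in, fan_out, rng))

print(f"target variance  {target:.5f}")
print(f"normal sample    {normal_var:.5f}")
print(f"uniform sample   {uniform_var:.5f}")
```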

He / Kaiming var = 2/n

Twice the Xavier variance, with n the fan-in, to compensate for ReLU zeroing out roughly half the activations. The standard choice for ReLU nets.
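The factor of 2 can be checked numerically. The sketch below assumes a bias-free stack of 10 ReLU layers, 128 units wide, and tracks the mean squared activation after each layer. The helper name `relu_signal` and its `scale` parameter (weight variance = scale / n) are invented for this example.

```python
import math
import random

def relu_signal(scale, width=128, depth=10, seed=0):
    """Mean squared activation after each ReLU layer, for weight variance scale / width."""
    rng = random.Random(seed)
    std = math.sqrt(scale / width)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    history = []
    for _ in range(depth):
        # Fresh N(0, std^2) weights for every unit, then ReLU.
        x = [
            max(0.0, sum(rng.gauss(0.0, std) * xi for xi in x))
            for _ in range(width)
        ]
        history.append(sum(v * v for v in x) / width)
    return history

he = relu_signal(scale=2.0)          # var = 2/n
one_over_n = relu_signal(scale=1.0)  # var = 1/n, the Xavier-style scale

for layer in range(10):
    print(f"layer {layer + 1:2d}  2/n={he[layer]:.3e}  1/n={one_over_n[layer]:.3e}")
```

Both runs share a seed, so they draw the same random numbers up to a scale factor. Since ReLU commutes with positive scaling, the 1/n run loses a factor of 2 per layer relative to the 2/n run: about 1000x after 10 layers. The 2/n run itself fluctuates from layer to layer at this width but shows no systematic decay.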

Small random weights, zero bias break symmetry

Weights are random (not zero) to break symmetry. Biases often start at 0, which is safe because the random weights already make the neurons differ.

The core idea

Choose the random scale so each layer's output variance ≈ its input variance. Then activations neither blow up nor fade as they pass through many layers — and gradients stay healthy.
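Both scales follow from a short variance calculation. The derivation below is a sketch under simplifying assumptions that are not spelled out in the text above: the weights are i.i.d. with zero mean, they are independent of the inputs, and every input shares the same second moment.

```latex
% One pre-activation with fan-in n
y = \sum_{i=1}^{n} w_i x_i
\quad\Longrightarrow\quad
\operatorname{Var}(y) = n \,\operatorname{Var}(w)\, \mathbb{E}[x^2]

% Zero-mean inputs (tanh near its linear region): E[x^2] = Var(x)
\operatorname{Var}(y) = \operatorname{Var}(x)
\quad\Longleftrightarrow\quad
\operatorname{Var}(w) = \frac{1}{n}

% ReLU inputs x = max(0, y_prev), with y_prev symmetric about 0:
% E[x^2] = Var(y_prev) / 2
\operatorname{Var}(y) = \frac{n}{2}\,\operatorname{Var}(w)\,\operatorname{Var}(y_{\text{prev}})
\quad\Longrightarrow\quad
\operatorname{Var}(w) = \frac{2}{n}
```

The first result is the Xavier scale and the second is the He scale. The extra factor of 2 exists only to undo the half of the signal that ReLU discards.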

In practice

Just use the defaults

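Frameworks initialize every layer with a sensible default, so most models train without any manual setup. The default does not depend on the activation, though: Keras Dense layers use Glorot uniform, and PyTorch's nn.Linear uses a uniform init scaled by fan-in. When a deep ReLU net trains poorly, ask for He initialization explicitly. Combined with BatchNorm and ReLU, init is far less fragile than it once was, but it still matters for very deep nets.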