Weight Initialization — Xavier vs He
Where you start matters
Before training, every weight needs a starting value. That choice sounds trivial — but get it wrong and the network never trains at all.
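All-zero (or any constant) weights: every neuron in a layer computes the same thing and gets the same gradient, so the neurons stay identical forever. The layer behaves like a single unit and the network can't learn distinct features.
Too-large weights: activations grow layer by layer until they saturate (tanh/sigmoid) or blow up (ReLU). Gradients explode or, in saturated units, vanish.
Too-small weights: activations shrink toward zero through the layers. Gradients vanish and learning stalls.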
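The symmetry problem in the first failure mode can be checked numerically. The sketch below is an illustration under stated assumptions, not a reference implementation: it uses a tiny one-hidden-layer tanh network without biases, a squared-error loss, and a hypothetical helper named `hidden_weight_gradient`. With constant weights every hidden unit receives the same gradient row; with random weights the rows differ.

```python
import numpy as np

def hidden_weight_gradient(W1, W2, x, y):
    """Gradient of 0.5 * (y_hat - y)**2 w.r.t. W1 (hypothetical helper)."""
    h = np.tanh(W1 @ x)          # hidden activations, shape (hidden,)
    y_hat = W2 @ h               # scalar prediction
    err = y_hat - y              # dLoss/dy_hat
    dh = err * W2                # backprop into the hidden activations
    dpre = dh * (1.0 - h ** 2)   # backprop through tanh
    return np.outer(dpre, x)     # one row per hidden unit

rng = np.random.default_rng(0)
x = rng.normal(size=3)
y = 1.0

# Constant init: all hidden units share the same weights.
g_const = hidden_weight_gradient(np.full((4, 3), 0.5), np.full(4, 0.5), x, y)
rows_identical = np.allclose(g_const, g_const[0])

# All-zero init: the gradient for the hidden weights is exactly zero.
g_zero = hidden_weight_gradient(np.zeros((4, 3)), np.zeros(4), x, y)

# Random init: each hidden unit gets its own gradient.
W1_rand = rng.normal(scale=0.5, size=(4, 3))
W2_rand = rng.normal(scale=0.5, size=4)
g_rand = hidden_weight_gradient(W1_rand, W2_rand, x, y)
rows_differ = not np.allclose(g_rand, g_rand[0])

print("constant init, identical gradient rows:", rows_identical)
print("zero init, gradient is all zeros:", bool(np.all(g_zero == 0)))
print("random init, gradient rows differ:", rows_differ)
```

Because identical units receive identical updates, a gradient step cannot separate them. Only a random start gives each unit its own direction to move in.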
See variance across layers
Track the spread of activations through a deep stack: with weights that are too small it fades out, with weights that are too large it blows up, and with a good init it stays steady.
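A rough way to reproduce this experiment offline is to push a random batch through a stack of tanh layers and record the standard deviation of the activations after each one. The width, depth, batch size, and the function name `activation_std_per_layer` are all assumptions chosen for illustration. The "scaled" case uses a weight standard deviation of sqrt(1 / width), which for square layers coincides with the Xavier scale.

```python
import numpy as np

def activation_std_per_layer(weight_std, width=256, depth=20, seed=0):
    """Std of tanh activations after each layer (illustrative helper)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(512, width))   # batch of unit-variance inputs
    stds = []
    for _ in range(depth):
        W = rng.normal(scale=weight_std, size=(width, width))
        a = np.tanh(a @ W)
        stds.append(float(a.std()))
    return stds

width = 256
too_small = activation_std_per_layer(0.01, width)
too_large = activation_std_per_layer(1.0, width)
scaled = activation_std_per_layer(np.sqrt(1.0 / width), width)

for name, stds in [("too small", too_small),
                   ("too large", too_large),
                   ("scaled", scaled)]:
    print(f"{name:>9}: layer 1 std = {stds[0]:.3g}, layer 20 std = {stds[-1]:.3g}")
```

In this setup the too-small run collapses toward zero and the too-large run pins tanh near ±1 (saturation), while the scaled run stays in a usable range. With tanh the scaled run still drifts down slowly, so "steady" here means "no collapse and no saturation" rather than an exactly constant value.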
The principled fixes
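Xavier (Glorot) initialization: scale the weights by the layer's fan-in and fan-out, with Var(W) = 2 / (fan_in + fan_out). Designed for tanh/sigmoid, it keeps variance ≈ constant across layers.
He (Kaiming) initialization: a bigger scale, Var(W) = 2 / fan_in, that accounts for ReLU zeroing out half the activations. The default choice for ReLU nets.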
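The two schemes differ only in the standard deviation they draw from. The sketch below writes both as small NumPy helpers and compares them on a deep ReLU stack. The names `xavier_normal`, `he_normal`, and `relu_stack_rms` are hypothetical helpers written for this example, not a library API, and the normal-distribution variants are assumed (uniform variants also exist).

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng):
    """Var(W) = 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(scale=std, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    """Var(W) = 2 / fan_in."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(scale=std, size=(fan_in, fan_out))

def relu_stack_rms(init_fn, width=256, depth=20, seed=0):
    """Root-mean-square activation after `depth` ReLU layers."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(512, width))   # batch of unit-variance inputs
    for _ in range(depth):
        a = np.maximum(a @ init_fn(width, width, rng), 0.0)
    return float(np.sqrt(np.mean(a ** 2)))

rms_xavier = relu_stack_rms(xavier_normal)
rms_he = relu_stack_rms(he_normal)
print(f"ReLU stack, Xavier init: final RMS = {rms_xavier:.3g}")
print(f"ReLU stack, He init:     final RMS = {rms_he:.3g}")
```

With square layers the Xavier variance is 1 / width, so each ReLU layer roughly halves the mean squared activation and the signal fades over 20 layers. The factor of 2 in the He variance compensates for that halving, which keeps the RMS near its starting size.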
Weights are random (not zero) to break symmetry; biases often start at 0.
Choose the random scale so each layer's output variance ≈ its input variance. Then activations neither blow up nor fade as they pass through many layers — and gradients stay healthy.
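The variance-matching rule can be made concrete with a short calculation. It is a sketch under idealized assumptions: the weights w_i and inputs x_i are independent and zero-mean, and the symbols n_in and n_out (fan-in and fan-out) are introduced here for the derivation.

```latex
% One pre-activation is a sum over the layer's fan-in:
y = \sum_{i=1}^{n_{\text{in}}} w_i x_i
\quad\Longrightarrow\quad
\operatorname{Var}(y) = n_{\text{in}} \, \operatorname{Var}(w) \, \operatorname{Var}(x)

% Requiring Var(y) = Var(x) in the forward pass:
\operatorname{Var}(w) = \frac{1}{n_{\text{in}}}

% The same argument for gradients in the backward pass gives 1 / n_out;
% Xavier initialization averages the two constraints:
\operatorname{Var}(w) = \frac{2}{n_{\text{in}} + n_{\text{out}}}

% After a ReLU, E[x^2] = \tfrac{1}{2} \operatorname{Var}(y_{\text{prev}}),
% so the forward condition needs twice the variance (He initialization):
\operatorname{Var}(w) = \frac{2}{n_{\text{in}}}
```

For tanh and sigmoid the argument relies on the activation being roughly linear near zero, so the balance is approximate rather than exact.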