Vanishing & Exploding Gradients
Why depth is hard
Backpropagation computes a layer's gradient by multiplying the gradients of all the layers after it. Multiply many numbers together and trouble follows.
Multiply many numbers below 1 and the product shrinks toward zero. Early layers get almost no gradient → they barely learn.
Multiply many numbers above 1 and the product blows up. Weights swing wildly → loss becomes NaN.
This is exactly why early deep and recurrent networks were so hard to train — and why sigmoid/tanh, whose derivatives are always < 1, make vanishing worse.
Watch the gradient shrink or blow up
The same error enters at the output and propagates back through 8 layers, multiplied by a per-layer factor — see it fade to nothing, or explode.
The fixes that made deep learning work
ReLU's derivative is 1 for positive inputs — no shrinking. The first big fix.
ResNet's shortcuts give gradients a direct highway back, bypassing the multiplications.
BatchNorm keeps activations well-scaled, stabilizing gradients.
Good weight initialization keeps the per-layer factor near 1 from the start.
For exploding gradients (common in RNNs), clip the gradient to a maximum size.