Vanishing & Exploding Gradients · Suman Bhadra Notes

Why depth is hard

Backpropagation computes the gradient at a layer with the chain rule: it multiplies together the local derivatives of every layer after it. Multiply many numbers together and trouble follows.

Vanishing (factors < 1)

Multiply many numbers below 1 and the product shrinks toward zero. Early layers get almost no gradient → they barely learn.

Exploding (factors > 1)

Multiply many numbers above 1 and the product blows up. Weight updates become huge and the weights swing wildly → the loss can overflow to inf or NaN.

This is exactly why early deep and recurrent networks were so hard to train — and why sigmoid/tanh, whose derivatives are at most 1 (sigmoid peaks at just 0.25) and shrink toward 0 as the unit saturates, make vanishing worse.
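As a quick numeric check of those derivative bounds, the sketch below evaluates the sigmoid and tanh derivatives at a few inputs. The function names are illustrative and not taken from any library.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = s * (1 - s), which is largest at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2, which is largest at x = 0
    return 1.0 - math.tanh(x) ** 2

# Derivatives at the centre and further out into the saturated region
for x in (0.0, 2.0, 5.0):
    print(f"x={x}: sigmoid'={sigmoid_grad(x):.4f}  tanh'={tanh_grad(x):.4f}")
```

At x = 0 the two derivatives take their maximum values (0.25 and 1); by x = 5 both are close to zero, which is the saturated regime the text describes.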

Watch the gradient shrink or blow up

The same error enters at the output and propagates back through 8 layers, multiplied by a per-layer factor at each step. With a factor of 0.5 it fades to 0.5^8 ≈ 0.004 of its original size; with a factor of 1.5 it grows about 25.6-fold.
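The 8-layer experiment can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes every layer contributes the same scalar factor, which is a simplification of a real network; `backprop_magnitudes` is a hypothetical helper name.

```python
def backprop_magnitudes(factor, n_layers=8, error=1.0):
    """Return the gradient magnitude after each layer, output side first."""
    magnitudes = []
    grad = error
    for _ in range(n_layers):
        grad *= factor  # one multiplication per layer crossed
        magnitudes.append(grad)
    return magnitudes

for factor in (0.5, 1.0, 1.5):
    final = backprop_magnitudes(factor)[-1]
    print(f"factor={factor}: gradient reaching the first layer = {final:.6f}")
```

Only a factor of exactly 1 leaves the signal unchanged; anything below or above compounds geometrically with depth.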

The fixes that made deep learning work

ReLU activations (gradient = 1)

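ReLU's derivative is 1 for positive inputs, so active units pass the gradient back without shrinking it (for negative inputs it is 0, so inactive units pass nothing). The first big fix.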
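To make the ReLU behaviour concrete, here is a small sketch of its derivative acting as a pass-or-block mask on an incoming gradient. The pre-activation values and the upstream gradient are made-up numbers, and taking the derivative to be 0 at exactly z = 0 is a common convention assumed here.

```python
def relu_grad(z):
    """Derivative of ReLU at pre-activation z (0 at z == 0 by convention)."""
    return 1.0 if z > 0 else 0.0

pre_activations = [2.3, 0.7, -1.1, 4.0]  # hypothetical values for four units
upstream = 0.8                           # gradient arriving from the layer above

# Each unit either passes the upstream gradient unchanged or blocks it
local = [upstream * relu_grad(z) for z in pre_activations]
print(local)
```

The three active units forward the gradient at full strength, while the unit with a negative pre-activation forwards none.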

Residual connections (skip paths)

ResNet's shortcuts make each block compute x + F(x), so the gradient always has an identity path straight back: a direct highway that bypasses the multiplications inside F.
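A scalar toy model shows why the shortcut helps. The sketch assumes each block is the linear map F(x) = a·x with a small slope a (an illustrative stand-in for a real layer), so a plain block has derivative a and a residual block x + F(x) has derivative 1 + a.

```python
def stack_gradient(local_derivative, n_blocks=8, residual=False):
    """Derivative of the stack's output w.r.t. its input for scalar blocks.

    Each block is F(x) = local_derivative * x (a made-up linear block).
    With residual=True the block computes x + F(x) instead.
    """
    per_block = local_derivative + (1.0 if residual else 0.0)
    grad = 1.0
    for _ in range(n_blocks):
        grad *= per_block
    return grad

plain = stack_gradient(0.01)
skip = stack_gradient(0.01, residual=True)
print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {skip:.4f}")
```

With the same weak blocks, the plain stack's gradient is effectively zero while the residual stack's stays close to 1, because the identity term dominates each factor.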

Batch normalization (rescale activations)

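BatchNorm normalizes each feature to zero mean and unit variance over the mini-batch, then applies a learned scale and shift. This keeps activations well-scaled from layer to layer, stabilizing gradients.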
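The normalization step itself is short. The sketch below covers only the training-time computation for a single feature; `gamma`, `beta`, and `eps` follow the usual naming, and the batch values are invented. Running statistics for inference are left out.

```python
import math

def batch_norm(values, eps=1e-5, gamma=1.0, beta=0.0):
    """Normalize one feature across a mini-batch, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

batch = [50.0, 60.0, 70.0, 80.0]  # badly scaled activations (made-up)
normed = batch_norm(batch)
print([round(v, 3) for v in normed])
```

Whatever scale the inputs arrive at, the outputs are centred on zero with roughly unit variance before the learned scale and shift are applied.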

Careful init (Xavier / He)

Good weight initialization scales the random starting weights by layer width (Xavier for sigmoid/tanh, He for ReLU) so the per-layer factor is near 1 from the start.
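The effect of the weight scale can be seen in a small simulation. This sketch pushes a random vector through 8 fully connected ReLU layers and tracks the mean squared activation, used here as a forward-pass proxy for the per-layer factor. The width of 64, the comparison value of 0.01, and the helper name are arbitrary choices for illustration; the He standard deviation is sqrt(2 / fan_in).

```python
import math
import random

def mean_square_after(n_layers, width, weight_std, seed=0):
    """Mean squared activation after n_layers random ReLU layers."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(n_layers):
        # Each output unit: ReLU of a fresh random weighted sum of the inputs
        x = [
            max(0.0, sum(rng.gauss(0.0, weight_std) * xj for xj in x))
            for _ in range(width)
        ]
    return sum(v * v for v in x) / width

width = 64
he_std = math.sqrt(2.0 / width)  # He initialization for ReLU layers
he_scale = mean_square_after(8, width, he_std)
tiny_scale = mean_square_after(8, width, 0.01)
print(f"He init:  {he_scale:.3e}")
print(f"std=0.01: {tiny_scale:.3e}")
```

With He scaling the activation size stays within an order of magnitude or so of where it started, while the fixed small standard deviation drives it toward zero within a few layers.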

Gradient clipping (cap the norm)

For exploding gradients (common in RNNs), rescale the gradient whenever its norm exceeds a chosen threshold, so its direction is kept but its size is capped.
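Clipping by norm is a small operation. This is a minimal sketch on a flat list of gradient values; `clip_by_norm` is a hypothetical helper, and deep learning frameworks ship their own equivalents.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)  # already small enough, leave as is
    scale = max_norm / norm
    return [g * scale for g in grad]

big = clip_by_norm([3.0, 4.0], max_norm=1.0)    # norm 5 is scaled down to 1
small = clip_by_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5 is left untouched
print(big, small)
```

Because every component is multiplied by the same scale, the update still points the same way; only the step length is limited.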

Gated units (LSTM / GRU)

LSTMs and GRUs use gates to decide what to keep and what to overwrite. The LSTM's cell state is updated additively, which lets gradients flow across many time steps.
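The additive cell-state path can be illustrated with a scalar recurrence c_t = f_t · c_{t-1} + write_t. The sketch treats the forget gates and the written values as fixed constants, which is a simplifying assumption: in a real LSTM they depend on the input and hidden state. It measures how much the final cell state changes when the initial one is nudged.

```python
def run_cell_state(c0, forget_gates, writes):
    """Apply c_t = f_t * c_{t-1} + write_t at every step; return the final c."""
    c = c0
    for f, w in zip(forget_gates, writes):
        c = f * c + w
    return c

steps = 100
forget_gates = [0.99] * steps  # forget gate held near 1 (assumed constant)
writes = [0.05] * steps        # stands in for input_gate * candidate

# Finite-difference estimate of d c_T / d c_0 along the cell-state path
eps = 1e-6
sensitivity = (
    run_cell_state(1.0 + eps, forget_gates, writes)
    - run_cell_state(1.0, forget_gates, writes)
) / eps
print(f"sensitivity after {steps} steps: {sensitivity:.4f}")
print(f"per-step factor of 0.5 instead:  {0.5 ** steps:.3e}")
```

With the forget gate near 1, a usable fraction of the signal survives 100 steps, whereas a per-step factor of 0.5 leaves essentially nothing.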