Backpropagation

Deep Learning chain rule gradients training

Assigning blame for the error

Gradient descent needs the gradient of the loss with respect to every weight. In a deep network that's millions of derivatives. Backpropagation computes them all efficiently — in a single backward pass.

The idea: after the forward pass produces a prediction and a loss, push that error backward through the network. Each layer asks "how much did I contribute to the error?" and passes the answer to the layer before it.

The engine: the chain rule

The loss depends on a weight through a chain of operations. The chain rule multiplies the local derivatives along that chain: ∂L/∂w = ∂L/∂a · ∂a/∂z · ∂z/∂w.

Forward, then backward

Watch a value flow forward to a prediction and loss, then the error flow backward, depositing a gradient on every weight along the way.

The chain rule, intuitively

Why it's a product

If wiggling a weight changes z a little, and changing z changes the activation a a little, and changing a changes the loss a little — then the total effect is those small effects multiplied together. That's all the chain rule is.

Reuse work efficient

The gradient at a layer is built from the gradient of the layer after it — computed once, reused. No redundant recomputation.

One backward pass all gradients

A single sweep from output to input yields every weight's gradient.

Then update descent step

Feed the gradients to gradient descent: w ← w − η·∂L/∂w.

Here is the chain rule with real numbers: one neuron, x = 2, target y = 1. Slide w and watch every local derivative update; their product is the gradient. Then press descent step to apply w ← w − η·∂L/∂w and watch the loss fall.

Notice the middle factor a(1−a): it can never exceed 0.25, and near w = ±2 it almost vanishes — that single factor is the seed of the vanishing-gradient problem. Click descent step a few times: the gradient drives w uphill on a, downhill on L.

The training loop

Four steps, repeated

1. Forward pass → prediction. 2. Compute loss. 3. Backward pass → gradients. 4. Update weights. Loop over batches and epochs until the loss is low.

Modern frameworks (PyTorch, TensorFlow) do step 3 automatically via autodiff — you write only the forward pass, and they backprop for you. But watch out for vanishing & exploding gradients in very deep nets.