Backpropagation
Assigning blame for the error
Gradient descent needs the gradient of the loss with respect to every weight. In a deep network that's millions of derivatives. Backpropagation computes them all efficiently — in a single backward pass.
The idea: after the forward pass produces a prediction and a loss, push that error backward through the network. Each layer asks "how much did I contribute to the error?" and passes the answer to the layer before it.
The loss depends on a weight through a chain of operations. The chain rule multiplies the local derivatives along that chain: ∂L/∂w = ∂L/∂a · ∂a/∂z · ∂z/∂w.
Forward, then backward
Watch a value flow forward to a prediction and loss, then the error flow backward, depositing a gradient on every weight along the way.
The chain rule, intuitively
If wiggling a weight changes z a little, and changing z changes the activation a a little, and changing a changes the loss a little — then the total effect is those small effects multiplied together. That's all the chain rule is.
The gradient at a layer is built from the gradient of the layer after it — computed once, reused. No redundant recomputation.
A single sweep from output to input yields every weight's gradient.
Here is the chain rule with real numbers: one neuron, x = 2, target y = 1. Slide w and watch every local derivative update; their product is the gradient. Then press descent step to apply w ← w − η·∂L/∂w and watch the loss fall.
Notice the middle factor a(1−a): it can never exceed 0.25, and near w = ±2 it almost vanishes — that single factor is the seed of the vanishing-gradient problem. Click descent step a few times: the gradient drives w uphill on a, downhill on L.
The training loop
1. Forward pass → prediction. 2. Compute loss. 3. Backward pass → gradients. 4. Update weights. Loop over batches and epochs until the loss is low.
Modern frameworks (PyTorch, TensorFlow) do step 3 automatically via autodiff — you write only the forward pass, and they backprop for you. But watch out for vanishing & exploding gradients in very deep nets.