Gradient Descent


Rolling downhill, blindfolded

Training means finding the weights that make the loss as small as possible. Gradient descent does it with a simple, repeatable move: figure out which way is downhill, and step that way.

Picture the loss as a hilly landscape and the model as a ball on it. You can't see the whole valley, but you can feel the slope right where you stand: the gradient. Step downhill, feel again, step again. Repeat enough times and you settle at the bottom of a valley, though not necessarily the deepest one.

The update rule

w ← w − η · ∂Loss/∂w

Move each weight opposite its gradient, scaled by the learning rate η. The gradient points uphill, so the minus sign is what makes the loss go down.
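As a minimal sketch of the rule, the snippet below applies one update to a small list of weights. The function name `gradient_step` and the numbers are made up for illustration; the gradients are simply given rather than computed.

```python
def gradient_step(weights, grads, lr):
    """One gradient descent update: w <- w - lr * dLoss/dw, for every weight."""
    return [w - lr * g for w, g in zip(weights, grads)]


# Hypothetical values: two weights and their gradients.
weights = [0.5, -1.0]
grads = [2.0, -4.0]  # positive gradient: loss rises as w rises; negative: the reverse

new_weights = gradient_step(weights, grads, lr=0.1)
# The first weight moves down (its gradient is positive),
# the second moves up (its gradient is negative).
print(new_weights)
```

Each weight moves against the sign of its own gradient, and a larger gradient produces a larger move for the same η.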

Watch the ball roll

Starting from a bad weight, the animation reads the slope and steps downhill again and again until it settles at the minimum loss.

The pieces of the rule

Gradient (∂Loss/∂w): the slope

The direction in which the loss increases, and how steeply. It points uphill, so we step the opposite way.

Learning rate (η): step size

How far to step. Too small and training crawls; too big and it overshoots. See Learning Rate.

Iteration: repeat

One step barely moves; thousands of steps reach the valley floor.
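Putting the three pieces together gives a short loop. The sketch below assumes a toy loss, Loss(w) = (w − 3)², chosen only because its gradient 2(w − 3) and its minimum at w = 3 are easy to check by hand; the function names are hypothetical.

```python
def gradient_descent(grad_fn, w0, lr, steps):
    """Repeat the update w <- w - lr * grad_fn(w) and record every position."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - lr * grad_fn(w)
        history.append(w)
    return w, history


def toy_grad(w):
    # Gradient of the assumed toy loss (w - 3)^2.
    return 2.0 * (w - 3.0)


w_final, history = gradient_descent(toy_grad, w0=0.0, lr=0.01, steps=2000)
print("after 1 step:", history[1])      # a small move away from the start
print("after 2000 steps:", w_final)     # very close to the minimum at w = 3
```

With a small η, a single step covers only a sliver of the distance, but the repeated steps add up.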

Where the gradient comes from

In a network with millions of weights, backpropagation computes all the gradients efficiently in one backward pass.
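To make the backward pass concrete, here is a hand-written sketch for a deliberately tiny model with two weights, ŷ = w2 · tanh(w1 · x), and a squared-error loss. The model and all names are assumptions chosen for illustration; real frameworks automate exactly this chain-rule bookkeeping.

```python
import math


def forward_backward(w1, w2, x, y):
    """Return the loss and its gradient with respect to w1 and w2."""
    # Forward pass: compute and keep the intermediate values.
    z = w1 * x
    h = math.tanh(z)
    y_hat = w2 * h
    loss = (y_hat - y) ** 2

    # Backward pass: apply the chain rule from the loss back to each weight,
    # reusing the intermediates stored above.
    d_y_hat = 2.0 * (y_hat - y)
    d_w2 = d_y_hat * h
    d_h = d_y_hat * w2
    d_z = d_h * (1.0 - h * h)  # derivative of tanh(z) is 1 - tanh(z)^2
    d_w1 = d_z * x
    return loss, d_w1, d_w2


loss, d_w1, d_w2 = forward_backward(w1=0.5, w2=-1.5, x=2.0, y=1.0)

# Sanity check: compare d_w1 with a central finite difference.
eps = 1e-6
loss_plus, _, _ = forward_backward(0.5 + eps, -1.5, 2.0, 1.0)
loss_minus, _, _ = forward_backward(0.5 - eps, -1.5, 2.0, 1.0)
numeric_d_w1 = (loss_plus - loss_minus) / (2 * eps)
print(d_w1, numeric_d_w1)  # the two values should agree closely
```

The finite-difference check needs two extra forward passes per weight, while the backward pass yields every gradient at once. That difference is why backpropagation scales to large networks.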

Feel the learning rate yourself. Drag the ball anywhere on the curve, pick an η, and press Step repeatedly. Small η crawls and moderate η converges. On this curve, η above about 6.7 makes each step land farther from the minimum than where it started: divergence.

Three experiments: η = 0.3 (watch it crawl; count the steps), η = 6.0 (it leaps across the valley and oscillates in), η = 7.0 (each bounce is bigger; the loss explodes). A learning rate that is too high is one of the most common reasons training produces NaN.
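The three experiments can also be run numerically. The page does not state the curve's formula, so the sketch below assumes a quadratic, Loss(w) = 0.15 · w², whose gradient is 0.3 · w. That choice is an inference: on such a curve each step multiplies w by (1 − 0.3η), which shrinks only while η < 2 / 0.3 ≈ 6.67, matching the threshold quoted above.

```python
CURVATURE = 0.3  # assumed: gradient of the hypothetical loss 0.15 * w^2 is 0.3 * w


def run(lr, w0=10.0, steps=20):
    """Take `steps` gradient descent steps on the assumed quadratic; return final w."""
    w = w0
    for _ in range(steps):
        w = w - lr * CURVATURE * w
    return w


threshold = 2.0 / CURVATURE  # beyond this, |1 - lr * CURVATURE| > 1 and steps grow
print("divergence threshold:", threshold)

for lr in (0.3, 6.0, 7.0):
    # Distance from the minimum at w = 0 after 20 steps, starting from w = 10.
    print("lr =", lr, "distance =", abs(run(lr)))
```

With η = 0.3 the distance shrinks slowly, with η = 6.0 it shrinks while the sign of w flips every step, and with η = 7.0 it grows on every step.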

The bumps in the road

Works because
  • The negative gradient is guaranteed to point downhill locally, for a small enough step
  • It scales to billions of parameters
  • Backprop makes it cheap per step
Watch out for
  • Local minima & saddle points in complex surfaces
  • A bad learning rate that diverges or stalls
  • Computing the gradient on all data is slow → use mini-batches
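The mini-batch idea from the last bullet can be sketched as follows: instead of averaging the gradient over every example, each step averages it over a small random batch. The one-weight model y ≈ w · x, the synthetic data, and the function name `minibatch_sgd` are assumptions made for illustration.

```python
import random


def minibatch_sgd(xs, ys, lr=0.1, batch_size=10, epochs=50, seed=0):
    """Fit y ~ w * x with mean squared error, using one mini-batch per step."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of (w * x - y)^2 over this batch only.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w = w - lr * grad
    return w


# Synthetic, noise-free data generated with a true weight of 2.
xs = [i / 100 for i in range(100)]
ys = [2.0 * x for x in xs]

w_fit = minibatch_sgd(xs, ys)
print(w_fit)  # should land close to 2
```

Each step here touches 10 examples instead of 100, so a pass over the data yields ten updates rather than one. The price is a noisier gradient estimate per step.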