Optimizers — Momentum, RMSprop, Adam

Why plain gradient descent isn't enough

Loss surfaces often contain long, narrow valleys (ravines). Plain gradient descent zig-zags across such a ravine because the gradient points mostly across the valley rather than along it, so progress toward the minimum crawls.
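The zig-zag can be reproduced on a toy ravine. The sketch below is only an illustration: the quadratic loss, its curvatures (1 along the valley, 50 across it), the starting point, and the learning rate are all assumptions chosen to make the effect visible, and the names grad and gd_step are made up for this example.

```python
# Illustrative ravine: f(x, y) = 0.5 * (1 * x**2 + 50 * y**2).
# The curvatures and the learning rate are assumptions for this sketch.
CURVATURES = (1.0, 50.0)

def grad(w):
    """Gradient of the ravine loss at point w = [x, y]."""
    return [c * wi for c, wi in zip(CURVATURES, w)]

def gd_step(w, lr):
    """One plain gradient descent update: w <- w - lr * grad(w)."""
    return [wi - lr * gi for wi, gi in zip(w, grad(w))]

w = [10.0, 1.0]  # far along the valley floor, slightly up one wall
path = [w]
for _ in range(20):
    w = gd_step(w, lr=0.035)
    path.append(w)

# Print a few points: y flips sign each step, x barely moves.
for step in (0, 1, 2, 3, 20):
    x, y = path[step]
    print(f"step {step:2d}: x = {x:7.4f}, y = {y:+.4f}")
```

With these numbers each step multiplies y by 1 - 0.035 * 50 = -0.75, so the point hops from wall to wall, while x shrinks by only 3.5% per step. Raising the learning rate to speed up x would make the y oscillation grow instead of shrink, which is the trade-off the optimizers below try to escape.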

The optimizers below are update rules designed to fix this. They keep a running memory of past gradients and adapt the step size, which typically reaches a minimum in far fewer steps, and more reliably, than plain gradient descent.

Race them down a ravine

On an elongated valley, plain SGD bounces side to side while Momentum and Adam take a much more direct path along the valley floor.

The big three

Momentum: velocity

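Accumulate a running average of past gradients (the velocity) and step along it, like a heavy ball rolling downhill: gradient components that keep pointing the same way build up speed, while side-to-side components that flip sign largely cancel.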
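A minimal sketch of one momentum update, assuming the exponential-average form of the velocity. The function name momentum_step, the default values, and the synthetic gradient stream are illustrative assumptions, not taken from any particular library.

```python
def momentum_step(w, v, g, lr=0.1, beta=0.9):
    """One momentum update.

    v is the velocity: an exponential running average of gradients.
    The parameters then move along v instead of the raw gradient g.
    """
    v = [beta * vi + (1 - beta) * gi for vi, gi in zip(v, g)]
    w = [wi - lr * vi for wi, vi in zip(w, v)]
    return w, v

# Hypothetical gradient stream: a steady pull of 1.0 along the valley (index 0)
# and a component of size 5.0 across it that flips sign every step (index 1).
w, v = [0.0, 0.0], [0.0, 0.0]
for t in range(50):
    g = [1.0, 5.0 if t % 2 == 0 else -5.0]
    w, v = momentum_step(w, v, g)

print("velocity along valley :", round(v[0], 3))
print("velocity across valley:", round(v[1], 3))
```

The steady component passes through almost unchanged, while the alternating component is averaged down to a small fraction of its raw size. Some implementations accumulate v = beta * v + g without the (1 - beta) factor; with a fixed beta and a zero initial velocity that only rescales the effective learning rate.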

RMSprop: adaptive step

Divide each parameter's step by the square root of a running average of its squared gradients: directions with consistently large gradients get smaller steps, and directions with small gradients get larger ones.
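As a rough sketch, one RMSprop update can be written as below. Gradient size is measured here by a running average of squared gradients, the usual RMSprop choice; the function name rmsprop_step and the values of lr, decay, and eps are assumptions for illustration.

```python
import math

def rmsprop_step(w, s, g, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSprop update.

    s is a per-parameter running average of squared gradients.
    Each step is divided by sqrt(s), so steep directions are slowed down
    and shallow directions are sped up. eps avoids division by zero.
    """
    s = [decay * si + (1 - decay) * gi * gi for si, gi in zip(s, g)]
    w = [wi - lr * gi / (math.sqrt(si) + eps) for wi, si, gi in zip(w, s, g)]
    return w, s

# Two parameters whose gradients differ by a factor of 100.
w, s = [0.0, 0.0], [0.0, 0.0]
g = [0.1, 10.0]
w, s = rmsprop_step(w, s, g)
print(w)
```

Despite the 100x difference in gradient magnitude, both parameters move by essentially the same amount on this first step, because each gradient is divided by its own scale.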

Adam: momentum + RMSprop

Combines both: a velocity term and per-parameter scaling, plus a bias correction for the zero-initialized running averages. Robust and fast, and the default choice for most deep learning.
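Putting the two ideas together, a single Adam update might look like the sketch below. It follows the commonly published form of the algorithm, including the bias correction for the zero-initialized averages; the function name adam_step and the list-based parameters are simplifications assumed for this example.

```python
import math

def adam_step(w, m, v, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1).

    m: running average of gradients (the momentum part).
    v: running average of squared gradients (the RMSprop part).
    Both start at zero, so they are rescaled by 1 - beta**t to undo that bias.
    """
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v

# Same mismatched gradients as a ravine would produce: 0.1 versus 10.0.
w, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
w, m, v = adam_step(w, m, v, g=[0.1, 10.0], t=1)
print(w)
```

On the first step the bias-corrected averages reduce to the gradient and its square, so every parameter moves by almost exactly lr regardless of gradient magnitude. That scale invariance is a large part of why the default learning rate transfers across so many problems.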

Which to pick

Use Adam when

You want a strong default that "just works"
Transformers, most modern nets
You don't want to hand-tune schedules

Use SGD + Momentum when

Training CNNs for best final accuracy
You can afford to tune the schedule
You want better generalization in some vision tasks

Defaults

Adam with learning rate 0.001, β₁ = 0.9, and β₂ = 0.999 is the canonical starting point. AdamW (Adam with decoupled weight decay, applied directly to the weights rather than added to the gradient) is now standard for transformers.
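The difference between AdamW's weight decay and ordinary L2 regularization under Adam can be seen in a small sketch. The helper names (adam_direction, adam_l2_step, adamw_step) and the decay value wd=0.01 are assumptions for illustration; the moment updates use the defaults quoted above.

```python
import math

def adam_direction(m, v, g, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Update the moment estimates and return Adam's normalized step direction."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    direction = [mh / (math.sqrt(vh) + eps) for mh, vh in zip(m_hat, v_hat)]
    return direction, m, v

def adam_l2_step(w, m, v, g, t, lr=0.001, wd=0.01):
    """Adam with L2 regularization: the decay term is folded into the gradient."""
    g = [gi + wd * wi for gi, wi in zip(g, w)]
    d, m, v = adam_direction(m, v, g, t)
    return [wi - lr * di for wi, di in zip(w, d)], m, v

def adamw_step(w, m, v, g, t, lr=0.001, wd=0.01):
    """AdamW-style step: weights are shrunk directly, outside Adam's scaling."""
    d, m, v = adam_direction(m, v, g, t)
    return [wi - lr * di - lr * wd * wi for wi, di in zip(w, d)], m, v

# Zero loss gradient, so any movement comes from the decay term alone.
zero = [0.0]
w_l2, _, _ = adam_l2_step([1.0], zero, zero, g=[0.0], t=1)
w_aw, _, _ = adamw_step([1.0], zero, zero, g=[0.0], t=1)
print("L2 in gradient :", w_l2)
print("decoupled decay:", w_aw)
```

When the decay term is added to the gradient, Adam's normalization rescales it, so the weight moves by about lr no matter what wd is. With decoupled decay the weight shrinks by exactly lr * wd * w, which keeps the decay strength under direct control.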