Optimizers — Momentum, RMSprop, Adam
Why plain gradient descent isn't enough
Loss surfaces often contain long, narrow valleys. In such a ravine the gradient points mostly across the valley rather than along it, so plain gradient descent zig-zags between the walls and makes slow progress toward the minimum.
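Optimizers such as Momentum, RMSprop, and Adam are update rules designed to reduce this problem. They keep running statistics of past gradients and use them to adapt the direction and size of each step, which usually reaches a good minimum in fewer iterations.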
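The zig-zag can be reproduced in a few lines. This is a minimal sketch, assuming a hypothetical quadratic ravine f(x, y) = 0.5 * (x² + 50y²) and a learning rate of 0.035; both are chosen for illustration and are not taken from the lesson.

```python
# Hypothetical ravine: f(x, y) = 0.5 * (x**2 + 50 * y**2).
# The valley floor runs along x (gentle slope); y is the steep direction across it.

def grad(x, y):
    """Gradient of f(x, y) = 0.5 * (x**2 + 50 * y**2)."""
    return x, 50.0 * y


def gradient_descent(x, y, lr, steps):
    """Run plain gradient descent and return every visited point."""
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
        path.append((x, y))
    return path


# Illustrative starting point and learning rate (assumptions).
path = gradient_descent(x=10.0, y=1.0, lr=0.035, steps=20)
for step, (x, y) in enumerate(path[:6]):
    print(f"step {step}: x = {x:8.4f}, y = {y:8.4f}")
print(f"after 20 steps: x = {path[-1][0]:.4f}, y = {path[-1][1]:.4f}")
```

With these numbers each step multiplies y by 1 - 0.035 * 50 = -0.75, so y flips sign every iteration (the bounce across the valley), while x shrinks by only 3.5% per step (the crawl along it). Raising the learning rate to speed up x would make y diverge once it exceeds 2 / 50 = 0.04.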
Race them down a ravine
On an elongated valley, plain SGD bounces from side to side, while Momentum and Adam follow a much more direct path along the valley floor.
The big three
Momentum: Accumulates a decaying running average of past gradients (a velocity), like a heavy ball that builds speed downhill while the side-to-side wobble partly cancels out.
RMSprop: Divides each parameter's step by the root of a running average of its recent squared gradients, so directions with large gradients get smaller steps and directions with small gradients get larger ones.
Adam: Combines both: a velocity term and per-parameter scaling. Robust and fast, and the default choice for most deep learning.
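The three rules above can be sketched in plain Python. This is an illustrative sketch, not a library implementation: the function names, the `state` dictionary, the learning rate of 0.03, and the ravine f(x, y) = 0.5 * (x² + 50y²) are all assumptions made for the example. The momentum variant shown is the common heavy-ball form, and Adam includes its usual bias correction.

```python
import math


def grad(p):
    """Gradient of a hypothetical ravine f(x, y) = 0.5 * (x**2 + 50 * y**2)."""
    return [p[0], 50.0 * p[1]]


def sgd_step(p, g, state, lr=0.03):
    """Plain gradient descent, included for reference."""
    return [pi - lr * gi for pi, gi in zip(p, g)]


def momentum_step(p, g, state, lr=0.03, beta=0.9):
    """Heavy-ball momentum: step along a decaying sum of past gradients."""
    v = state.setdefault("v", [0.0] * len(p))
    for i, gi in enumerate(g):
        v[i] = beta * v[i] + gi
    return [pi - lr * vi for pi, vi in zip(p, v)]


def rmsprop_step(p, g, state, lr=0.03, rho=0.9, eps=1e-8):
    """RMSprop: divide each step by the root of a running mean of g**2."""
    s = state.setdefault("s", [0.0] * len(p))
    for i, gi in enumerate(g):
        s[i] = rho * s[i] + (1.0 - rho) * gi * gi
    return [pi - lr * gi / (math.sqrt(si) + eps) for pi, gi, si in zip(p, g, s)]


def adam_step(p, g, state, lr=0.03, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum-style mean m, RMSprop-style scale v, both bias-corrected."""
    m = state.setdefault("m", [0.0] * len(p))
    v = state.setdefault("v", [0.0] * len(p))
    t = state.get("t", 0) + 1
    state["t"] = t
    new_p = []
    for i, gi in enumerate(g):
        m[i] = beta1 * m[i] + (1.0 - beta1) * gi
        v[i] = beta2 * v[i] + (1.0 - beta2) * gi * gi
        m_hat = m[i] / (1.0 - beta1 ** t)  # undo the bias toward zero at small t
        v_hat = v[i] / (1.0 - beta2 ** t)
        new_p.append(p[i] - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_p


# 1) One step from the same point. The gradient there is (10, 50),
#    so the second coordinate is five times steeper than the first.
start = [10.0, 1.0]
first_move = {}
for name, step in [("sgd", sgd_step), ("momentum", momentum_step),
                   ("rmsprop", rmsprop_step), ("adam", adam_step)]:
    after = step(start, grad(start), {})
    first_move[name] = [abs(a - b) for a, b in zip(start, after)]
    print(f"{name:9s} moved x by {first_move[name][0]:.4f}, y by {first_move[name][1]:.4f}")


# 2) 100 steps of SGD vs. momentum at the same learning rate.
def distance_after(step, steps=100):
    """Distance from the minimum at (0, 0) after running `step` repeatedly."""
    p, state = list(start), {}
    for _ in range(steps):
        p = step(p, grad(p), state)
    return math.hypot(p[0], p[1])


print(f"sgd      distance after 100 steps: {distance_after(sgd_step):.4f}")
print(f"momentum distance after 100 steps: {distance_after(momentum_step):.4f}")
```

On the first step, SGD and momentum move each coordinate in proportion to its gradient (0.3 in x, 1.5 in y), whereas RMSprop and Adam move both coordinates by the same amount (about 0.095 and 0.03 respectively): that is the per-parameter scaling at work. Over 100 steps at the same learning rate, SGD is still about 0.48 from the minimum while momentum ends well under 0.1. RMSprop and Adam are left out of the second comparison because their learning rate means something different (roughly the step size itself), so a fair race would need it tuned separately.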
Which to pick
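Reach for Adam (or AdamW) when:
- You want a strong default that "just works"
- You are training transformers or most other modern architectures
- You don't want to hand-tune learning-rate schedules
Reach for SGD with momentum when:
- You are training CNNs and want the best final accuracy
- You can afford to tune the learning-rate schedule
- You want the better generalization it sometimes gives on vision tasks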
Adam with learning rate 0.001, β₁ = 0.9, and β₂ = 0.999 is the usual starting point. AdamW (Adam with decoupled weight decay, applied directly to the weights rather than added to the gradient) is now standard for transformers.
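The difference between Adam with L2 regularization and AdamW is easiest to see on a single weight. The sketch below is a simplified scalar illustration under stated assumptions: the function `adam_update`, its `decoupled` flag, and the weight-decay value of 0.1 are hypothetical names and numbers for this example, and real libraries differ in details such as the order of operations.

```python
import math


def adam_update(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                weight_decay=0.0, decoupled=False):
    """One Adam step on a scalar weight; returns (new_p, new_m, new_v)."""
    if weight_decay and not decoupled:
        # Classic L2: the decay term is added to the gradient, so it is
        # later rescaled by Adam's adaptive denominator.
        g = g + weight_decay * p
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    new_p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    if weight_decay and decoupled:
        # AdamW-style: shrink the weight directly, outside the adaptive scaling.
        new_p -= lr * weight_decay * p
    return new_p, m, v


# A weight of 2.0 with zero loss gradient, so only weight decay acts on it.
p_l2, _, _ = adam_update(2.0, 0.0, 0.0, 0.0, t=1, weight_decay=0.1)
p_adamw, _, _ = adam_update(2.0, 0.0, 0.0, 0.0, t=1, weight_decay=0.1, decoupled=True)
print(f"Adam + L2: {p_l2:.6f}")
print(f"AdamW:     {p_adamw:.6f}")
```

With the L2 variant the weight moves by about the full learning rate (0.001) regardless of how large `weight_decay` is, because the decay term is divided by its own magnitude. With the decoupled variant the weight shrinks by lr * weight_decay * p = 0.0002, so the decay strength actually follows the setting.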