Optimizers — Momentum, RMSprop, Adam

Why plain gradient descent isn't enough

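Loss surfaces often contain long, narrow valleys. Plain gradient descent zig-zags across such a ravine: the gradient points mostly across the valley, toward the steep walls, rather than along the floor toward the minimum. The learning rate has to stay small enough not to overshoot the steep walls, which leaves it far too small for the shallow direction, so progress crawls.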
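To make the zig-zag concrete, here is a minimal sketch on a made-up quadratic ravine. The function, the curvature values, the starting point, and the learning rate are all illustrative assumptions, not taken from any real model.

```python
import math

# Illustrative ravine: f(x, y) = 0.5 * (x**2 + 1000 * y**2).
# x runs along the valley floor, y runs across the steep walls.
CURVATURE = (1.0, 1000.0)

def grad(w):
    return [c * wi for c, wi in zip(CURVATURE, w)]

w = [10.0, 1.0]
g = grad(w)

# Cosine of the angle between the descent direction (-g) and the
# straight line from w to the minimum at the origin (-w).
dot = sum(gi * wi for gi, wi in zip(g, w))
cosine = dot / (math.hypot(*g) * math.hypot(*w))
angle_deg = math.degrees(math.acos(cosine))
print(f"descent direction is {angle_deg:.1f} degrees off the path to the minimum")

# A few plain gradient-descent steps: y flips sign every step (the zig-zag)
# while x, the coordinate that still has far to go, barely moves.
lr = 0.0018  # must stay below 2 / 1000 or the y coordinate diverges
path = [w]
for _ in range(5):
    w = [wi - lr * gi for wi, gi in zip(w, grad(w))]
    path.append(w)
for x, y in path:
    print(f"x = {x:.3f}, y = {y:+.3f}")
```

The learning rate is capped by the steepest direction, so the shallow direction gets steps roughly a thousand times smaller than it could tolerate.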

The optimizers below are update rules designed to reduce this zig-zagging. They keep a memory of past gradients and adapt the step size, and in practice they usually reach a good minimum faster and more reliably than plain gradient descent.

Race them down a ravine

On an elongated valley, plain SGD bounces from side to side, while Momentum and Adam take a much more direct path down the middle.
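The race can be approximated with a short simulation. This is a sketch under stated assumptions: the quadratic ravine, the starting point, the step count, and every learning rate below are illustrative choices. Adam gets a larger nominal learning rate because its steps are normalized per coordinate, so the three rates are not directly comparable.

```python
import math

# Illustrative ravine: f(x, y) = 0.5 * (x**2 + 1000 * y**2), minimum at (0, 0).
CURVATURE = (1.0, 1000.0)

def loss(w):
    return 0.5 * sum(c * wi * wi for c, wi in zip(CURVATURE, w))

def grad(w):
    return [c * wi for c, wi in zip(CURVATURE, w)]

def run_sgd(w, steps, lr):
    for _ in range(steps):
        w = [wi - lr * gi for wi, gi in zip(w, grad(w))]
    return w

def run_momentum(w, steps, lr, beta=0.9):
    v = [0.0] * len(w)  # velocity: decaying sum of past gradients
    for _ in range(steps):
        v = [beta * vi + gi for vi, gi in zip(v, grad(w))]
        w = [wi - lr * vi for wi, vi in zip(w, v)]
    return w

def run_adam(w, steps, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    m = [0.0] * len(w)  # running average of gradients
    s = [0.0] * len(w)  # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        s = [beta2 * si + (1 - beta2) * gi * gi for si, gi in zip(s, g)]
        # Bias-corrected averages, then a per-coordinate normalized step.
        w = [
            wi - lr * (mi / (1 - beta1 ** t)) / (math.sqrt(si / (1 - beta2 ** t)) + eps)
            for wi, mi, si in zip(w, m, s)
        ]
    return w

start = [10.0, 1.0]
steps = 200
results = {
    "sgd": run_sgd(start, steps, lr=0.0018),
    "momentum": run_momentum(start, steps, lr=0.0018),
    "adam": run_adam(start, steps, lr=0.1),
}
for name, w in results.items():
    print(f"{name:9s} final loss = {loss(w):.6f}")
```

With these particular settings, plain SGD should still be far from the minimum after 200 steps while the other two end much closer. The exact numbers, and which of Momentum and Adam finishes ahead, depend on the hyperparameters chosen here.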

The big three

Momentum (velocity)

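Accumulate an exponentially decaying running average of past gradients and step along it instead of the raw gradient. The effect is like a heavy ball that builds speed downhill: gradient components that keep pointing the same way add up, while components that flip sign from step to step largely cancel, which damps the side-to-side wobble.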
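A minimal sketch of one momentum update follows. The function name `momentum_step` and the default values are illustrative assumptions. This version keeps a decaying sum of gradients, which equals the running average up to a constant factor of 1/(1 − beta); libraries differ on which form they use.

```python
def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """One momentum update on lists of floats; returns (new_w, new_v)."""
    # Velocity: old velocity decayed by beta, plus the new gradient.
    v = [beta * vi + gi for vi, gi in zip(v, grad)]
    # Step along the velocity rather than the raw gradient.
    w = [wi - lr * vi for wi, vi in zip(w, v)]
    return w, v

# Feed a gradient whose first component is steady and whose second flips sign,
# mimicking "along the valley" versus "across the valley".
w, v = [0.0, 0.0], [0.0, 0.0]
for step in range(20):
    g = [1.0, 1.0 if step % 2 == 0 else -1.0]
    w, v = momentum_step(w, v, g)
print("velocity along the steady direction:", v[0])
print("velocity along the flipping direction:", v[1])
```

The steady component accumulates to several times the size of a single gradient, while the alternating component stays smaller than a single gradient because consecutive terms cancel.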

RMSprop (adaptive step)

Divide each parameter's step by the square root of a running average of its recent squared gradients. Directions with consistently large gradients get smaller steps, and directions with small gradients get larger ones.
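The scaling rule can be sketched in a few lines. The name `rmsprop_step`, the decay rate `rho=0.9`, and the other defaults are illustrative assumptions rather than any particular library's settings.

```python
import math

def rmsprop_step(w, s, grad, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSprop update on lists of floats; returns (new_w, new_s)."""
    # Running average of squared gradients, one entry per parameter.
    s = [rho * si + (1 - rho) * gi * gi for si, gi in zip(s, grad)]
    # Divide each gradient by the root of its own running average.
    w = [wi - lr * gi / (math.sqrt(si) + eps) for wi, gi, si in zip(w, grad, s)]
    return w, s

# Two parameters whose gradients differ by a factor of 10,000.
w, s = [0.0, 0.0], [0.0, 0.0]
for _ in range(50):
    w, s = rmsprop_step(w, s, grad=[100.0, 0.01])
print(w)
```

Despite the huge difference in gradient size, both parameters end up having moved almost the same distance, because each step is normalized by that parameter's own gradient scale.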

Adam (momentum + RMSprop)

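Combines both: a momentum-style running average of the gradient and RMSprop-style per-parameter scaling, with a bias correction for the first few steps, when both averages still start from zero. It is robust across many problems and fast, which makes it the usual default in deep learning.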
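Putting the two pieces together gives roughly the following update. This is a sketch, with `adam_step` as an illustrative name; the hyperparameter defaults are the commonly quoted ones, and `eps=1e-8` is an assumed small constant to avoid division by zero.

```python
import math

def adam_step(w, m, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (counted from 1); returns (new_w, new_m, new_s)."""
    # Momentum part: running average of gradients.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    # RMSprop part: running average of squared gradients.
    s = [beta2 * si + (1 - beta2) * gi * gi for si, gi in zip(s, grad)]
    new_w = []
    for wi, mi, si in zip(w, m, s):
        # Bias correction: both averages start at zero and are too small early on.
        m_hat = mi / (1 - beta1 ** t)
        s_hat = si / (1 - beta2 ** t)
        new_w.append(wi - lr * m_hat / (math.sqrt(s_hat) + eps))
    return new_w, m, s

# First step with gradients of very different size and opposite sign.
w, m, s = adam_step([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], grad=[100.0, -0.01], t=1)
print(w)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by about one learning rate regardless of how large its gradient is.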

Which to pick

Use Adam when
  • You want a strong default that "just works"
  • Transformers, most modern nets
  • You don't want to hand-tune schedules

Use SGD + Momentum when
  • Training CNNs for best final accuracy
  • You can afford to tune the schedule
  • You want better generalization in some vision tasks

Defaults

Adam with learning rate 0.001, β₁=0.9, and β₂=0.999 is the canonical starting point. AdamW, which is Adam with weight decay decoupled from the gradient-based update, is now standard for transformers.
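The difference between the two kinds of weight decay can be seen in a small sketch. The function below is illustrative, not a library API; the `decoupled` flag and the value `weight_decay=0.01` are assumptions chosen for the demonstration. The gradient of the loss is set to zero so that only the decay acts on the weights.

```python
import math

def adam_step(w, m, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.01, decoupled=True):
    """One Adam step with weight decay applied in one of two ways."""
    if not decoupled:
        # L2 penalty: the decay term joins the gradient, so Adam's
        # per-parameter normalization rescales it like any other gradient.
        grad = [gi + weight_decay * wi for gi, wi in zip(grad, w)]
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    s = [beta2 * si + (1 - beta2) * gi * gi for si, gi in zip(s, grad)]
    new_w = []
    for wi, mi, si in zip(w, m, s):
        m_hat = mi / (1 - beta1 ** t)
        s_hat = si / (1 - beta2 ** t)
        update = m_hat / (math.sqrt(s_hat) + eps)
        if decoupled:
            # AdamW-style: shrink the weight directly, outside the normalization.
            update += weight_decay * wi
        new_w.append(wi - lr * update)
    return new_w, m, s

w0 = [1.0, 10.0]
zero = [0.0, 0.0]
l2_w, _, _ = adam_step(w0, zero, zero, grad=zero, t=1, decoupled=False)
adamw_w, _, _ = adam_step(w0, zero, zero, grad=zero, t=1, decoupled=True)
print("L2 inside Adam: ", l2_w)
print("decoupled decay:", adamw_w)
```

With the L2 form, both weights shrink by about the same absolute amount on this step because the normalization erases their difference in size. With decoupled decay, each weight shrinks in proportion to its own magnitude, which is the behavior weight decay is usually meant to have.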