Adam (2015) — The Optimizer Everyone Defaults To

The world before this paper

By 2014, training a deep network meant babysitting a learning rate. One global step size had to serve millions of parameters with wildly different gradient scales — too high and the loss exploded, too low and training crawled for days. The fixes on offer each solved one symptom and ignored the rest, and every new architecture meant retuning from scratch.

Plain SGD: one rate for all

Every parameter gets the same learning rate. Sparse features crawl while dense ones oscillate — no single number fits both.

Momentum: direction, not scale

Averaging past gradients smooths the direction of descent, but does nothing about per-parameter step size.

AdaGrad → RMSProp: scale, half-solved

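AdaGrad adapted each parameter's rate, but it sums every squared gradient it has ever seen, so the effective step size keeps shrinking toward zero as training goes on. RMSProp patched that by replacing the running sum with a decaying average, and it was introduced in Geoffrey Hinton's lecture slides, not a paper.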
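The difference between the two is easy to see in a toy setting. The sketch below feeds a constant gradient of 1.0 to both rules and records the effective step size; the function names, the learning rate of 0.1, and the decay of 0.9 are illustrative assumptions, not values from either method's original description.

```python
import math

def adagrad_step_sizes(grads, lr=0.1, eps=1e-8):
    """Effective step size lr / sqrt(sum of all squared gradients so far)."""
    total = 0.0
    sizes = []
    for g in grads:
        total += g * g                      # the history only ever grows
        sizes.append(lr / (math.sqrt(total) + eps))
    return sizes

def rmsprop_step_sizes(grads, lr=0.1, decay=0.9, eps=1e-8):
    """Effective step size lr / sqrt(decaying average of squared gradients)."""
    avg = 0.0
    sizes = []
    for g in grads:
        avg = decay * avg + (1.0 - decay) * g * g   # old gradients fade out
        sizes.append(lr / (math.sqrt(avg) + eps))
    return sizes

grads = [1.0] * 1000                        # assumed: a constant gradient
ada = adagrad_step_sizes(grads)
rms = rmsprop_step_sizes(grads)
print(f"AdaGrad  step 1: {ada[0]:.4f}   step 1000: {ada[-1]:.4f}")
print(f"RMSProp  step 1: {rms[0]:.4f}   step 1000: {rms[-1]:.4f}")
```

With the same gradient every step, AdaGrad's step size falls like 1/√t, while RMSProp's settles at the base learning rate. RMSProp's very first step comes out larger than that, because its average starts at zero and has not caught up yet.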

The key idea

Diederik Kingma, a PhD student in Amsterdam, and Jimmy Ba, a student in Geoffrey Hinton's Toronto lab, looked at this zoo of tricks and made a simple bet: momentum and RMSProp weren't rivals — they were two halves of one update. Momentum knew the direction; RMSProp knew the scale. Their paper — Kingma & Ba, "Adam: A Method for Stochastic Optimization", ICLR 2015 — bolted the two together and added one quietly crucial fix. Both moving averages start at zero, so early in training they underestimate the true gradient statistics. Adam divides out that cold-start bias, which is why it doesn't stumble through its first steps the way naive combinations did.
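The cold-start bias is easy to reproduce with a single moving average. In this minimal sketch (the helper name and the constant input of 2.0 are assumptions for illustration), the raw average starts far below the true value, and dividing by 1 − βᵗ recovers it:

```python
def ema_with_correction(values, beta):
    """Return (raw, corrected) exponential moving averages, both started at zero."""
    avg = 0.0
    raw, corrected = [], []
    for t, x in enumerate(values, start=1):
        avg = beta * avg + (1.0 - beta) * x
        raw.append(avg)
        corrected.append(avg / (1.0 - beta ** t))   # divide out the cold-start bias
    return raw, corrected

raw, corrected = ema_with_correction([2.0] * 5, beta=0.9)
for t, (r, c) in enumerate(zip(raw, corrected), start=1):
    print(f"t={t}  raw={r:.4f}  corrected={c:.4f}")
```

For a constant input the raw average after t steps equals (1 − βᵗ) times that input, which is exactly the factor the correction removes. With noisy gradients the correction holds in expectation, not exactly.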

The paper in one sentence

Keep two exponential moving averages per parameter — the gradient (momentum, the direction) and the squared gradient (the scale) — correct their cold-start bias, and step each parameter by direction ÷ scale.

The whole update is just θ ← θ − α · m̂ / (√v̂ + ε) per parameter: a momentum numerator over an RMSProp denominator, scaled by the learning rate α. Want the full mechanics? See the Optimizers tour.
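Spelled out as code, one step of the update looks roughly like this. It is a plain-Python sketch, not any framework's implementation; the function name adam_step and the list-based layout are assumptions, while the default hyperparameters are the ones the paper recommends.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over lists of parameters; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # direction: average gradient
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # scale: average squared gradient
        m_hat = m_i / (1.0 - beta1 ** t)             # bias-corrected averages
        v_hat = v_i / (1.0 - beta2 ** t)
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

# Two parameters whose gradients differ by six orders of magnitude.
theta = [1.0, 1.0]
grad = [1000.0, 0.001]
theta, m, v = adam_step(theta, grad, m=[0.0, 0.0], v=[0.0, 0.0], t=1)
print(theta)
```

On the first step both parameters move by almost exactly the learning rate, despite the huge gap in gradient size, because the ratio m̂ / √v̂ cancels the gradient's magnitude and keeps mostly its sign.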

Watch it tame the valley

Picture the loss surface that breaks plain SGD: steep across, shallow along, a narrow canyon. SGD ricochets between the walls, momentum bends the ricochet into a curve, per-parameter scaling rebalances the steps, and Adam glides straight down the canyon floor.
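A rough way to reproduce the two ends of that comparison is a toy quadratic canyon. Everything in this sketch is an assumption chosen for illustration: the surface f(x, y) = 0.5 · (100x² + y²), the start point (1, 1), the learning rates, and the helper names. It covers only plain SGD and Adam, not the intermediate methods.

```python
import math

def grad(p):
    """Gradient of the toy canyon f(x, y) = 0.5 * (100 * x**2 + y**2)."""
    x, y = p
    return (100.0 * x, y)

def run_sgd(start, lr, steps):
    """Plain gradient descent: one learning rate for both coordinates."""
    p = start
    path = [p]
    for _ in range(steps):
        g = grad(p)
        p = (p[0] - lr * g[0], p[1] - lr * g[1])
        path.append(p)
    return path

def run_adam(start, lr, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam on the same surface, with per-coordinate moving averages."""
    p = start
    m = [0.0, 0.0]
    v = [0.0, 0.0]
    path = [p]
    for t in range(1, steps + 1):
        g = grad(p)
        new_p = []
        for i in range(2):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] * g[i]
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            new_p.append(p[i] - lr * m_hat / (math.sqrt(v_hat) + eps))
        p = tuple(new_p)
        path.append(p)
    return path

def sign_flips(path, axis):
    """Count how often a coordinate changes sign between consecutive points."""
    return sum(1 for a, b in zip(path, path[1:]) if a[axis] * b[axis] < 0)

sgd_path = run_sgd((1.0, 1.0), lr=0.019, steps=50)
adam_path = run_adam((1.0, 1.0), lr=0.1, steps=50)

for name, path in (("SGD", sgd_path), ("Adam", adam_path)):
    dx = abs(path[1][0] - path[0][0])
    dy = abs(path[1][1] - path[0][1])
    print(f"{name:5s} first step: across={dx:.4f} along={dy:.4f} "
          f"wall-to-wall flips={sign_flips(path, 0)} final={path[-1]}")
```

With these settings SGD's first step is 100 times larger across the canyon than along it, and the across coordinate changes sign on every step, while Adam's first step is about the same size in both directions. How the later steps play out depends on the learning rates chosen, so the printed trajectories are worth inspecting, not taking on faith.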

The results that mattered

The paper's benchmarks mattered less than what practitioners discovered immediately: the same handful of numbers worked nearly everywhere. No per-problem tuning, no schedules to hand-craft — just call it and train.

The defaults: β₁=0.9, β₂=0.999

The 2015 recommendations, together with ε=1e−8 and a learning rate of 0.001, still ship as the defaults in the major frameworks, give or take a slightly different ε in some of them (Keras, for example, uses 1e−7).

The bookkeeping: 2 moving averages

That's all it takes to be both direction-aware and scale-aware: one average of gradients, one of squared gradients, per parameter.
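The cost of that bookkeeping is simple arithmetic. The sketch below assumes both averages are stored as 32-bit floats, which is an assumption about the setup and not something the paper prescribes; the helper name is likewise illustrative.

```python
def adam_state_bytes(num_params, bytes_per_value=4):
    """Extra optimizer memory: two moving averages (m and v) per parameter."""
    return 2 * num_params * bytes_per_value

for n in (1_000_000, 100_000_000, 1_000_000_000):
    gib = adam_state_bytes(n) / 2**30
    print(f"{n:>13,} params -> {gib:8.3f} GiB of Adam state (fp32 assumed)")
```

Under that assumption the optimizer state is twice the size of the fp32 weights themselves, which is negligible for small models and a real budget item for very large ones.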

The staying power: ~10 years

And counting. A decade after publication, Adam and its successor AdamW remain the field's default optimizer.

Legacy — and the catch

What it unlocked

Robust across architectures with almost no tuning
Handles sparse and noisy gradients gracefully
The de-facto standard: every major framework ships it with near-identical defaults

The limits

Can generalize slightly worse than tuned SGD on some vision tasks
Stores two extra values per parameter — memory overhead
Original weight-decay handling was subtly wrong (fixed by AdamW)

That last limit got a proper fix in 2017: AdamW decoupled weight decay from the adaptive step, and that variant is what most transformers train with today. The core idea is untouched, though — ten years on, the one-line optimizer = Adam(...) in your framework is still the 2015 paper running essentially as written.
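The difference between the two weight-decay treatments shows up even on a single parameter. This is a scalar sketch under stated assumptions: the function names are hypothetical, the decay is scaled by the learning rate (one common convention), and the gradient is set to zero so that only the decay term acts.

```python
import math

def adam_l2_step(theta, grad, m, v, t, lr=1e-3, wd=0.01,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with L2 regularization: the decay term is folded into the gradient."""
    g = grad + wd * theta
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(theta, grad, m, v, t, lr=1e-3, wd=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW-style: adaptive step from the raw gradient, decay applied separately."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * wd * theta                  # decoupled weight decay
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Zero loss gradient, so any movement comes from weight decay alone.
for wd in (0.01, 0.0001):
    l2_theta, _, _ = adam_l2_step(2.0, 0.0, 0.0, 0.0, t=1, wd=wd)
    w_theta, _, _ = adamw_step(2.0, 0.0, 0.0, 0.0, t=1, wd=wd)
    print(f"wd={wd}: Adam+L2 moved {2.0 - l2_theta:.2e}, AdamW moved {2.0 - w_theta:.2e}")
```

In the L2 version the decay term passes through the same normalization as the gradient, so on this first step the parameter moves by roughly the learning rate whether the decay coefficient is 0.01 or 0.0001. In the decoupled version the shrinkage is proportional to the coefficient, which is the behavior weight decay is meant to have.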

Go deeper

Read the original: arXiv:1412.6980. For the mechanics behind this story, see the Optimizers tour, Gradient descent, and Learning rate. Next paper: ResNet (2015).