Learning Rate Explained · Suman Bhadra Notes

The most important dial

In the gradient descent update w ← w − η·gradient, the learning rate η sets how large a step each update takes against the gradient. Get it wrong and training fails; get it right and everything else works better.
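To make the update rule concrete, here is a minimal sketch of it applied to a one-parameter toy loss. The loss f(w) = (w − 3)², the starting point, and the names `gradient_descent_step` and `grad_f` are illustrative assumptions, not something the notes specify.

```python
def gradient_descent_step(w, grad, lr):
    """Apply one update w <- w - lr * grad to a scalar parameter."""
    return w - lr * grad


def grad_f(w):
    """Gradient of the assumed toy loss f(w) = (w - 3)**2."""
    return 2.0 * (w - 3.0)


w = 0.0   # arbitrary starting point
lr = 0.1  # the learning rate (eta in the text)
for _ in range(50):
    w = gradient_descent_step(w, grad_f(w), lr)

print(w)  # ends up close to the minimizer w = 3
```

With this loss, each step shrinks the distance to the minimum by a factor of (1 − 2η), so the choice of η directly controls how quickly, or whether, the iterate gets there.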

It's a Goldilocks problem. Too small and you'll wait forever; too large and you'll bounce past the minimum or blow up entirely.

Watch all three regimes

The same loss curve, descended with a learning rate that's too small, just right, and too large.

The three regimes

Too small: η too small

Tiny steps. Training converges eventually, but it needs far more updates than necessary and may stall in a flat region of the loss.

Just right: η balanced

Steady, fast progress straight to the minimum. The goal.

Too large: η too big

Overshoots the minimum, oscillates, and can diverge: the loss grows without bound until it overflows to inf or NaN.
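The three regimes can be reproduced on a toy problem. The sketch below assumes the loss f(w) = w², for which each update multiplies w by (1 − 2η), so the iterate shrinks only when 0 < η < 1. The function name `descend` and the three η values are chosen for illustration.

```python
def descend(lr, steps=20, w0=1.0):
    """Run gradient descent on f(w) = w**2 and return the loss after each step."""
    w = w0
    losses = []
    for _ in range(steps):
        grad = 2.0 * w     # derivative of w**2
        w = w - lr * grad  # w <- w - eta * gradient
        losses.append(w * w)
    return losses


too_small = descend(lr=0.001)  # barely moves in 20 steps
just_right = descend(lr=0.3)   # loss collapses toward 0
too_large = descend(lr=1.1)    # sign of w flips each step and |w| grows

for name, losses in [("too small", too_small),
                     ("just right", just_right),
                     ("too large", too_large)]:
    print(f"{name:>10}: first loss {losses[0]:.3g}, last loss {losses[-1]:.3g}")
```

The threshold η = 1 is specific to this quadratic. On a real model the stable range depends on the curvature of the loss and is not known in advance, which is why the value has to be tuned.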

Schedules and finding a good value

Decay: shrink over time

Start with a larger rate to move fast, then reduce it to settle precisely. Common schedules are step, cosine, and exponential decay.
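The three decay schedules named above can each be written as a function of the step count. This is a sketch under assumed defaults: the drop factor, decay rate, and interval lengths are placeholders to tune, not recommended values.

```python
import math


def step_decay(base_lr, step, drop=0.5, every=1000):
    """Multiply the rate by `drop` once every `every` steps."""
    return base_lr * drop ** (step // every)


def exponential_decay(base_lr, step, rate=0.999):
    """Multiply the rate by `rate` on every step."""
    return base_lr * rate ** step


def cosine_decay(base_lr, step, total_steps, min_lr=0.0):
    """Follow half a cosine wave from base_lr down to min_lr over total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 1000, 2500, 5000):
    print(step,
          step_decay(0.1, step),
          exponential_decay(0.1, step),
          cosine_decay(0.1, step, total_steps=5000))
```

Step decay changes the rate in sudden jumps, while the exponential and cosine forms change it smoothly. Cosine decay also needs the total number of training steps up front.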

Warmup: ramp up first

Begin with a tiny rate and ramp up to the target value over the first steps of training. This is standard for training transformers.
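One simple form of warmup is a linear ramp. The sketch below assumes a hypothetical `warmup_steps` of 1000 and holds the rate constant afterwards; in practice the ramp is often followed by a decay schedule.

```python
def warmup_lr(base_lr, step, warmup_steps=1000):
    """Linear warmup: ramp from near 0 up to base_lr, then hold at base_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


for step in (0, 99, 499, 999, 5000):
    print(step, warmup_lr(1e-3, step))
```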

LR finder: sweep it

Increase the LR a little each step and plot loss against LR; pick a value just before the loss blows up.
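A rough sketch of the sweep, assuming the toy loss f(w) = w² in place of a real model. The function name `lr_range_test`, the exponential spacing of the rates, and the stop rule (halt once the loss exceeds four times the best seen so far) are illustrative choices, not a fixed recipe.

```python
import math


def lr_range_test(grad_fn, loss_fn, w0, lr_min=1e-4, lr_max=10.0, steps=100):
    """Raise the LR exponentially each step and record (lr, loss) pairs."""
    w = w0
    best = loss_fn(w)
    history = []
    for i in range(steps):
        lr = lr_min * (lr_max / lr_min) ** (i / (steps - 1))
        w = w - lr * grad_fn(w)
        loss = loss_fn(w)
        history.append((lr, loss))
        # Assumed stop rule: the loss has clearly started to blow up.
        if not math.isfinite(loss) or loss > 4.0 * best:
            break
        best = min(best, loss)
    return history


history = lr_range_test(grad_fn=lambda w: 2.0 * w,
                        loss_fn=lambda w: w * w,
                        w0=1.0)
lr_at_min, min_loss = min(history, key=lambda pair: pair[1])
print(f"sweep stopped at lr={history[-1][0]:.3g}; lowest loss at lr={lr_at_min:.3g}")
```

In a real run you would plot `history` with the LR on a log axis and choose a rate somewhat below the point where the loss turns upward, since the rate at the very bottom of the curve is already close to unstable.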

Good starting points

0.001 for Adam, 0.1 for SGD, then tune by factors of 10. Adaptive optimizers like Adam ease (but don't remove) the sensitivity to this choice.
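Tuning by factors of 10 amounts to a coarse search over a logarithmic grid. The sketch below again assumes plain gradient descent on the toy loss f(w) = w²; the helper `final_loss` and the grid bounds are hypothetical, and on a real model each candidate would be a short training run scored on validation loss.

```python
def final_loss(lr, steps=50, w0=1.0):
    """Loss on f(w) = w**2 after a fixed number of gradient descent steps."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w * w


candidates = [10.0 ** k for k in range(-4, 1)]  # 1e-4, 1e-3, 1e-2, 1e-1, 1
results = {lr: final_loss(lr) for lr in candidates}
best_lr = min(results, key=results.get)

for lr, loss in results.items():
    print(f"lr={lr:g}  final loss={loss:.3g}")
print("best:", best_lr)
```

Once the best power of 10 is known, a finer search around it (for example, multiples of about 3) can narrow the value further.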