Learning Rate Explained
The most important dial
In the gradient descent update w ← w − η·gradient, the learning rate η sets how big a step to take in the direction that reduces the loss. Get it wrong and training stalls or diverges; get it right and the rest of the training setup works better.
It's a Goldilocks problem. Too small and you'll wait forever; too large and you'll bounce past the minimum or blow up entirely.
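To make the update rule concrete, here is a minimal sketch in plain Python. The toy loss L(w) = (w − 3)², the helper names `gradient_descent` and `toy_grad`, and the values of `lr` and `steps` are all illustrative assumptions, not part of the original explanation.

```python
def gradient_descent(grad, w0, lr, steps):
    """Repeatedly apply w <- w - lr * grad(w) and return every iterate."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - lr * grad(w)  # the update rule: step against the gradient
        history.append(w)
    return history


def toy_grad(w):
    # Gradient of the assumed toy loss L(w) = (w - 3)^2, minimized at w = 3.
    return 2.0 * (w - 3.0)


path = gradient_descent(toy_grad, w0=0.0, lr=0.1, steps=50)
print(f"final w = {path[-1]:.5f}")  # ends close to the minimum at w = 3
```

Changing only `lr` in the last call is enough to see the behaviour change, which is why the learning rate is usually the first value to tune.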
Figure (animation): all three regimes. The same loss curve, descended with a learning rate that's too small, just right, and too large.
The three regimes
Too small: Tiny steps. Converges eventually, but training takes forever and may stall in a flat region.
Just right: Steady, fast progress straight to the minimum. The goal.
Too large: Overshoots the minimum, oscillates, and can diverge, with the loss growing until it shows up as NaN.
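The three regimes can be reproduced on a toy problem. The sketch below assumes the loss L(w) = w², so each update multiplies w by (1 − 2·lr). The function name `final_loss` and the three learning rates are illustrative choices for this toy loss only, not recommended values.

```python
def final_loss(lr, steps=50, w0=1.0):
    """Run gradient descent on L(w) = w^2 and return the loss at the end."""
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * w  # gradient of w^2 is 2w
    return w * w


for label, lr in [("too small", 0.001), ("just right", 0.4), ("too large", 1.1)]:
    print(f"{label:>10}  lr={lr:<6} final loss = {final_loss(lr):.3e}")
```

With lr = 0.001 the loss barely moves in 50 steps. With lr = 0.4 it collapses toward zero. With lr = 1.1 the multiplier is −1.2, so w flips sign each step and grows in magnitude, which is the oscillate-then-diverge pattern described above. On a real network the same runaway growth is what eventually overflows to NaN.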
Schedules and finding a good value
Decay: Start with a larger learning rate to move fast, then reduce it to settle precisely (step, cosine, or exponential decay).
Warmup: Begin with a tiny learning rate and ramp up over the first steps. This is standard for training transformers.
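Warmup and decay are often combined into a single schedule. The sketch below assumes linear warmup followed by cosine decay. The function name `lr_at_step` and the defaults for `peak_lr`, `warmup_steps`, and `min_lr` are hypothetical values chosen for illustration.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1e-3, warmup_steps=100, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine curve: equals peak_lr at progress 0 and min_lr at progress 1.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 99, 100, 550, 1000):
    print(step, f"{lr_at_step(step, total_steps=1000):.6f}")
```

In a training loop the returned value would be written into the optimizer before each update. Deep learning frameworks ship their own scheduler classes, so a hand-rolled function like this is mainly useful for seeing the shape of the schedule.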
LR range test: Increase the LR each step and plot the loss against the LR; pick a value just before the loss blows up.
Common starting points: 0.001 for Adam, 0.1 for SGD, then tune by factors of 10. Adaptive optimizers like Adam ease (but don't remove) the sensitivity.
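The procedure of raising the LR each step and watching the loss can be sketched on a toy problem. Everything here is an assumption for illustration: the loss L(w) = w², the names `lr_range_test` and `suggest_lr`, the sweep bounds, and the rule that stops the sweep once the loss exceeds `blowup_factor` times the best loss seen so far.

```python
def lr_range_test(loss_fn, grad_fn, w0, lr_min=1e-4, lr_max=10.0,
                  num_steps=100, blowup_factor=4.0):
    """Sweep the LR geometrically from lr_min to lr_max, one update per value.

    Returns a list of (lr, loss_after_step) pairs, ending early on blow-up.
    """
    growth = (lr_max / lr_min) ** (1.0 / (num_steps - 1))
    w, lr = w0, lr_min
    best_loss = loss_fn(w)
    history = []
    for _ in range(num_steps):
        w = w - lr * grad_fn(w)
        loss = loss_fn(w)
        history.append((lr, loss))
        if loss > blowup_factor * best_loss:
            break  # the loss has blown up; larger rates are not worth trying
        best_loss = min(best_loss, loss)
        lr *= growth
    return history


def suggest_lr(history):
    # The LR at which the loss was lowest, i.e. the last point before it climbs.
    return min(history, key=lambda pair: pair[1])[0]


history = lr_range_test(lambda w: w * w, lambda w: 2.0 * w, w0=1.0)
print(f"sweep stopped after {len(history)} steps")
print(f"LR at lowest loss: {suggest_lr(history):.3f}")
```

On this toy loss the lowest-loss LR sits right at the edge of stability, so a more cautious choice is a value somewhat below it. On a real model the curve of loss against LR is noisy, and reading the plot by eye is the usual approach.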