Gradient Clipping
One bad batch can launch your weights into space
Every training step updates the weights by w ← w − η·g, where η is the learning rate and g is the gradient of the loss. Most of the time g is modest. But the loss landscape has cliffs, and one batch with an exploding gradient produces an update so large it throws the weights somewhere terrible. The loss spikes or overflows to NaN, and a run that was going fine for hours dies in a single step.
Some regions of the loss surface are nearly vertical. Land there and the gradient is suddenly hundreds of times bigger than usual.
A single oversized update can undo everything learned so far — or push the weights into a range where the loss overflows.
Gradient clipping is the seatbelt: before each update, measure how big the gradient is — and if it's over a threshold, shrink it.
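To put numbers on the problem, here is a minimal sketch of the update rule w ← w − η·g applied once with an ordinary gradient and once with a cliff-sized one. The learning rate, the weight, both gradient values, and the helper name `sgd_step` are made up for illustration, not taken from a real run.

```python
def sgd_step(w, g, lr):
    """One plain SGD update: w <- w - lr * g."""
    return w - lr * g


lr = 0.1           # hypothetical learning rate (eta)
w = 2.0            # hypothetical current weight

typical_g = 0.5    # an ordinary gradient
cliff_g = 250.0    # same parameter on a near-vertical stretch, 500x larger

w_typical = sgd_step(w, typical_g, lr)   # weight moves by 0.05
w_cliff = sgd_step(w, cliff_g, lr)       # weight moves by 25.0

print("after a typical step:", w_typical)
print("after a cliff step:  ", w_cliff)
```

With the same learning rate, the cliff step moves the weight 500 times farther than the typical one, which is the kind of jump clipping is meant to cap.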
Watch a cliff wreck training — then watch clipping save it
Interactive demo: a parameter slides down a loss surface with a cliff. It shows the unclipped catapult, the norm-clipped rescue, value versus norm clipping on the same gradient vector, and the two loss curves with and without the safety rail.
Clip by norm — the standard
Gather every gradient in the model into one giant vector g and measure its length, the global L2 norm ‖g‖. If it's within the threshold c, do nothing. If it's bigger, rescale the whole thing: g ← g · (c / ‖g‖). The gradient still points in exactly the same direction; it just can't be longer than c. In PyTorch it's one line, placed after loss.backward() and before optimizer.step(): torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).
One norm over all parameters' gradients together — not per layer, not per weight.
Within the threshold? The gradient passes through untouched — and most steps do.
Shrink the whole vector down to length c. Direction preserved, length capped.
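The rule above can be sketched in plain Python. This is an illustration of the idea, not PyTorch's implementation; the function name `clip_by_global_norm` and the two-"layer" gradient values are assumptions chosen so the arithmetic is easy to follow.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale per-parameter gradients so their joint L2 norm is at most max_norm.

    grads is a list of gradient lists, one per parameter tensor.
    Returns (clipped_grads, norm_before_clipping).
    """
    # One norm over every component of every parameter's gradient.
    total_norm = math.sqrt(sum(x * x for g in grads for x in g))
    if total_norm <= max_norm:
        # Within the threshold: pass through untouched.
        return [list(g) for g in grads], total_norm
    # Over the threshold: shrink everything by the same factor c / ||g||.
    scale = max_norm / total_norm
    return [[x * scale for x in g] for g in grads], total_norm


# Two hypothetical "layers"; joint norm is sqrt(3^2 + 4^2 + 12^2) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(x * x for g in clipped for x in g))

print("norm before:", norm_before)
print("norm after: ", norm_after)
```

Because every component is multiplied by the same factor, the ratios between components (and so the direction) are unchanged; only the length drops to the threshold.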
The simpler cousin is clip by value: clamp each component of g to [−c, c] independently. Easy to implement, but big components get flattened while small ones don't — the gradient's direction bends. That's why norm clipping is the usual default.
Caps the length, keeps the direction. The default in RNN and LLM training.
Caps each component on its own. Simpler, but it changes where the update points.
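To see the direction change concretely, the sketch below applies both kinds of clipping to the same two-component gradient and compares each result with the original using cosine similarity (1.0 means the direction is identical). The vector, the threshold, and the helper names are illustrative assumptions.

```python
import math


def clip_by_value(g, c):
    """Clamp each component to [-c, c] independently."""
    return [max(-c, min(c, x)) for x in g]


def clip_by_norm(g, c):
    """Rescale the whole vector if its L2 norm exceeds c."""
    norm = math.sqrt(sum(x * x for x in g))
    return list(g) if norm <= c else [x * c / norm for x in g]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


g = [10.0, 0.5]                   # one large component, one small
by_value = clip_by_value(g, 1.0)  # large component flattened, small one kept
by_norm = clip_by_norm(g, 1.0)    # both components shrunk by the same factor

print("value clip:", by_value, "cosine vs original:", cosine(g, by_value))
print("norm clip: ", by_norm, "cosine vs original:", cosine(g, by_norm))
```

Value clipping turns a 20:1 ratio between the components into 2:1, so the update points somewhere noticeably different; norm clipping keeps the 20:1 ratio.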
Using it well
A max-norm of 1.0 is a common default. Log your gradient norms early in training and set c just above the typical range.
Clipping rescues the occasional bad batch. It won't fix a learning rate that's too high — that needs lowering, not capping.
Occasional clips: healthy. Clipping on nearly every step: your LR, init or data has a problem the clip is merely hiding.
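One possible way to turn these tips into a routine is sketched below: pick c slightly above the bulk of the logged norms, then check how often that threshold would trigger. The helper names, the 90th-percentile rule, the 10% margin, and the sample norms are all assumptions made for illustration, not a standard recipe.

```python
def suggest_max_norm(logged_norms, percentile=90, margin=1.1):
    """Pick a threshold slightly above the typical range of logged norms.

    percentile (an integer) and margin are illustrative knobs.
    """
    ordered = sorted(logged_norms)
    # Integer arithmetic keeps the index computation exact.
    idx = min(len(ordered) - 1, len(ordered) * percentile // 100)
    return ordered[idx] * margin


def clip_fraction(logged_norms, max_norm):
    """Fraction of steps whose gradient norm would have been clipped."""
    return sum(n > max_norm for n in logged_norms) / len(logged_norms)


# Hypothetical norms from 20 early steps: mostly near 0.5, plus one spike.
norms = [0.4, 0.5, 0.6, 0.5, 0.45, 0.55, 0.5, 0.48, 0.52, 0.5,
         0.47, 0.53, 0.5, 0.51, 0.49, 0.5, 0.6, 0.4, 0.5, 40.0]

c = suggest_max_norm(norms)
rate = clip_fraction(norms, c)

print("suggested max-norm:", c)
print("fraction of steps clipped:", rate)
```

Here only the single spike exceeds the suggested threshold, which is the healthy pattern. If the clipped fraction came out close to 1, the threshold would be masking a problem elsewhere.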
Norm clipping has been standard for recurrent networks since Pascanu et al. (2013), "On the difficulty of training recurrent neural networks", and it's still routine in modern LLM pretraining, where one bad batch in a weeks-long run is too expensive to risk. Many published large-model training configs use a max-norm of 1.0.
What clipping won't do
Clipping only ever shrinks gradients that are too big — it can't inflate ones that are too small. So it does nothing for the vanishing side of the gradient problem. That side needs architectural fixes: gated units like LSTMs and GRUs, residual connections that give gradients a highway back, and careful initialization.
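A quick sketch shows why. The depth, the per-layer factor, and the helper name are hypothetical: a gradient is scaled by 0.5 at each of 50 layers on the way back and then passed through norm clipping.

```python
import math


def clip_by_norm(g, c):
    """Rescale the vector only if its L2 norm exceeds c."""
    norm = math.sqrt(sum(x * x for x in g))
    return list(g) if norm <= c else [x * c / norm for x in g]


# Hypothetical backprop through 50 layers, each scaling the gradient by 0.5.
per_layer_factor = 0.5
depth = 50
tiny = per_layer_factor ** depth

vanished = [tiny, -tiny]
clipped = clip_by_norm(vanished, 1.0)

print("gradient reaching the first layer:", vanished)
print("after clipping:                   ", clipped)
```

The norm is far below the threshold, so the clip is a no-op and the gradient stays just as small as it arrived.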
Exploding gradients → clip. Vanishing gradients → change the architecture. One is a runtime guard; the other is a design problem.