Gradient Clipping · Suman Bhadra Notes

One bad batch can launch your weights into space

Every training step updates the weights by w ← w − η·g, where η is the learning rate and g is the gradient of the loss on the current batch. Most of the time g is modest. But the loss landscape has cliffs, and one batch with an exploding gradient produces an update so large it throws the weights somewhere terrible. The loss spikes, sometimes all the way to inf or NaN, and a run that was going fine for hours dies in a single step.

The cliff: steep loss walls

Some regions of the loss surface are nearly vertical. Land there and the gradient is suddenly hundreds of times bigger than usual.

One step, ruined run: loss → NaN

A single oversized update can undo everything learned so far — or push the weights into a range where the loss overflows.

Who's at risk: RNNs & big transformers

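RNNs and LSTMs backpropagate through time, multiplying the gradient by another Jacobian at every time step, so it can grow exponentially with sequence length. Large transformer pretraining runs hit loss spikes too.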

Gradient clipping is the seatbelt: before each update, measure how big the gradient is — and if it's over a threshold, shrink it.

Watch a cliff wreck training — then watch clipping save it

A parameter slides down a loss surface with a cliff. See the unclipped catapult, the norm-clipped rescue, value vs norm clipping on the same gradient vector, and the two loss curves — with and without the safety rail.
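The cliff scenario can be approximated in a few lines of plain Python. This is only a toy sketch: the 1-D loss, the constants WALL, STEEPNESS and LR, and the helper names grad and train are all made up for illustration. The loss is a gentle bowl with very steep walls on either side, and the parameter starts just past the edge of one wall, as if a single bad batch had landed it there.

```python
import math

WALL = 2.0          # the gentle bowl 0.5*w^2 gets steep walls beyond |w| > WALL
STEEPNESS = 1000.0  # how steep the walls are (illustrative value)
LR = 0.1            # learning rate (illustrative value)

def grad(w):
    """Gradient of 0.5*w^2 + STEEPNESS * max(0, |w| - WALL)^2."""
    g = w
    if w > WALL:
        g += 2.0 * STEEPNESS * (w - WALL)
    elif w < -WALL:
        g += 2.0 * STEEPNESS * (w + WALL)
    return g

def train(w, steps, clip=None):
    """Plain gradient descent; if clip is set, cap |g| at clip (1-D norm clipping)."""
    for _ in range(steps):
        g = grad(w)
        if clip is not None and abs(g) > clip:
            g *= clip / abs(g)  # keep the sign, shrink the magnitude to clip
        w -= LR * g
    return w

start = 2.05  # just past the edge of the wall
w_plain = train(start, steps=200)
w_clipped = train(start, steps=200, clip=1.0)

print("unclipped final w:", w_plain, "finite?", math.isfinite(w_plain))
print("clipped final w:  ", w_clipped, "finite?", math.isfinite(w_clipped))
```

Without clipping, the first step throws the parameter onto the opposite wall, each bounce is larger than the last, and the value eventually overflows. With clipping, the same starting point walks off the wall in small steps and settles near the minimum at 0.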

Clip by norm — the standard

Gather every gradient in the model into one giant vector g and measure its Euclidean (L2) length, the global norm ‖g‖. If it's within the threshold c, do nothing. If it's bigger, rescale the whole thing: g ← g · (c / ‖g‖). The gradient still points in exactly the same direction; it just can't be longer than c. In PyTorch it's one line, called after loss.backward() and before optimizer.step(): torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).
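Here is a minimal sketch of where that call sits in a training step. The model, the data and the name MAX_NORM are placeholders invented for the example (the inputs are scaled up on purpose so the gradient is large); only the clip_grad_norm_ call itself comes from the text above.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins for a real model and batch (assumptions for illustration).
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 4) * 100.0  # deliberately large inputs to provoke a big gradient
y = torch.randn(8, 1)

MAX_NORM = 1.0

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()  # gradients now live in p.grad

# Clip in place. The return value is the global norm measured BEFORE clipping,
# which is convenient for logging.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=MAX_NORM)

# Recompute the global norm by hand to confirm the cap took effect.
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

optimizer.step()  # the update uses the clipped gradients

print("norm before clipping:", float(norm_before))
print("norm after clipping: ", float(norm_after))
```

The ordering is the important part: clipping has to happen after backward() has filled in the gradients and before step() consumes them.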

Measure: global norm ‖g‖

One norm over all parameters' gradients together — not per layer, not per weight.

Compare: ‖g‖ > c?

Within the threshold? The gradient passes through untouched — and most steps do.

Rescale: g · (c / ‖g‖)

Shrink the whole vector down to length c. Direction preserved, length capped.
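The three steps (measure, compare, rescale) can be written out by hand. This is a sketch in plain Python, not how any particular library implements it; the function name clip_by_global_norm and the dictionary of per-parameter gradients are assumptions made for the example.

```python
import math

def clip_by_global_norm(grads, c):
    """Clip a dict of {parameter name: list of gradient values} to global norm c.

    Returns (possibly rescaled gradients, norm measured before clipping).
    """
    # Measure: one norm over every gradient value in every parameter.
    norm = math.sqrt(sum(g * g for values in grads.values() for g in values))

    # Compare: within the threshold, pass the gradients through untouched.
    if norm <= c:
        return grads, norm

    # Rescale: the same factor for every value, so the direction is preserved.
    scale = c / norm
    clipped = {name: [g * scale for g in values] for name, values in grads.items()}
    return clipped, norm

grads = {"layer1.weight": [3.0, 0.0], "layer2.bias": [4.0]}
clipped, norm = clip_by_global_norm(grads, c=1.0)
print("norm before:", norm)
print("clipped:", clipped)
```

Because the norm is taken over both parameters together, the two layers are scaled by the same factor and their relative sizes stay the same.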

The simpler cousin is clip by value: clamp each component of g to [−c, c] independently. Easy to implement, but big components get flattened while small ones don't — the gradient's direction bends. That's why norm clipping is the usual default.

By norm: g · (c / ‖g‖)

Caps the length, keeps the direction. The default in RNN and LLM training.

By value: clamp(gᵢ, −c, c)

Caps each component on its own. Simpler, but it changes where the update points.
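To see the direction change concretely, here is a small sketch that applies both methods to the same two-component gradient and compares each result with the original using cosine similarity (1.0 means the direction is unchanged). The helper names and the example vector are chosen only for illustration.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def clip_by_norm(g, c):
    """Rescale the whole vector so its length is at most c."""
    n = norm(g)
    return list(g) if n <= c else [x * c / n for x in g]

def clip_by_value(g, c):
    """Clamp each component to [-c, c] independently."""
    return [max(-c, min(c, x)) for x in g]

def cosine(a, b):
    """Cosine of the angle between a and b."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

g = [10.0, 1.0]  # one large component, one small
by_norm = clip_by_norm(g, 1.0)
by_value = clip_by_value(g, 1.0)

cos_norm = cosine(g, by_norm)
cos_value = cosine(g, by_value)
print("by norm: ", by_norm, "cosine with g:", cos_norm)
print("by value:", by_value, "cosine with g:", cos_value)
```

Value clipping flattens the large component down to 1.0 while leaving the small one at 1.0, so the update swings toward the diagonal. Norm clipping shrinks both components by the same factor and keeps the original heading.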

Using it well

Pick a threshold: c ≈ 0.5–5

A max-norm of 1.0 is a common default. Log your gradient norms early in training and set c just above the typical range.
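One possible way to turn "just above the typical range" into a number is to take a high percentile of the logged norms and add a little headroom. The sketch below does that; the function name suggest_threshold, the 90th percentile, the 10% headroom and the sample norms are all assumptions for illustration, not recommendations from the source.

```python
import math

def suggest_threshold(norms, percentile=90, headroom=1.1):
    """Return headroom times the given percentile (nearest-rank) of logged norms."""
    ordered = sorted(norms)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return headroom * ordered[rank - 1]

# Hypothetical gradient norms logged over the first few steps, with one spike.
logged_norms = [0.8, 0.9, 1.1, 0.7, 1.0, 0.95, 0.85, 1.05, 0.9, 12.0]

c = suggest_threshold(logged_norms)
print("suggested max-norm:", c)
```

Using a percentile rather than the maximum keeps the rare spike from setting the threshold, since the spike is exactly what the clip is meant to catch.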

It's a safety rail, not a cure

Clipping rescues the occasional bad batch. It won't fix a learning rate that's too high — that needs lowering, not capping.

Watch the clip rate: how often it fires

Occasional clips: healthy. Clipping on nearly every step: your LR, init or data has a problem the clip is merely hiding.
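Tracking the clip rate takes only a counter. Below is a rough sketch: the class ClipRateMonitor, the ALERT_RATE cutoff of 0.5 and the sample norms are hypothetical. In a PyTorch loop, the value fed to update could be the pre-clip norm that clip_grad_norm_ returns.

```python
class ClipRateMonitor:
    """Counts how often the pre-clip gradient norm exceeds the threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.steps = 0
        self.clipped = 0

    def update(self, grad_norm):
        self.steps += 1
        if grad_norm > self.threshold:
            self.clipped += 1

    @property
    def rate(self):
        """Fraction of steps on which clipping fired."""
        return self.clipped / self.steps if self.steps else 0.0

ALERT_RATE = 0.5  # illustrative cutoff for "the clip is firing too often"

monitor = ClipRateMonitor(threshold=1.0)
for grad_norm in [0.6, 0.8, 0.7, 4.2, 0.9, 0.5, 0.7, 1.3, 0.8, 0.6]:
    monitor.update(grad_norm)

needs_attention = monitor.rate > ALERT_RATE
print("clip rate:", monitor.rate, "needs attention:", needs_attention)
```

A low rate means the clip is acting as a rare safety rail. A rate close to 1.0 means the clip is effectively setting the step size on every update, which is a hint to look at the learning rate, initialization or data instead.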

A 2013 trick that never left

Norm clipping has been standard for recurrent networks since Pascanu et al. (2013), "On the difficulty of training recurrent neural networks", and it's still routine in modern LLM pretraining, where one bad batch in a weeks-long run is too expensive to risk. Many published large-model training configs use a max-norm of 1.0.

What clipping won't do

Clipping only ever shrinks gradients that are too big — it can't inflate ones that are too small. So it does nothing for the vanishing side of the gradient problem. That side needs architectural fixes: gated units like LSTMs and GRUs, residual connections that give gradients a highway back, and careful initialization.
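The one-sidedness is visible in the scale factor itself. Norm clipping multiplies the gradient by min(1, c / ‖g‖), which can never exceed 1. A tiny sketch, with the helper name clip_scale and the sample norms chosen for illustration:

```python
def clip_scale(grad_norm, c):
    """Factor that norm clipping multiplies the gradient by."""
    return min(1.0, c / grad_norm) if grad_norm > 0 else 1.0

norms = [1e-8, 1e-3, 0.5, 1.0, 10.0, 1e4]
scales = [clip_scale(n, c=1.0) for n in norms]

for n, s in zip(norms, scales):
    print(f"norm {n:g} -> scale {s:g}")
```

A vanishing gradient with norm 1e-8 gets a scale of exactly 1.0 and passes through as tiny as it arrived. Only the norms above the threshold are changed.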

Rule of thumb

Exploding gradients → clip. Vanishing gradients → change the architecture. One is a runtime guard; the other is a design problem.