L1 & L2 Regularization (Weight Decay)

Big weights make a nervous network

How does an overfit network thread its curve through every noisy training point? With huge weights — wild positive and negative values that bend the function violently between points. L1 and L2 regularization attack that directly: add the size of the weights to the loss, so the network pays for every big weight it keeps.

A model that solves the task with small weights computes a smoother, calmer function, and smoother functions tend to generalize better. The penalty doesn't tell the network what to learn; it just makes flamboyant solutions expensive.

The new objective

minimize: loss(data) + λ · penalty(weights)

The strength λ sets the trade-off: λ = 0 means no regularization; crank it too high and the model underfits because being accurate is no longer worth the weight bill.
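
To see the trade-off in numbers, here is a minimal sketch on a one-weight toy problem. It assumes a data loss of (w − target)² and a squared-weight penalty λ·w², both chosen purely for illustration; the function name `best_weight` is hypothetical. Setting the derivative 2(w − target) + 2λw to zero gives the best weight in closed form.

```python
def best_weight(target, lam):
    """Minimizer of (w - target)**2 + lam * w**2 for a single weight w."""
    # d/dw = 2*(w - target) + 2*lam*w = 0  =>  w = target / (1 + lam)
    return target / (1.0 + lam)


for lam in (0.0, 0.1, 1.0, 10.0):
    w = best_weight(3.0, lam)
    data_loss = (w - 3.0) ** 2  # how much accuracy was given up
    print(f"lam={lam:<5} best w={w:.3f}  data loss={data_loss:.3f}")
```

With lam = 0 the weight sits exactly on the target and the data loss is zero. As lam grows, the weight shrinks toward zero and the data loss climbs, which is the underfitting end of the trade-off.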

Two penalties, two personalities

L2 penalty: λ · Σ w²

Adds 2λw to each weight's gradient — a pull proportional to its size. Big weights get yanked hardest; everything shrinks, nothing quite reaches zero.

L1 penalty: λ · Σ |w|

Adds a constant λ·sign(w) to the gradient — the same steady tug whether a weight is 8.0 or 0.08. Small weights get dragged to exactly zero → sparsity.
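
The two pulls can be compared side by side. This is a small sketch, not a library API: the function names are made up for illustration, and the L1 function returns 0 at w = 0, which is one common subgradient convention.

```python
def l2_penalty_grad(w, lam):
    """Gradient of lam * w**2 with respect to w: proportional to w."""
    return 2.0 * lam * w


def l1_penalty_grad(w, lam):
    """Subgradient of lam * |w|: constant size, sign of w (0 at w == 0)."""
    sign = (w > 0) - (w < 0)
    return lam * sign


lam = 0.1
for w in (8.0, 0.08, -0.08):
    print(f"w={w:+.2f}  L2 pull={l2_penalty_grad(w, lam):+.4f}  "
          f"L1 pull={l1_penalty_grad(w, lam):+.4f}")
```

The L2 pull on 8.0 is a hundred times the pull on 0.08, while the L1 pull has the same magnitude for both.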

The fingerprints: dense vs sparse

L2 leaves many small weights; L1 leaves few surviving weights with the rest dead at zero. Same goal, very different-looking solutions.

Watch the penalties tame the weights

Animation: an overfit curve and its weight bars, then what L2 and L1 each do to them, and finally the weight-decay update w ← w·(1 − ηλ) ticking step by step.

L2 in deep learning = "weight decay"

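Plug the L2 penalty into the plain SGD update and something neat falls out (writing the penalty as (λ/2) · Σ w², so its gradient is λw and the factor of 2 from above is absorbed into λ):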

The decay step

w ← w · (1 − ηλ) − η · ∇loss

Before the usual gradient step, every weight gets multiplied by a number just below 1. Left alone, weights exponentially decay toward zero — they only stay large if the data keeps insisting they should. That's why deep learning folks say "weight decay" instead of "L2".
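
A quick numerical check of that equivalence, as a sketch for a single scalar weight. It assumes the penalty is written as (λ/2)·w², so its gradient is λw; with λ·w² the factor would be (1 − 2ηλ) instead. The function names are hypothetical.

```python
def sgd_step_l2_in_loss(w, grad, lr, lam):
    """Plain SGD on loss + (lam / 2) * w**2: the penalty adds lam * w to the gradient."""
    return w - lr * (grad + lam * w)


def sgd_step_weight_decay(w, grad, lr, lam):
    """The same update written as a decay step plus the usual gradient step."""
    return w * (1.0 - lr * lam) - lr * grad


lr, lam = 0.1, 0.5
a = sgd_step_l2_in_loss(2.0, 0.3, lr, lam)
b = sgd_step_weight_decay(2.0, 0.3, lr, lam)
print("same update:", abs(a - b) < 1e-12)

# With no data gradient, the weight decays geometrically: w_t = w_0 * (1 - lr*lam)**t
w = 2.0
for _ in range(50):
    w = sgd_step_weight_decay(w, 0.0, lr, lam)
print("after 50 steps:", w, "closed form:", 2.0 * (1.0 - lr * lam) ** 50)
```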

Why AdamW has a W

With Adam (see optimizers), putting L2 in the loss is not the same thing: the penalty's gradient is added to the data gradient and then divided by Adam's per-weight running estimate of gradient magnitude, so weights with a history of large gradients are regularized far less than weights with small ones. AdamW decouples the decay: it applies w·(1 − ηλ) directly, outside the adaptive machinery (Loshchilov & Hutter, "Decoupled Weight Decay Regularization", 2017 / ICLR 2019). The decoupled decay is the W.
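
The difference shows up even on one scalar weight. The sketch below is a toy, not a reference implementation: it hand-rolls the standard Adam update with a constant data gradient (an assumption made so the effect is easy to isolate), takes the L2 penalty as (λ/2)·w², and measures how much extra shrinkage λ causes in each variant. `run_adam` and its parameters are hypothetical names.

```python
import math


def run_adam(w, data_grad, lr, lam, steps, decoupled,
             beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam on one scalar weight with a constant data gradient (toy setup)."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = data_grad
        if decoupled:
            w *= 1.0 - lr * lam      # AdamW: decay the weight directly
        else:
            g += lam * w             # Adam + L2: penalty gradient joins the data gradient
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)   # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)   # bias-corrected second moment
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


setup = dict(w=1.0, data_grad=10.0, lr=0.01, steps=10)
no_reg = run_adam(lam=0.0, decoupled=False, **setup)
adam_l2 = run_adam(lam=0.1, decoupled=False, **setup)
adamw = run_adam(lam=0.1, decoupled=True, **setup)

l2_effect = no_reg - adam_l2      # extra shrinkage caused by L2-in-the-loss
adamw_effect = no_reg - adamw     # extra shrinkage caused by decoupled decay
print(f"Adam + L2 effect: {l2_effect:.6f}")
print(f"AdamW effect:     {adamw_effect:.6f}")
```

In this setup the large data gradient dominates the normalization, so the L2 term changes the Adam step almost not at all, while the decoupled decay still shrinks the weight by roughly lr · lam · w per step.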

And L1? Sparsity

Because L1's pull is constant all the way down, weights don't just get small: they get driven to zero and switch off (exactly zero when the update clips at zero, as lasso solvers do; a plain gradient step tends to leave them jittering around zero). That's a feature-selection flavor: the surviving weights tell you what mattered. In classic ML this is the whole point of lasso; in deep nets L1 is rarely used alone, since architectures already control capacity and scattered zeros don't speed anything up without sparse kernels or structured pruning.
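
A small sketch of the dense-versus-sparse outcome, applying only the penalty step (no data gradient) to a handful of weights. One assumption to flag: the L1 step here is clipped so it never crosses zero (a soft-thresholding or proximal step), which is what produces exact zeros; a raw subgradient step would overshoot and hover around zero instead. Function names and the starting weights are illustrative.

```python
def l2_shrink(w, lr, lam):
    """One penalty-only step for lam * w**2: multiply by (1 - 2*lr*lam)."""
    return w * (1.0 - 2.0 * lr * lam)


def l1_shrink(w, lr, lam):
    """One penalty-only step for lam * |w|, clipped so it never crosses zero."""
    step = lr * lam
    if w > step:
        return w - step
    if w < -step:
        return w + step
    return 0.0  # within one step of zero: snap to exactly zero


lr, lam, steps = 0.1, 0.1, 20
start = [8.0, -0.5, 0.08, -0.02]
l2_weights, l1_weights = list(start), list(start)
for _ in range(steps):
    l2_weights = [l2_shrink(w, lr, lam) for w in l2_weights]
    l1_weights = [l1_shrink(w, lr, lam) for w in l1_weights]

print("L2:", [round(w, 4) for w in l2_weights])
print("L1:", [round(w, 4) for w in l1_weights])
print("exact zeros:", l2_weights.count(0.0), "for L2 vs", l1_weights.count(0.0), "for L1")
```

L2 scales every weight by the same factor, so the small ones get smaller but never vanish. L1 subtracts the same amount from each, so the two small weights run out of magnitude and stay at zero.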

The classic-ML view

There's a famous geometric picture for why L1 zeros things out — its diamond-shaped constraint has sharp corners on the axes, while L2's circle has none. That story (and when to choose ridge vs lasso vs elastic net) lives in Regularization: Ridge vs Lasso.

One member of a family

Weight decay shapes the weights; its siblings attack overfitting from other angles, and they stack nicely.

Dropout: noise on neurons

Randomly switch off neurons during training so no feature can depend on another always being there.
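
As a rough sketch of the mechanism on a plain list of activations: each value is zeroed with probability `p_drop` during training and left alone at inference. The rescaling of survivors by 1/(1 − p_drop), known as inverted dropout, is an addition not described above; it keeps the expected activation the same in both modes. The function signature is hypothetical and assumes p_drop < 1.

```python
import random


def dropout(activations, p_drop, training, rng):
    """Inverted dropout on a list of activations (assumes 0 <= p_drop < 1)."""
    if not training or p_drop == 0.0:
        return list(activations)  # inference: use every neuron as-is
    keep = 1.0 - p_drop
    # Zero each activation with probability p_drop; scale survivors by 1/keep
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
acts = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(acts, 0.5, True, rng)
eval_out = dropout(acts, 0.5, False, rng)
print("training: ", train_out)
print("inference:", eval_out)
```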

Early stopping: quit while ahead

Stop training when validation loss turns up — regularization by schedule, no penalty term needed.
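
One common way to decide that validation loss has "turned up" is a patience counter: stop once the loss has failed to improve for a fixed number of epochs in a row. The sketch below assumes that rule; `early_stop`, `patience`, and the loss history are illustrative, and a real loop would also restore the weights saved at the best epoch.

```python
def early_stop(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = stop_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        stop_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, stop_epoch


history = [1.0, 0.8, 0.7, 0.72, 0.75, 0.74, 0.9, 0.6]
best_epoch, stop_epoch = early_stop(history, patience=3)
print("best epoch:", best_epoch, "stopped at epoch:", stop_epoch)
```

Note the cost of the rule: training halts before the final entry in `history`, so a later improvement is never seen. A larger patience trades compute for a lower chance of stopping too soon.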

Data augmentation: more (fake) data

Flips, crops, noise: make the training set effectively bigger, so memorizing individual noisy examples gets harder in the first place.
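
A toy sketch of those three transforms on an image stored as a nested list. Everything here is an illustrative assumption: the 50% flip chance, the crop size, and the noise level are arbitrary, and real pipelines would use an image library rather than lists.

```python
import random


def augment(image, rng, crop=2, noise_std=0.05):
    """Return a randomly flipped, cropped, and noised copy of a 2-D list image."""
    rows = [list(r) for r in image]          # copy so the original is untouched
    if rng.random() < 0.5:
        rows = [r[::-1] for r in rows]       # horizontal flip
    top = rng.randrange(len(rows) - crop + 1)
    left = rng.randrange(len(rows[0]) - crop + 1)
    patch = [r[left:left + crop] for r in rows[top:top + crop]]  # random crop
    return [[p + rng.gauss(0.0, noise_std) for p in r] for r in patch]  # pixel noise


rng = random.Random(0)
image = [[0.0, 0.1, 0.2],
         [0.3, 0.4, 0.5],
         [0.6, 0.7, 0.8]]
variants = [augment(image, rng) for _ in range(4)]
for v in variants:
    print([[round(p, 2) for p in row] for row in v])
```

Each call yields a slightly different 2×2 view of the same source image, so the network rarely sees the exact same input twice.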

Sensible defaults

AdamW with weight_decay ≈ 0.01 is the boring, reliable starting point (transformer recipes often go up to 0.1). Combine it with early stopping and whatever augmentation your data allows — they're complementary, not competing.