Regularization: Ridge vs Lasso · Suman Bhadra Notes

Punish complexity to fight overfitting

An overfit model often has wild, enormous coefficients that whip its curve through every noisy point. Regularization adds a penalty on the size of the weights, nudging the model toward a smoother, simpler fit.

The new objective

minimize: prediction error + λ · (penalty on weights)

The strength λ tunes the trade-off: λ = 0 recovers plain, unregularized regression; a large λ forces the weights toward zero, even at the cost of a worse fit to the training data.
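A minimal sketch of this objective for a single weight, assuming squared error as the prediction error and the squared weight as the penalty. The names `objective` and `best_weight` and the toy data are made up for illustration.

```python
def objective(w, xs, ys, lam):
    """Prediction error (sum of squared residuals) plus lam times the penalty w**2."""
    error = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return error + lam * w ** 2


def best_weight(xs, ys, lam):
    """Closed-form minimizer of objective() when there is one weight and no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


# Toy data lying roughly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for lam in (0.0, 1.0, 10.0, 100.0):
    w = best_weight(xs, ys, lam)
    print(f"lam={lam:>5}: w={w:.3f}  objective={objective(w, xs, ys, lam):.3f}")
```

With lam = 0 the weight is the ordinary least-squares slope; each larger lam adds to the denominator and pulls the weight closer to zero.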

Watch the penalty tame an overfit curve

As λ grows, a wiggly, overfit curve calms down and its coefficients shrink. Ridge and Lasso shrink those coefficients in different ways.
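The same effect can be reproduced numerically. The sketch below fits a degree-6 polynomial to ten noisy points with an L2 penalty and reports the total squared size of the coefficients for several values of λ. The data, the degree, the λ values, and the helper names `solve` and `ridge_poly_fit` are all assumptions chosen for illustration, and leaving the intercept unpenalized is a common convention rather than something stated above.

```python
import math
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def ridge_poly_fit(xs, ys, degree, lam):
    """Minimize squared error + lam * sum(w_i**2 for i >= 1) over polynomial coefficients."""
    X = [[x ** p for p in range(degree + 1)] for x in xs]
    k = degree + 1
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    b = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(k)]
    for i in range(1, k):  # index 0 is the intercept, left unpenalized
        A[i][i] += lam
    return solve(A, b)


random.seed(0)
xs = [-1 + 2 * i / 9 for i in range(10)]
ys = [math.sin(3 * x) + random.gauss(0, 0.2) for x in xs]

lams = (0.0, 0.001, 0.1, 10.0)
sizes = []
for lam in lams:
    w = ridge_poly_fit(xs, ys, degree=6, lam=lam)
    sizes.append(sum(wi ** 2 for wi in w[1:]))
    print(f"lam={lam:<6} sum of squared coefficients = {sizes[-1]:.4f}")
```

The printed coefficient size falls as λ rises, which is the numerical counterpart of the curve smoothing out.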

L2 vs L1 — the key difference

Ridge (L2) penalty = Σ wᵢ²

Penalizes squared weights. Shrinks all coefficients smoothly toward zero, but rarely to exactly zero. Keeps every feature.

Lasso (L1) penalty = Σ |wᵢ|

Penalizes absolute weights. Drives some coefficients to exactly zero — automatic feature selection.

Elastic Net (L1 + L2) penalty = α · Σ |wᵢ| + (1 − α) · Σ wᵢ², with mixing weight 0 ≤ α ≤ 1

A blend of both — sparsity from L1, stability from L2. Good with many correlated features.
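The three penalties are easy to compare side by side on one weight vector. This is only a sketch: the function names are made up, and the mixing weight `alpha` is one common way to blend L1 and L2, assumed here rather than taken from the notes.

```python
def l2_penalty(w):
    """Ridge penalty: sum of squared weights."""
    return sum(wi * wi for wi in w)


def l1_penalty(w):
    """Lasso penalty: sum of absolute weights."""
    return sum(abs(wi) for wi in w)


def elastic_net_penalty(w, alpha):
    """Blend of both: alpha = 1 is pure L1, alpha = 0 is pure L2 (assumed parameterization)."""
    return alpha * l1_penalty(w) + (1 - alpha) * l2_penalty(w)


w = [3.0, -4.0, 0.0]
print(l2_penalty(w))                # 25.0
print(l1_penalty(w))                # 7.0
print(elastic_net_penalty(w, 0.5))  # 16.0
```

Note how the squared penalty is dominated by the largest weights, while the absolute penalty charges every unit of weight the same price regardless of how large the weight already is.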

Why L1 zeros things out

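Picture the set of weights allowed under a fixed penalty budget. For L1 that set is a diamond with sharp corners on the axes, so the best-fitting point inside it often sits at a corner, where one or more coefficients are exactly zero. For L2 the set is a circle with no corners, so nothing favors the axes: weights shrink but stay nonzero.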
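The same behavior shows up algebraically in the simplest case: one weight w, with z as its unregularized best value. Minimizing (w − z)² + λ·w² gives w = z / (1 + λ), while minimizing (w − z)² + λ·|w| gives a "soft threshold" that snaps to zero whenever |z| ≤ λ/2. The sketch below assumes this one-dimensional setup; the function names are made up for illustration.

```python
def ridge_shrink(z, lam):
    """Minimizer of (w - z)**2 + lam * w**2: scales z down, never reaching zero for z != 0."""
    return z / (1.0 + lam)


def lasso_shrink(z, lam):
    """Minimizer of (w - z)**2 + lam * abs(w): subtracts lam/2, clipping at zero."""
    t = lam / 2.0
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


for z in (0.3, -0.8, 2.0):
    print(f"z={z:>5}: ridge -> {ridge_shrink(z, 1.0):.3f}, lasso -> {lasso_shrink(z, 1.0):.3f}")
```

With λ = 1, Ridge halves every value, whereas Lasso sets the small one (0.3) to exactly zero and moves the larger ones toward zero by a fixed 0.5.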

Choosing

Reach for Lasso when

You suspect many features are useless
You want a sparse, interpretable model
Automatic feature selection is valuable

Reach for Ridge when

You believe most features matter a little
Features are correlated (Lasso tends to keep one feature from a correlated group and drop the rest, more or less arbitrarily)
You just want to stabilize a model
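The correlated-features point can be seen in an extreme toy case: two identical feature columns. The sketch below assumes a squared-error loss, solves the Lasso problem with a basic cyclic coordinate descent, and solves the two-feature Ridge problem directly. All names, the data, and λ = 30 are made up for illustration.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly 0.0 when |rho| <= t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Minimize sum of squared errors + lam * sum(|w_j|), one coordinate at a time."""
    n_features = len(X[0])
    w = [0.0] * n_features
    for _ in range(sweeps):
        for j in range(n_features):
            # Residual with feature j's own contribution left out.
            partial = [yi - sum(w[k] * row[k] for k in range(n_features) if k != j)
                       for row, yi in zip(X, y)]
            rho = sum(row[j] * r for row, r in zip(X, partial))
            z = sum(row[j] ** 2 for row in X)
            w[j] = soft_threshold(rho, lam / 2.0) / z
    return w


def ridge_two_features(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for exactly two features via Cramer's rule."""
    a00 = sum(r[0] * r[0] for r in X) + lam
    a01 = sum(r[0] * r[1] for r in X)
    a11 = sum(r[1] * r[1] for r in X) + lam
    b0 = sum(r[0] * yi for r, yi in zip(X, y))
    b1 = sum(r[1] * yi for r, yi in zip(X, y))
    det = a00 * a11 - a01 * a01
    return [(b0 * a11 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det]


# Two identical columns: perfectly correlated features.
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
y = [2.0, 4.0, 6.0, 8.0]

lasso_w = lasso_coordinate_descent(X, y, lam=30.0)
ridge_w = ridge_two_features(X, y, lam=30.0)
print("lasso:", lasso_w)  # [1.5, 0.0]
print("ridge:", ridge_w)  # both weights equal, about 0.667 each
```

Lasso gives all the weight to whichever copy is updated first and zeroes the other, so the choice depends on column order; Ridge splits the weight evenly between the two copies.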

Tip

Always scale features before regularizing: the penalty treats all weights equally, so the features themselves must be on the same footing (for example, standardized to zero mean and unit variance). Tune λ with cross-validation.
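Both tips can be sketched together for a single feature: standardize using statistics from the training fold only, fit a one-weight Ridge model, and pick the λ with the lowest held-out error. Everything here (the function names, the synthetic data, the candidate λ values, and the 5-fold split) is an assumption for illustration, not a recommended recipe.

```python
import random
import statistics


def ridge_fit_1d(xs, ys, lam):
    """Fit y ~ intercept + w * z on standardized x, penalizing lam * w**2; return a predictor."""
    mean, std = statistics.fmean(xs), statistics.pstdev(xs)
    zs = [(x - mean) / std for x in xs]
    y_bar = statistics.fmean(ys)
    w = sum(z * (y - y_bar) for z, y in zip(zs, ys)) / (sum(z * z for z in zs) + lam)
    # New inputs are scaled with the training mean and std, not their own.
    return lambda x: y_bar + w * (x - mean) / std


def cv_error(xs, ys, lam, k=5):
    """Mean squared error on held-out points, averaged over k interleaved folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        predict = ridge_fit_1d(train_x, train_y, lam)
        total += sum((ys[i] - predict(xs[i])) ** 2 for i in held_out)
    return total / n


# Synthetic data: a feature measured in large units, with a weak linear signal.
random.seed(1)
xs = [random.uniform(0, 1000) for _ in range(40)]
ys = [0.003 * x + random.gauss(0, 0.5) for x in xs]

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam = min(lambdas, key=lambda lam: cv_error(xs, ys, lam))
print("chosen lambda:", best_lam)
```

Because the feature is standardized inside the fit, rescaling it (say, multiplying every x by 1000) leaves the predictions unchanged, so the penalty no longer depends on the units the feature happened to be recorded in.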