Regularization: Ridge vs Lasso
Punish complexity to fight overfitting
An overfit model often has wild, enormous coefficients that whip its curve through every noisy point. Regularization adds a penalty on the size of the weights, nudging the model toward a smoother, simpler fit.
minimize: prediction error + λ · (penalty on weights)
The strength λ tunes the trade-off: λ = 0 is plain, unpenalized regression, while a large λ forces the weights to be small.
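To make the role of λ concrete, here is a minimal sketch for the simplest possible case: one feature, no intercept, squared error, and a squared-weight penalty. Under those assumptions the penalized slope has a closed form. The function name `ridge_slope` and the toy data are illustrative, not part of the original text.

```python
def ridge_slope(xs, ys, lam):
    """Slope w that minimizes sum((y - w*x)**2) + lam * w**2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Setting the derivative to zero gives w = sxy / (sxx + lam).
    return sxy / (sxx + lam)


# Toy data, roughly y = 2x plus a little noise (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

slopes = {lam: ridge_slope(xs, ys, lam) for lam in (0.0, 1.0, 10.0, 100.0)}
for lam, w in slopes.items():
    print(f"lambda={lam:>6}: w={w:.3f}")
```

With λ = 0 the result is the ordinary least-squares slope, and each larger λ pulls the slope closer to zero without ever reaching it.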
Watch the penalty tame an overfit curve
A wiggly, overfit curve calms down as λ grows, and the coefficient bars below shrink. The final step contrasts how Ridge and Lasso shrink differently.
L2 vs L1 — the key difference
Ridge (L2): Penalizes squared weights. Shrinks all coefficients smoothly toward zero, but rarely to exactly zero. Keeps every feature.
Lasso (L1): Penalizes absolute weights. Drives some coefficients to exactly zero, which acts as automatic feature selection.
Elastic Net: A blend of both, with sparsity from L1 and stability from L2. Good with many correlated features.
The L1 penalty's diamond shape has sharp corners on the axes, so the optimum tends to land on an axis — where a coefficient is exactly zero. L2's round shape has no corners, so weights shrink but survive.
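The same contrast shows up algebraically for a single coefficient. As a sketch, assume the unpenalized estimate of a weight is z and the features are orthonormal, so each weight can be penalized independently. Minimizing (w - z)² + λ·w² then scales z down, while minimizing (w - z)² + λ·|w| subtracts a fixed amount and clips at zero (soft-thresholding). The function names and example values below are illustrative.

```python
def ridge_shrink(z, lam):
    """Minimizer of (w - z)**2 + lam * w**2: proportional shrinkage."""
    return z / (1.0 + lam)


def lasso_shrink(z, lam):
    """Minimizer of (w - z)**2 + lam * abs(w): soft-thresholding at lam / 2."""
    magnitude = abs(z) - lam / 2.0
    if magnitude <= 0.0:
        return 0.0  # small coefficients are set to exactly zero
    return magnitude if z > 0 else -magnitude


unpenalized = [3.0, -1.5, 0.4, -0.2]  # hypothetical unpenalized weights
lam = 1.0

ridge = [ridge_shrink(z, lam) for z in unpenalized]
lasso = [lasso_shrink(z, lam) for z in unpenalized]

print("ridge:", ridge)  # [1.5, -0.75, 0.2, -0.1]
print("lasso:", lasso)  # [2.5, -1.0, 0.0, 0.0]
```

Ridge halves every weight and leaves all four nonzero; Lasso removes the two smallest entirely and shifts the larger ones by a constant.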
Choosing
Choose Lasso when:
- You suspect many features are useless
- You want a sparse, interpretable model
- Automatic feature selection is valuable
Choose Ridge when:
- You believe most features matter a little
- Features are correlated (Lasso tends to keep one of a correlated group and drop the rest, somewhat arbitrarily)
- You just want to stabilize a model
Always scale features before regularizing: the penalty treats all weights equally, so the features must be on the same footing (for example, standardized to zero mean and unit variance). Tune λ with cross-validation.
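A rough sketch of that workflow, using NumPy and a closed-form Ridge fit: each fold is standardized with statistics from its own training part, and the λ with the lowest average validation error is kept. The helper names, the synthetic data, the candidate λ grid, and the choice of 5 folds are all assumptions made for illustration.

```python
import numpy as np


def standardize(X_train, X_test):
    """Scale both sets using the mean and std of the training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    return (X_train - mean) / std, (X_test - mean) / std


def ridge_fit(X, y, lam):
    """Closed-form Ridge weights for a centered target: (X'X + lam*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


def cv_error(X, y, lam, k=5):
    """Mean squared validation error of Ridge over k contiguous folds."""
    indices = np.arange(len(y))
    errors = []
    for fold in np.array_split(indices, k):
        train = np.setdiff1d(indices, fold)
        X_tr, X_va = standardize(X[train], X[fold])
        y_mean = y[train].mean()  # the intercept is left unpenalized
        w = ridge_fit(X_tr, y[train] - y_mean, lam)
        pred = X_va @ w + y_mean
        errors.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errors))


# Synthetic data with features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 1.0])
y = X[:, 0] + 0.05 * X[:, 2] + rng.normal(scale=0.5, size=60)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_error(X, y, lam) for lam in lambdas}
best_lam = min(scores, key=scores.get)
print("CV error per lambda:", scores)
print("selected lambda:", best_lam)
```

The rows here are already in random order, so contiguous folds are acceptable; real data would usually be shuffled first. In practice, scikit-learn's `StandardScaler` combined with `RidgeCV` or `LassoCV` covers the same steps with less code.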