Gradient Boosting

Tags: ML, ensemble, boosting, residuals, supervised

Learn from what's left over

Gradient boosting builds a model as a sum of small trees, added one at a time. Each new tree is trained to predict the error the ensemble still makes.

Start with a rough guess (often just the average). Look at the residuals — how far off you are. Fit a small tree to those residuals, add a fraction of it to your prediction, and repeat. Each round chips away at the remaining error.
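The loop above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it assumes 1-D inputs with at least two distinct values, squared-error loss, and depth-1 trees (stumps). The names `fit_stump`, `boost` and `mse` are made up for this example.

```python
import math
from statistics import mean

def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (a single split) by least squares."""
    best = None
    order = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct x values.
    for lo, hi in zip(order, order[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = mean(left), mean(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv

def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Return a predict(x) function built from n_rounds boosted stumps."""
    base = mean(ys)                          # rough first guess: the average
    preds = [base] * len(xs)
    trees = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # how far off we are
        tree = fit_stump(xs, residuals)                  # small tree on residuals
        trees.append(tree)
        # Add only a fraction of the new tree to the running prediction.
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)
    return predict

def mse(model, xs, ys):
    return mean((y - model(x)) ** 2 for x, y in zip(xs, ys))

# Toy wavy target, chosen only for illustration.
xs = [i / 2 for i in range(20)]
ys = [math.sin(x) for x in xs]
flat = boost(xs, ys, n_rounds=0)             # just the average
model = boost(xs, ys, n_rounds=100, learning_rate=0.3)
print(mse(flat, xs, ys), mse(model, xs, ys))
```

With zero rounds the model is the flat average; every added stump can only lower the training error under these assumptions, which is the "chipping away" described above.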

Why "gradient"?

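For squared-error loss, written as ½(y − F(x))², the residual y − F(x) is exactly the negative gradient of the loss with respect to the prediction F(x). So fitting the residuals is really doing gradient descent in function space, one tree per step. For other differentiable losses, each tree is fit to the negative gradient (the "pseudo-residuals") instead.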
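Written out, the argument is a one-line derivative. The sketch below assumes the common convention of a factor ½ in the squared-error loss (without it the gradient is 2(y − F), which only rescales the step). The symbols h_m for the m-th tree, ν for the learning rate and M for the number of rounds are notation chosen for this illustration.

```latex
L(y, F) = \tfrac{1}{2}\,(y - F)^2
\quad\Longrightarrow\quad
-\frac{\partial L}{\partial F} = y - F

r_i^{(m)} = -\left.\frac{\partial L(y_i, F)}{\partial F}\right|_{F = F_{m-1}(x_i)}
          = y_i - F_{m-1}(x_i)

F_m(x) = F_{m-1}(x) + \nu\, h_m(x), \qquad m = 1, \dots, M
```

Here h_m is the tree fit to the pairs (x_i, r_i^{(m)}), so each round takes one step of size ν along an approximation of the negative gradient.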

Watch the residuals shrink

A wavy target. The prediction starts flat, then each added tree bends it closer to the data and the red residual bars get shorter round by round.

The knobs that matter

Learning rate (shrinkage)

Add only a fraction of each tree. Small rate + many trees usually generalises best.

Number of trees (rounds)

More rounds fit the training data better, but too many overfit. Tune with early stopping: track the error on held-out data and stop once it no longer improves.
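One common way to implement early stopping is a patience rule: keep the round with the lowest validation error seen so far and stop once a set number of rounds pass without improvement. The sketch below assumes you already have one validation error per round; the function name, the `patience` parameter and the error values are hypothetical and exist only for this illustration.

```python
def early_stopping_round(val_errors, patience=5):
    """Return (best_round, best_error); rounds are counted from 1.

    Scanning stops once `patience` rounds have passed since the best round.
    """
    best_round, best_error = 0, float("inf")
    for round_idx, err in enumerate(val_errors, start=1):
        if err < best_error:
            best_round, best_error = round_idx, err
        elif round_idx - best_round >= patience:
            break  # no improvement for `patience` rounds: stop here
    return best_round, best_error

# Made-up validation errors, one per boosting round.
val_errors = [0.90, 0.60, 0.45, 0.40, 0.41, 0.43, 0.42, 0.44, 0.47, 0.30]
print(early_stopping_round(val_errors, patience=3))   # → (4, 0.4)
```

Note the trade-off in this toy curve: with a patience of 3 the scan stops at round 7 and never sees the later dip at round 10. A larger patience costs more rounds but is less likely to stop on a temporary plateau.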

Tree depth (weak learners)

Shallow trees (depth 3–6) keep each learner weak so boosting can do its work.

Subsampling (stochastic)

Train each tree on a random subset of rows/features to add regularisation.
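The row-sampling step can be sketched as follows. This shows only how a per-round subset might be drawn, assuming sampling without replacement; the helper name `subsample_rows` and its parameters are invented for this example. In a full boosting loop the tree would be fit on the sampled rows only, while the predictions and residuals are still updated for every row.

```python
import random

def subsample_rows(n_rows, fraction, rng):
    """Pick a random subset of row indices (no repeats) for one boosting round."""
    k = max(1, round(fraction * n_rows))
    return sorted(rng.sample(range(n_rows), k))

rng = random.Random(0)          # fixed seed so the run is repeatable
for round_idx in range(3):
    rows = subsample_rows(n_rows=10, fraction=0.5, rng=rng)
    print(round_idx, rows)      # a fresh subset of 5 rows each round
```

Because each tree sees a different slice of the data, no single noisy point influences every round, which is where the regularising effect comes from.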

This playground runs real boosting: every round fits an actual depth-1 stump to the current residuals of ten noisy training points (dots), while five held-out test points (diamonds) keep score. Turn the two knobs, the learning rate ν and the number of rounds M, and watch them interact, especially what happens to the test error when ν = 1.0 meets many rounds.

Set ν = 1.0 and slide M up: train error crashes to ~0 while test error bottoms out early and climbs — the model is memorising noise. Now set ν = 0.1: progress is slower but the test error stays low far longer. That is why "small rate + many trees" is the default advice.
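A rough offline version of this experiment is sketched below. It is not the playground's own code: the sine target, the noise level, the seed and all function names are assumptions made for this illustration, so the exact numbers will differ from the playground and from one seed to the next.

```python
import math
import random
from statistics import mean

def fit_stump(xs, residuals):
    """Least-squares depth-1 tree on 1-D inputs (needs two distinct x values)."""
    best = None
    order = sorted(set(xs))
    for lo, hi in zip(order, order[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = mean(left), mean(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv

def error_curves(train, test, n_rounds, learning_rate):
    """Return per-round train and test MSE lists for boosted stumps."""
    xs, ys = zip(*train)
    tx, ty = zip(*test)
    base = mean(ys)
    p_train, p_test = [base] * len(xs), [base] * len(tx)
    train_curve, test_curve = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, p_train)]
        stump = fit_stump(xs, residuals)
        p_train = [p + learning_rate * stump(x) for p, x in zip(p_train, xs)]
        p_test = [p + learning_rate * stump(x) for p, x in zip(p_test, tx)]
        train_curve.append(mean((y - p) ** 2 for y, p in zip(ys, p_train)))
        test_curve.append(mean((y - p) ** 2 for y, p in zip(ty, p_test)))
    return train_curve, test_curve

rng = random.Random(1)

def sample(n):
    """Draw n noisy points from an assumed wavy target, sin(x) plus noise."""
    points = []
    for _ in range(n):
        x = rng.uniform(0, 6)
        points.append((x, math.sin(x) + rng.gauss(0, 0.3)))
    return points

train, test = sample(10), sample(5)
results = {}
for nu in (1.0, 0.1):
    tr, te = error_curves(train, test, n_rounds=200, learning_rate=nu)
    results[nu] = (tr, te)
    best_m = min(range(len(te)), key=te.__getitem__) + 1
    print(f"nu={nu}: train MSE at M=200 is {tr[-1]:.4f}, "
          f"test MSE at M=200 is {te[-1]:.4f}, "
          f"lowest test MSE {te[best_m - 1]:.4f} at M={best_m}")
```

Comparing the round with the lowest test error against round 200 for each ν gives a numerical view of the same interaction the sliders show.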

Boosting vs bagging, and the family

Boosting (sequential)
  • Trees built one after another, each fixing errors
  • Reduces bias — builds up complexity
  • Powerful but can overfit noise
Bagging (parallel)
  • Trees built independently, then averaged
  • Reduces variance — see Random Forest
  • More robust to noise, harder to overfit

In practice

AdaBoost was the first big boosting method; gradient boosting generalised it to any differentiable loss. XGBoost, LightGBM and CatBoost are fast, regularised implementations that dominate tabular competitions.