Gradient Boosting
Learn from what's left over
Gradient boosting builds a model as a sum of small trees, added one at a time — where each new tree is trained to predict the error the ensemble still makes.
Start with a rough guess (often just the average). Look at the residuals — how far off you are. Fit a small tree to those residuals, add a fraction of it to your prediction, and repeat. Each round chips away at the remaining error.
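The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes a single numeric feature, squared-error loss, and depth-1 trees (stumps). The names `fit_stump`, `boost`, `predict`, and `mse` are made up for this example.

```python
import math

def fit_stump(xs, targets):
    """Least-squares depth-1 tree on one feature.

    Returns (threshold, left_value, right_value).
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue  # no threshold can separate equal x values
        left = [targets[i] for i in order[:k]]
        right = [targets[i] for i in order[k:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, (lo + hi) / 2, left_mean, right_mean)
    if best is None:  # every x identical: predict the mean on both sides
        mean = sum(targets) / len(targets)
        return xs[0], mean, mean
    return best[1:]

def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def boost(xs, ys, n_rounds=50, rate=0.1):
    base = sum(ys) / len(ys)              # rough first guess: the average
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # how far off we are
        stump = fit_stump(xs, residuals)                 # small tree on the residuals
        stumps.append(stump)
        # add only a fraction (rate) of the new tree to the running prediction
        preds = [p + rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return {"base": base, "rate": rate, "stumps": stumps}

def predict(model, x):
    return model["base"] + model["rate"] * sum(
        stump_predict(s, x) for s in model["stumps"])

def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Toy wavy target (illustrative data, not taken from the page)
xs = [i * 0.5 for i in range(20)]
ys = [math.sin(x) for x in xs]
baseline = boost(xs, ys, n_rounds=0)           # just the average
model = boost(xs, ys, n_rounds=100, rate=0.3)
print("MSE with the average only:", mse(baseline, xs, ys))
print("MSE after 100 rounds:     ", mse(model, xs, ys))
```

With `n_rounds=0` the model is just the average. Every extra round adds one more stump, scaled by `rate`.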
For squared-error loss written as ½(y − F)², the residual y − F is exactly the negative gradient of the loss with respect to the current prediction F. So fitting the residuals is really gradient descent in function space, one tree per step.
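The link can be written out explicitly. The derivation assumes the loss carries the conventional factor of one half; without it the gradient is twice the residual, which only rescales the step. The symbols are introduced here for illustration: F is the current ensemble prediction, r_i the residual on point i, h_m the tree fitted in round m, and ν the learning rate.

```latex
\begin{aligned}
L\bigl(y_i, F(x_i)\bigr) &= \tfrac{1}{2}\bigl(y_i - F(x_i)\bigr)^2 \\
-\frac{\partial L}{\partial F(x_i)} &= y_i - F(x_i) = r_i \\
F_m(x) &= F_{m-1}(x) + \nu\, h_m(x),
\qquad h_m \text{ fitted to } r_i = y_i - F_{m-1}(x_i)
\end{aligned}
```

The last line has the same shape as a gradient-descent update, with the fitted tree standing in for the negative gradient and ν for the step size.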
Watch the residuals shrink
The target is a wavy curve. The prediction starts flat; each added tree bends it closer to the data, and the red residual bars get shorter round by round.
The knobs that matter
Learning rate (ν): add only a fraction of each tree to the prediction. A small rate with many trees usually generalises best.
Number of trees (M): more rounds fit the training data better, but too many overfit. Tune with early stopping.
Tree depth: shallow trees (depth 3–6) keep each learner weak so boosting can do its work.
Subsampling: train each tree on a random subset of rows and/or features; the added randomness acts as regularisation.
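As one concrete mapping, these knobs correspond roughly to constructor arguments of scikit-learn's `GradientBoostingRegressor`. The values below are illustrative placeholders, not tuned recommendations, and the names `GBM_PARAMS` and `build_model` are made up for this sketch.

```python
# Illustrative starting values only; tune them on your own data.
GBM_PARAMS = {
    "learning_rate": 0.05,        # fraction of each tree that is added
    "n_estimators": 1000,         # upper bound on the number of rounds
    "max_depth": 3,               # shallow trees keep each learner weak
    "subsample": 0.8,             # fraction of rows sampled for each tree
    "max_features": 0.8,          # fraction of features considered per split
    "validation_fraction": 0.1,   # rows held out internally for early stopping
    "n_iter_no_change": 20,       # stop after 20 rounds with no validation gain
    "random_state": 0,
}

def build_model(params=GBM_PARAMS):
    # Imported lazily so the parameter table can be read without scikit-learn.
    from sklearn.ensemble import GradientBoostingRegressor
    return GradientBoostingRegressor(**params)
```

After `build_model().fit(X, y)`, the fitted attribute `n_estimators_` reports how many rounds were actually kept once early stopping kicked in.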
This playground runs real boosting: every round fits an actual depth-1 stump to the current residuals of ten noisy training points (dots), while five held-out test points (diamonds) keep score. Turn the two knobs and watch them interact — especially what happens to the test error when ν = 1.0 meets many rounds.
Set ν = 1.0 and slide M up: train error crashes to ~0 while test error bottoms out early and climbs — the model is memorising noise. Now set ν = 0.1: progress is slower but the test error stays low far longer. That is why "small rate + many trees" is the default advice.
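The same experiment can be approximated offline. The sketch below assumes a sine-shaped target with Gaussian noise, ten training points, and five test points. The playground's actual data is not given in the text, so the numbers will differ from it, and on a sample this small the exact shape of the test curve depends on the random seed. The names `fit_stump`, `error_curves`, and `sample` are hypothetical.

```python
import math
import random

def fit_stump(xs, targets):
    """Least-squares depth-1 tree: returns (threshold, left_value, right_value)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue  # no threshold can separate equal x values
        left = [targets[i] for i in order[:k]]
        right = [targets[i] for i in order[k:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, (lo + hi) / 2, lm, rm)
    if best is None:  # every x identical: predict the mean on both sides
        mean = sum(targets) / len(targets)
        return xs[0], mean, mean
    return best[1:]

def mean_sq_err(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def error_curves(train_set, test_set, rate, n_rounds):
    """Boost stumps on train_set; return [(train_mse, test_mse)] per round."""
    (x_tr, y_tr), (x_te, y_te) = train_set, test_set
    base = sum(y_tr) / len(y_tr)
    p_tr, p_te = [base] * len(y_tr), [base] * len(y_te)
    curves = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(y_tr, p_tr)]
        thr, left_val, right_val = fit_stump(x_tr, residuals)
        step = lambda x: rate * (left_val if x <= thr else right_val)
        p_tr = [p + step(x) for p, x in zip(p_tr, x_tr)]
        p_te = [p + step(x) for p, x in zip(p_te, x_te)]
        curves.append((mean_sq_err(y_tr, p_tr), mean_sq_err(y_te, p_te)))
    return curves

rng = random.Random(0)  # fixed seed so reruns give the same data

def sample(n):
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.3) for x in xs]  # wavy target + noise
    return xs, ys

train_set, test_set = sample(10), sample(5)
results = {rate: error_curves(train_set, test_set, rate, 200) for rate in (1.0, 0.1)}
for rate, curves in results.items():
    best_round = min(range(len(curves)), key=lambda m: curves[m][1]) + 1
    print(f"rate={rate}: final train MSE={curves[-1][0]:.4f}, "
          f"final test MSE={curves[-1][1]:.4f}, lowest test MSE at round {best_round}")
```

Comparing the round with the lowest test error against the final round shows, for each rate, how much was lost by boosting past the sweet spot. Early stopping is designed to recover exactly that gap.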
Boosting vs bagging, and the family
Boosting
- Trees built one after another, each fixing the errors left by the ones before it
- Reduces bias by building up complexity
- Powerful but can overfit noise
Bagging
- Trees built independently, then averaged
- Reduces variance (see Random Forest)
- More robust to noise, harder to overfit