XGBoost
What it is
XGBoost (Extreme Gradient Boosting) builds a sequence of small decision trees in which each new tree is trained to correct the errors left behind by all the trees before it. The trees' outputs are then summed, each scaled by a small learning rate.
Picture a team working on a hard estimate. The first member makes a rough guess. Instead of starting over, the second member looks only at how wrong the first guess was and learns to predict that leftover error. The third member focuses on the mistakes the first two still kept making, and so on. No single member is brilliant, but each one specializes in the part the others got wrong — and adding their small corrections together produces a sharp answer. That is gradient boosting, and XGBoost is the fast, heavily engineered version of it.
Train trees one after another, where each tree predicts the leftover error (residual) of the current ensemble, and add them up scaled by a small learning rate.
XGBoost sits in the boosting family of ensemble learning, and the weak learners it stacks are decision trees. Three ideas combine to make it work:
Each round fits a small tree (often depth 3–6). Alone it is barely better than guessing — its job is just to capture one slice of the remaining error.
After each round we measure how far the prediction is from the truth. The next tree is trained on those gaps, not on the original target.
The final prediction is the baseline plus every tree's correction, each shrunk by the learning rate so no single tree overshoots.
The residual a tree chases is really the negative gradient of the loss function. For squared error that negative gradient is exactly (y − ŷ) — the plain residual — so each tree takes a step downhill on the loss, the same way gradient descent does, except the "parameter" being nudged is a whole new tree.
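The one-line derivation behind that claim, assuming the squared-error loss is written with the conventional factor of 1/2 (an assumption; without it the negative gradient is 2(y − ŷ), which differs only by a constant scale):

```latex
L(y, \hat{y}) = \tfrac{1}{2}\,(y - \hat{y})^2
\quad\Longrightarrow\quad
-\frac{\partial L}{\partial \hat{y}} = y - \hat{y}
```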
Boosting vs bagging
Both are ensembles — they combine many trees — but they pull in opposite directions. Bagging (as in a Random Forest) grows trees independently and averages them; boosting grows them in a chain, each one repairing the last.
Boosting:
- Sequential: tree m can only be built after tree m−1
- Each tree fits the residual errors of the ensemble so far
- Mainly reduces bias: turns weak learners into a strong one
- Trees are shallow and added with a learning rate
Bagging:
- Parallel: every tree is grown independently on a bootstrap sample
- Each tree fits the original target, then all votes are averaged
- Mainly reduces variance: averages away the noise of deep trees
- Trees are deep and treated as equals
Bagging takes many low-bias, high-variance trees and averages out the variance. Boosting takes many high-bias, low-variance stumps and drives down the bias one correction at a time.
How gradient boosting works (the recipe)
Strip away the engineering and gradient boosting is a short, repeatable loop. You start with the dumbest possible model and keep nudging it toward the truth.
Start with a single constant prediction for everything — typically the mean of the target for regression. It is wrong everywhere, but evenly.
Compute how far each prediction is from the truth: r = y − ŷ. For squared error these gaps are the negative gradients of the loss, that is, the direction of steepest improvement.
Train a small tree whose target is the residuals. It learns where the model is over- or under-shooting and by how much.
Add that tree's output to the running prediction, multiplied by a small learning rate η: ŷ ← ŷ + η · tree(x).
Recompute residuals against the updated prediction and fit the next tree. Each round shaves off a little more error.
The final model is the baseline plus M shrunken trees. Together they trace a sharp prediction that no single shallow tree could.
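The six steps above can be sketched in plain Python. This is a minimal illustration of ordinary gradient boosting for squared error with depth-1 trees on a single feature, not XGBoost's regularized, second-order algorithm. The function names and the ten data points are made up for the example.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree on 1-D inputs by minimising squared error.

    Returns (threshold, left_value, right_value).
    """
    best = None
    points = sorted(set(xs))
    # Candidate thresholds are midpoints between neighbouring distinct x values.
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((t - left_value) ** 2 for t in left)
                 + sum((t - right_value) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds, eta):
    """Plain gradient boosting for squared error with depth-1 trees."""
    baseline = sum(ys) / len(ys)                        # step 1: constant start
    preds = [baseline] * len(ys)
    stumps = []
    for _ in range(n_rounds):                           # step 5: repeat
        residuals = [y - p for y, p in zip(ys, preds)]  # step 2: residuals
        stump = fit_stump(xs, residuals)                # step 3: fit a tree to them
        preds = [p + eta * stump_predict(stump, x)      # step 4: shrunken update
                 for p, x in zip(preds, xs)]
        stumps.append(stump)
    return baseline, stumps


def predict(baseline, stumps, eta, x):
    # Step 6: baseline plus every shrunken correction.
    return baseline + eta * sum(stump_predict(s, x) for s in stumps)


def sse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds))


# Made-up toy data, purely for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1.2, 1.9, 3.1, 3.8, 5.2, 5.9, 7.1, 8.0, 8.8, 10.1]

eta = 0.1
baseline, stumps = boost(xs, ys, n_rounds=100, eta=eta)
error_before = sse(ys, [baseline] * len(ys))
error_after = sse(ys, [predict(baseline, stumps, eta, x) for x in xs])
print(f"SSE with baseline only: {error_before:.2f}")
print(f"SSE after 100 rounds:   {error_after:.2f}")
```

Because each stump's leaf values are the mean residuals in that leaf, every round with 0 < η ≤ 1 can only lower the training squared error, which is why the second printed number is smaller than the first.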
The learning rate η (often 0.01–0.3) scales down every tree's contribution before it is added. A large η lets each tree overcommit and chase noise; a small η forces tiny, cautious steps, so the model leans on many trees instead of a few. The classic recipe is small η + more trees — it converges more smoothly and generalizes better, at the cost of longer training.
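A back-of-the-envelope sketch of that trade-off, under the idealized assumption that every tree predicts the current residual perfectly. Each round then removes a fraction η of the remaining error, so after M rounds a fraction (1 − η)^M is left. Real trees are imperfect, so actual round counts will be higher; `rounds_needed` is a hypothetical helper, not a library function.

```python
import math


def rounds_needed(eta, remaining=0.01):
    """Rounds until at most `remaining` of the initial residual is left,
    assuming each tree fits the current residual perfectly."""
    return math.ceil(math.log(remaining) / math.log(1.0 - eta))


for eta in (0.3, 0.1, 0.01):
    print(f"eta={eta}: {rounds_needed(eta)} rounds to reach 1% of the initial residual")
```

Even in this best case, cutting η from 0.3 to 0.01 raises the required number of rounds from 13 to 459, which is the "small η + more trees" cost in concrete terms.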
What makes XGBoost "extreme"
Plain gradient boosting is the idea; XGBoost is the implementation that made it win. Almost every difference is an engineering choice that makes the loop faster, more accurate, or harder to overfit.
XGBoost adds a penalty (lambda for L2, alpha for L1) on the leaf weights right inside the objective. Big leaf values cost something, so the model stays simple and overfits less.
It uses both the gradient and the second derivative (Hessian) of the loss. Knowing the curvature lets it pick smarter leaf values and split points than first-order boosting.
Each split learns a default direction for missing entries. No imputation needed — XGBoost decides which branch a gap should fall into.
Each tree can see a random fraction of rows (subsample) and features (colsample). This injects bagging-style variance reduction into the boosting loop.
Feature values are pre-sorted into blocks so candidate splits across columns are evaluated in parallel. Trees are sequential, but building each one is not.
A split must reduce the loss by at least gamma to be kept. Branches that don't earn their complexity are pruned away.
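To make the regularization, second-order, missing-value, and pruning ideas concrete, here is a minimal sketch of the leaf-weight and split-gain formulas from the XGBoost formulation: w* = −G / (H + λ) and Gain = ½ [G_L²/(H_L+λ) + G_R²/(H_R+λ) − (G_L+G_R)²/(H_L+H_R+λ)] − γ, where G and H are the sums of gradients and Hessians over the rows in a leaf. The L1 term (alpha) is left out for brevity, the data points are made up, and the function names are hypothetical helpers rather than part of the xgboost library.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf value under an L2 penalty: -G / (H + lambda)."""
    return -G / (H + lam)


def leaf_score(G, H, lam):
    """Loss reduction a leaf achieves (up to the factor 1/2)."""
    return G * G / (H + lam)


def split_gain(GL, HL, GR, HR, lam, gamma):
    """Gain of splitting a node into left/right children, minus the gamma penalty."""
    parent = leaf_score(GL + GR, HL + HR, lam)
    children = leaf_score(GL, HL, lam) + leaf_score(GR, HR, lam)
    return 0.5 * (children - parent) - gamma


def default_direction(GL, HL, GR, HR, G_missing, H_missing, lam, gamma):
    """Send rows with a missing feature value to whichever side gives more gain."""
    gain_left = split_gain(GL + G_missing, HL + H_missing, GR, HR, lam, gamma)
    gain_right = split_gain(GL, HL, GR + G_missing, HR + H_missing, lam, gamma)
    return "left" if gain_left >= gain_right else "right"


# Made-up example: squared error 1/2 (y - pred)^2, so gradient = pred - y, Hessian = 1.
ys = [1.0, 1.0, 5.0, 5.0]
preds = [3.0, 3.0, 3.0, 3.0]
grads = [p - y for y, p in zip(ys, preds)]
hess = [1.0] * len(ys)

# Candidate split: first two rows go left, last two go right.
GL, HL = sum(grads[:2]), sum(hess[:2])
GR, HR = sum(grads[2:]), sum(hess[2:])

lam = 1.0
left_w = leaf_weight(GL, HL, lam)    # shrunk toward zero compared with lam = 0
right_w = leaf_weight(GR, HR, lam)

gain_small_gamma = split_gain(GL, HL, GR, HR, lam, gamma=1.0)  # positive: keep split
gain_large_gamma = split_gain(GL, HL, GR, HR, lam, gamma=6.0)  # negative: prune it

# A row with a missing feature value whose target is 1.0 (gradient 3 - 1 = 2).
direction = default_direction(GL, HL, GR, HR, 2.0, 1.0, lam, gamma=1.0)

print(left_w, right_w, gain_small_gamma, gain_large_gamma, direction)
```

With λ = 0 the left leaf would take the full correction of −2; with λ = 1 it is pulled back to −4/3, which is the "big leaf values cost something" effect. The same split survives γ = 1 but is rejected at γ = 6, and the row with the missing value is routed left because it resembles the left-hand rows.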
This stack of tricks is a large part of why XGBoost (and cousins like LightGBM and CatBoost) has been a frequent winner of Kaggle competitions on structured/tabular data. On rows-and-columns problems it often matches or beats deep neural networks while being faster to train and easier to tune.
Watch the residuals shrink
The animation plots ten data points and starts with the dumbest model: a flat line at the mean. As the legend shows, the blue curve is the model's prediction and the red lines are the residuals — how far each point sits from that prediction. Each step adds one more small tree, so the blue curve bends to hug the points and the red residual lines visibly shorten. Keep an eye on the round counter and the total squared error: it falls from over 50 toward zero as the ensemble grows.
Boosting = a sum of small trees
The same model written a different way: instead of a bending curve, the ensemble is literally a sequence of tiny trees added together, F = baseline + η·tree1 + η·tree2 + η·tree3 + …. Each tree below is a depth-1 regression stump: it asks one yes/no question and its two leaves hold numeric corrections (not class labels), each leaf adding a little to or subtracting a little from the running prediction. Step through to watch the running fit at the bottom snap closer to the points as each small tree joins the sum.
Key hyperparameters
XGBoost is famously tunable — which is both its strength and its tax. These are the dials that matter most, and they trade fitting power against overfitting.
- n_estimators: how many trees to add. More trees fit harder, but past a point they start memorizing noise. Tuned together with the learning rate.
- learning_rate (η): how much of each tree to keep. Lower means slower, steadier learning that usually generalizes better; pair it with more n_estimators.
- max_depth: how deep each tree can grow. Deeper trees capture feature interactions but overfit fast; 3–6 is the usual sweet spot.
- subsample: fraction of training rows used per tree (e.g. 0.8). Below 1.0 it adds randomness that fights overfitting and speeds training.
- reg_lambda / reg_alpha: L2 (lambda) and L1 (alpha) penalties on leaf weights. Turn them up to rein in a model that is overfitting.
- gamma: minimum loss reduction required to make a split. Higher gamma means a more conservative, pruned tree.
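As a configuration sketch, the dials above map onto keyword arguments of xgboost's scikit-learn style estimators. The values below are illustrative starting points chosen for this example, not recommendations from the text, and `colsample_bytree` is included as the per-tree feature-sampling counterpart of `subsample`.

```python
# Illustrative starting values only; tune them against a validation set.
params = {
    "n_estimators": 1000,     # upper bound on trees; usually paired with early stopping
    "learning_rate": 0.05,    # eta: shrinkage applied to every tree
    "max_depth": 4,           # shallow trees, inside the usual 3-6 range
    "subsample": 0.8,         # fraction of rows sampled per tree
    "colsample_bytree": 0.8,  # fraction of features sampled per tree
    "reg_lambda": 1.0,        # L2 penalty on leaf weights
    "reg_alpha": 0.0,         # L1 penalty on leaf weights
    "gamma": 0.0,             # minimum loss reduction required to split
}

# With xgboost installed, the dictionary could be passed as keyword arguments:
#   from xgboost import XGBRegressor
#   model = XGBRegressor(**params)
```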
Instead of guessing the right n_estimators, set it high and use early stopping: keep adding trees while a validation score improves and halt as soon as it stalls for a set number of rounds. You get the best ensemble size automatically and avoid training past the point of diminishing returns.
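The stopping rule itself is simple enough to sketch without any library. The validation losses below are made up, and `best_round` is a hypothetical helper. In xgboost the same behaviour is requested through the `early_stopping_rounds` setting together with a validation set.

```python
def best_round(val_losses, patience):
    """Return how many trees to keep: the round with the lowest validation loss,
    stopping once `patience` consecutive rounds pass without improvement."""
    best_idx, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = i, loss
        elif i - best_idx >= patience:
            break  # validation score has stalled for `patience` rounds
    return best_idx + 1


# Made-up validation loss after each boosting round.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.48, 0.49, 0.50, 0.52, 0.47]

print(best_round(val_losses, patience=3))
print(best_round(val_losses, patience=5))
```

With a patience of 3 the loop halts after round 8 and keeps 5 trees, never seeing the late improvement in round 9. A patience of 5 survives the plateau and keeps all 9. Patience is therefore a dial of its own: too small and training stops on a temporary stall, too large and it wastes rounds.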
Feel the two most important dials yourself. Slide the number of rounds M and the learning rate η on the same ten points from the animation above. Watch the training error: more rounds drive it down, a tiny η barely moves the curve so it needs many more rounds to catch up, and a large η plus many rounds snaps the curve onto every single point — a jagged fit that has started memorizing the data instead of smoothing it.
When it works — and when it doesn't
When it works:
- The data is tabular / structured: rows and columns of mixed numeric and categorical features
- You want state-of-the-art accuracy without a deep-learning pipeline
- There are missing values: it handles them natively, no imputation
- You are willing to tune: it rewards careful hyperparameter search
When it doesn't:
- There are many hyperparameters to tune, more dials than a Random Forest
- The data is small and noisy: it can overfit unless heavily regularized
- Training is sequential, so it is slower to fit than a fully parallel forest
- The problem is images, audio, or text: deep neural networks win on raw, high-dimensional signals
When the problem is a spreadsheet and you need a strong baseline fast, XGBoost (or LightGBM / CatBoost) is the first thing to reach for. Save the neural networks for pixels and tokens.