XGBoost · Suman Bhadra Notes

What it is

XGBoost (Extreme Gradient Boosting) builds a sequence of small decision trees in which each new tree is trained to fix the errors left behind by all the trees before it. The trees' outputs are then summed, each scaled by a small learning rate.

Picture a team working on a hard estimate. The first member makes a rough guess. Instead of starting over, the second member looks only at how wrong the first guess was and learns to predict that leftover error. The third member focuses on the mistakes the first two still kept making, and so on. No single member is brilliant, but each one specializes in the part the others got wrong — and adding their small corrections together produces a sharp answer. That is gradient boosting, and XGBoost is the fast, heavily engineered version of it.

In one sentence

Train trees one after another, where each tree predicts the leftover error (residual) of the current ensemble, and add them up scaled by a small learning rate.

XGBoost sits in the boosting family of ensemble learning, and the weak learners it stacks are decision trees. Three ideas combine to make it work:

Weak learner: a shallow tree

Each round fits a small tree (often depth 3–6). Alone it is barely better than guessing — its job is just to capture one slice of the remaining error.

Residuals: what's still wrong

After each round we measure how far the prediction is from the truth. The next tree is trained on those gaps, not on the original target.

Additive sum: small steps add up

The final prediction is the baseline plus every tree's correction, each shrunk by the learning rate so no single tree overshoots.

Why it's called "gradient" boosting

The residual a tree chases is really the negative gradient of the loss function. For squared error written as ½(y − ŷ)², that negative gradient is exactly (y − ŷ), the plain residual. Each tree therefore takes a step downhill on the loss, the same way gradient descent does, except that the "parameter" being nudged is a whole new tree.
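As a worked check, here is the one-line derivation behind that claim. It assumes the squared-error loss carries the conventional factor of one half; without that factor the gradient picks up a constant 2, which the learning rate absorbs.

```latex
L(y, \hat{y}) = \tfrac{1}{2}\,(y - \hat{y})^{2}
\quad\Longrightarrow\quad
\frac{\partial L}{\partial \hat{y}} = -(y - \hat{y})
\quad\Longrightarrow\quad
-\frac{\partial L}{\partial \hat{y}} = y - \hat{y}
```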

Boosting vs bagging

Both are ensembles — they combine many trees — but they pull in opposite directions. Bagging (as in a Random Forest) grows trees independently and averages them; boosting grows them in a chain, each one repairing the last.

Boosting (XGBoost)

Sequential — tree m can only be built after tree m−1
Each tree fits the residual errors of the ensemble so far
Mainly reduces bias — turns weak learners into a strong one
Trees are shallow and added with a learning rate

Bagging (Random Forest)

Parallel — every tree is grown independently on a bootstrap sample
Each tree fits the original target, then all votes are averaged
Mainly reduces variance — averages away the noise of deep trees
Trees are deep and treated as equals

The trade-off in a line

Bagging takes many low-bias, high-variance trees and averages out the variance. Boosting takes many high-bias, low-variance shallow trees and drives down the bias one correction at a time.

How gradient boosting works (the recipe)

Strip away the engineering and gradient boosting is a short, repeatable loop. You start with the dumbest possible model and keep nudging it toward the truth.

Step 1: baseline

Start with a single constant prediction for everything — typically the mean of the target for regression. It is wrong everywhere, but evenly.

Step 2: residuals

Compute how far each prediction is from the truth: r = y − ŷ. For squared-error loss these gaps are the negative gradients of the loss, the direction of steepest improvement; other losses swap in their own negative gradient.

Step 3: fit a tree

Train a small tree whose target is the residuals. It learns where the model is over- or under-shooting and by how much.

Step 4: add, scaled by η

Add that tree's output to the running prediction, multiplied by a small learning rate η: ŷ ← ŷ + η · tree(x).

Step 5: repeat M times

Recompute residuals against the updated prediction and fit the next tree. Each round shaves off a little more error.

Result: a strong ensemble

The final model is the baseline plus M shrunken trees. Together they trace a sharp prediction that no single shallow tree could.
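The five steps above can be sketched in plain Python. This is a minimal illustration, not XGBoost itself: it assumes squared-error loss, a single numeric feature, and depth-1 stumps as the weak learner, and the function names and toy data are made up for the example.

```python
def fit_stump(xs, residuals):
    """Find the one-threshold split that best fits the residuals (least squares)."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_value, right_value)


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def gradient_boost(xs, ys, n_rounds=30, eta=0.3):
    baseline = sum(ys) / len(ys)            # Step 1: constant prediction (the mean)
    preds = [baseline] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):               # Step 5: repeat M times
        residuals = [y - p for y, p in zip(ys, preds)]   # Step 2: r = y - y_hat
        history.append(sum(r * r for r in residuals))    # squared error before this round
        stump = fit_stump(xs, residuals)                 # Step 3: fit a tree to residuals
        preds = [p + eta * stump_predict(stump, x)       # Step 4: add, scaled by eta
                 for p, x in zip(preds, xs)]
        stumps.append(stump)
    history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)))
    return baseline, stumps, history


# Hypothetical toy data: ten points on a noisy upward trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 2.9, 3.2, 4.8, 5.1, 6.9, 7.2, 7.8, 9.1, 9.9]
baseline, stumps, history = gradient_boost(xs, ys)
print(f"squared error: {history[0]:.2f} -> {history[-1]:.4f}")
```

The `history` list records the total squared error before each round and once more at the end, so it shows the error shrinking as trees are added.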

Learning rate (shrinkage): smaller is wiser

The learning rate η (often 0.01–0.3) scales down every tree's contribution before it is added. A large η lets each tree overcommit and chase noise; a small η forces tiny, cautious steps, so the model leans on many trees instead of a few. The classic recipe is small η + more trees — it converges more smoothly and generalizes better, at the cost of longer training.
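To see why a small η demands more trees, consider an idealized case (an assumption made only for this arithmetic): every tree predicts the current residual perfectly. Each round then removes a fraction η of what is left, so the residual after M rounds is (1 − η)^M of its starting size. Real trees are imperfect, so actual counts will differ; the helper names below are hypothetical.

```python
import math


def remaining_fraction(eta, n_rounds):
    """Fraction of the initial residual left after n_rounds idealized rounds."""
    return (1.0 - eta) ** n_rounds


def rounds_needed(eta, target_fraction):
    """Smallest number of idealized rounds that shrinks the residual to target_fraction."""
    return math.ceil(math.log(target_fraction) / math.log(1.0 - eta))


for eta in (0.3, 0.1, 0.01):
    print(eta, rounds_needed(eta, 0.01))
```

The loop prints how many idealized rounds each learning rate needs to cut the residual to 1% of its starting value; the count grows quickly as η shrinks.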

What makes XGBoost "extreme"

Plain gradient boosting is the idea; XGBoost is the implementation that made it win. Almost every difference is an engineering choice that makes the loop faster, more accurate, or harder to overfit.

Regularization: L1 + L2 on leaves

XGBoost adds a penalty (lambda for L2, alpha for L1) on the leaf weights right inside the objective. Big leaf values cost something, so the model stays simple and overfits less.

Second-order gradients: Newton, not just slope

It uses both the gradient and the second derivative (Hessian) of the loss. Knowing the curvature lets it pick smarter leaf values and split points than first-order boosting.
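For reference, this is the standard second-order form of the per-tree objective and the leaf value it implies. The notation is introduced here for illustration: g_i and h_i are the first and second derivatives of the loss at the current prediction for row i, I_j is the set of rows landing in leaf j, w_j is that leaf's value, and T is the number of leaves. The L1 term (alpha) is left out to keep the algebra short.

```latex
\tilde{\mathcal{L}} \;\approx\; \sum_{j=1}^{T} \Big[ G_j\, w_j + \tfrac{1}{2}\,(H_j + \lambda)\, w_j^{2} \Big] + \gamma T,
\qquad
G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i

w_j^{*} = -\,\frac{G_j}{H_j + \lambda}
```

For squared error, g_i = ŷ_i − y_i and h_i = 1, so the optimal leaf value is the sum of the residuals in the leaf divided by (row count + λ): a mean residual shrunk toward zero by the L2 penalty.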

Missing values: handled natively

Each split learns a default direction for missing entries. No imputation needed — XGBoost decides which branch a gap should fall into.
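One way to picture how a default direction is learned: for a candidate split, try sending all rows with a missing value left, then right, and keep whichever gives the better score. The sketch below assumes the usual G²/(H + λ) leaf score built from summed gradients G and Hessians H; the function names and numbers are hypothetical, and this is a simplification of what the library does internally.

```python
def leaf_score(G, H, lam):
    """Quality of a leaf from its summed gradient G and summed Hessian H."""
    return G * G / (H + lam)


def best_default_direction(left, right, missing, lam=1.0):
    """Each argument is a (G, H) pair for rows that go left, go right, or lack the feature."""
    gl, hl = left
    gr, hr = right
    gm, hm = missing
    # The parent score is identical in both cases, so comparing child scores is enough.
    score_if_left = leaf_score(gl + gm, hl + hm, lam) + leaf_score(gr, hr, lam)
    score_if_right = leaf_score(gl, hl, lam) + leaf_score(gr + gm, hr + hm, lam)
    if score_if_left >= score_if_right:
        return "left", score_if_left
    return "right", score_if_right


# Missing rows whose gradients resemble the left child are routed left.
direction, _ = best_default_direction(left=(-4.0, 2.0), right=(6.0, 3.0), missing=(-3.0, 1.0))
print(direction)
```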

Subsampling: rows and columns

Each tree can see a random fraction of rows (subsample) and features (the colsample_* parameters, such as colsample_bytree). This injects bagging-style variance reduction into the boosting loop.
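A minimal sketch of what per-tree subsampling amounts to, assuming rows and columns are drawn without replacement and identified by index. The function name and the rounding rule are illustrative choices, not the library's internals.

```python
import random


def sample_rows_and_columns(n_rows, n_cols, subsample=0.8, colsample=0.8, seed=0):
    """Pick the row and column indices one tree is allowed to see."""
    rng = random.Random(seed)
    n_row_pick = max(1, round(subsample * n_rows))
    n_col_pick = max(1, round(colsample * n_cols))
    rows = sorted(rng.sample(range(n_rows), n_row_pick))
    cols = sorted(rng.sample(range(n_cols), n_col_pick))
    return rows, cols


# A fresh seed per boosting round gives each tree a different view of the data.
rows, cols = sample_rows_and_columns(n_rows=100, n_cols=5, seed=1)
print(len(rows), "rows and", len(cols), "columns for this tree")
```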

Parallel split finding: fast on real data

Feature values are pre-sorted into blocks so candidate splits across columns are evaluated in parallel. Trees are sequential, but building each one is not.

Pruning via gamma: grow then trim

A split must reduce the loss by at least gamma to be kept. Branches that don't earn their complexity are pruned away.
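The gamma rule can be illustrated with the standard split-gain formula, which compares the two children's scores against the parent's and then subtracts gamma. This sketch checks a single split in isolation and ignores the L1 term; the function names and numbers are hypothetical.

```python
def leaf_score(G, H, lam):
    """Score of a leaf from its summed gradient G and summed Hessian H."""
    return G * G / (H + lam)


def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Loss reduction from splitting a node into left/right children, net of gamma."""
    parent = leaf_score(GL + GR, HL + HR, lam)
    children = leaf_score(GL, HL, lam) + leaf_score(GR, HR, lam)
    return 0.5 * (children - parent) - gamma


def keep_split(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """A split survives only if its net gain is positive."""
    return split_gain(GL, HL, GR, HR, lam, gamma) > 0


# The same split is kept with no penalty and rejected once gamma exceeds its raw gain.
print(keep_split(-4.0, 2.0, 6.0, 3.0, gamma=0.0))
print(keep_split(-4.0, 2.0, 6.0, 3.0, gamma=10.0))
```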

Why it dominates tabular ML

This stack of tricks is a large part of why XGBoost (and cousins like LightGBM and CatBoost) has been a fixture of winning Kaggle solutions on structured/tabular data. On rows-and-columns problems it often matches or beats deep neural networks while being faster to train and easier to tune.

Watch the residuals shrink

The animation plots ten data points and starts with the dumbest model: a flat line at the mean. As the legend shows, the blue curve is the model's prediction and the red lines are the residuals, that is, how far each point sits from that prediction. Each step adds one more small tree, so the blue curve bends to hug the points and the red residual lines visibly shorten. Keep an eye on the round counter and the total squared error: the error falls from over 50 toward zero as the ensemble grows.

Boosting = a sum of small trees

The same model written a different way: instead of a bending curve, the ensemble is literally a sequence of tiny trees added together — F = F₀ + η·tree₁ + η·tree₂ + η·tree₃ + …. Each tree below is a depth-1 regression stump: it asks one yes/no question and its two leaves hold numeric corrections (not class labels), each leaf adding a little to or subtracting a little from the running prediction. Step through to watch the running fit at the bottom snap closer to the points as each small tree joins the sum.
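The sum F = F₀ + η·tree₁ + η·tree₂ + … can be written out directly. In this sketch each stump is a hypothetical (threshold, left_value, right_value) tuple over a single feature, and the baseline, learning rate, and leaf values are made-up numbers chosen only to show the bookkeeping.

```python
def predict(x, baseline, stumps, eta):
    """Baseline plus every stump's leaf correction, each scaled by eta."""
    total = baseline
    for threshold, left_value, right_value in stumps:
        total += eta * (left_value if x <= threshold else right_value)
    return total


baseline = 5.0
eta = 0.5
stumps = [
    (3.0, -2.0, 2.0),   # tree 1: is x <= 3? leaves hold corrections, not class labels
    (6.0, -1.0, 1.0),   # tree 2: is x <= 6?
]

for x in (2.0, 4.0, 7.0):
    print(x, predict(x, baseline, stumps, eta))
```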

Key hyperparameters

XGBoost is famously tunable — which is both its strength and its tax. These are the dials that matter most, and they trade fitting power against overfitting.

n_estimators: number of rounds (M)

How many trees to add. More trees fit harder, but past a point they start memorizing noise. Tuned together with the learning rate.

learning_rate: shrinkage (η)

How much of each tree to keep. Lower means slower, steadier learning that usually generalizes better — pair it with more n_estimators.

max_depth: tree complexity

How deep each tree can grow. Deeper trees capture feature interactions but overfit fast; 3–6 is the usual sweet spot.

subsample: row sampling

Fraction of training rows used per tree (e.g. 0.8). Below 1.0 it adds randomness that fights overfitting and speeds training.

lambda / alpha: regularization

L2 (lambda) and L1 (alpha) penalties on leaf weights. Turn them up to rein in a model that is overfitting.

gamma: min split gain

Minimum loss reduction required to make a split. Higher gamma means a more conservative, pruned tree.
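Collected in one place, the dials above might look like the following starting configuration. The values are illustrative, not recommendations. The keys follow the spelling used by XGBoost's scikit-learn style wrapper, where lambda and alpha appear as reg_lambda and reg_alpha; the dict is plain Python and does not import the library.

```python
# Illustrative starting point; every value here is an assumption to be tuned.
params = {
    "n_estimators": 500,     # number of boosting rounds (M)
    "learning_rate": 0.05,   # shrinkage (eta): small steps, many trees
    "max_depth": 4,          # shallow trees, within the usual 3-6 range
    "subsample": 0.8,        # fraction of rows each tree sees
    "reg_lambda": 1.0,       # L2 penalty on leaf weights (lambda)
    "reg_alpha": 0.0,        # L1 penalty on leaf weights (alpha)
    "gamma": 0.0,            # minimum loss reduction required to split
}

# With xgboost installed, a dict like this could be unpacked into an estimator,
# for example xgboost.XGBRegressor(**params).
for name, value in params.items():
    print(f"{name} = {value}")
```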

Early stopping does the work for you

Instead of guessing the right n_estimators, set it high and use early stopping: keep adding trees while a validation score improves and halt as soon as it stalls for a set number of rounds. You get the best ensemble size automatically and avoid training past the point of diminishing returns.
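The stopping rule itself is simple enough to sketch without the library. The function below is a hypothetical helper that scans a list of per-round validation losses (lower is better) and halts once no improvement has been seen for `patience` rounds; in XGBoost the corresponding setting is called early_stopping_rounds.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, last_round_evaluated) for a sequence of validation losses."""
    best_loss = float("inf")
    best_round = -1
    last_round = -1
    for last_round, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, last_round
        elif last_round - best_round >= patience:
            break  # no improvement for `patience` rounds: stop adding trees
    return best_round, last_round


# Made-up validation curve: improves, then stalls.
losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.69]
best, stopped = early_stopping_round(losses, patience=3)
print("best round:", best, "stopped at round:", stopped)
```

With this curve the loop halts before reaching the final value, which shows the trade-off: a short patience saves training time but can miss a late improvement.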

Feel the two most important dials yourself. Slide the number of rounds M and the learning rate η on the same ten points from the animation above. Watch the training error: more rounds drive it down, a tiny η barely moves the curve so it needs many more rounds to catch up, and a large η plus many rounds snaps the curve onto every single point — a jagged fit that has started memorizing the data instead of smoothing it.


When it works — and when it doesn't

Works well when

The data is tabular / structured — rows and columns of mixed numeric and categorical features
You want state-of-the-art accuracy without a deep-learning pipeline
There are missing values — it handles them natively, no imputation
You are willing to tune — it rewards careful hyperparameter search

Struggles when

You have little time to tune: there are many hyperparameters, more dials than a Random Forest
The data is small and noisy — it can overfit unless heavily regularized
Training time is tight: trees are built sequentially, so it is slower to fit than a fully parallel forest
The problem is images, audio, or text — deep neural networks win on raw, high-dimensional signals

The default for tables

When the problem is a spreadsheet and you need a strong baseline fast, XGBoost (or LightGBM / CatBoost) is the first thing to reach for. Save the neural networks for pixels and tokens.