Ensemble Learning

Tags: ML · ensemble · bagging · boosting · stacking · supervised

What it is

Ensemble learning is the idea that a committee of mediocre models can often beat a single expert: you train many imperfect models and combine their answers into one.

There is a famous county-fair story, reported by Francis Galton in 1907: a crowd was asked to guess the weight of an ox. Almost no single guess was exactly right, but the average of all the guesses landed within a pound of the true weight, closer than nearly every individual guess, including those of the cattle experts. Ensembles work the same way. One model overshoots, another undershoots, a third gets confused on a different set of cases. Pool them and the errors largely cancel, leaving an answer that is typically steadier and more accurate than any single contributor.

It is the same instinct as asking several doctors for a second and third opinion before a big decision. Each may miss something, but it is unlikely they all miss the same thing.

In one sentence

Many imperfect models that make different mistakes can be averaged into one model that is often better than any single one of them.

Why it actually works

The magic is not "more models" — it is diverse, decorrelated errors. If each model is wrong in its own way, the wrong answers point in different directions and wash out in the average, while the correct signal they all share reinforces. Combine models that fail independently and the combination is more reliable than its parts.
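
One standard way to make this precise is to look at the variance of an averaged error. The sketch below assumes an idealized committee of N models whose errors ε_i each have the same variance σ² and the same pairwise correlation ρ; those symbols are assumptions introduced here for illustration, not something defined elsewhere in this article.

```latex
\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N}\varepsilon_i\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{N}\,\sigma^2
```

With ρ = 0 (independent errors) the variance shrinks as σ²/N, so every extra model helps. With ρ = 1 (identical errors) it stays at σ² no matter how large N gets.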

The main families

Almost every ensemble method is a variation on one of these strategies for building the committee and tallying its votes.

Bagging (parallel · reduces variance)

Train many copies of the same model in parallel, each on a random bootstrap sample (rows drawn with replacement) of the data, then average their votes. Decorrelating the models smooths out the noise. Random Forest is the canonical example.
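
A minimal sketch of the bagging recipe, using only the Python standard library. The base model here is a 1-nearest-neighbour regressor, chosen only because it is short to write and noisy enough for averaging to help; it is not how Random Forest works internally, and the data, names, and settings are all made up for illustration.

```python
import math
import random

def fit_1nn(xs, ys):
    """1-nearest-neighbour regressor: a deliberately high-variance base model."""
    points = list(zip(xs, ys))
    def predict(x):
        return min(points, key=lambda p: abs(p[0] - x))[1]
    return predict

def bootstrap_sample(xs, ys, rng):
    """Draw len(xs) rows with replacement."""
    idx = [rng.randrange(len(xs)) for _ in xs]
    return [xs[i] for i in idx], [ys[i] for i in idx]

def fit_bagged(xs, ys, n_models, rng):
    """Fit one base model per bootstrap sample and average their predictions."""
    models = [fit_1nn(*bootstrap_sample(xs, ys, rng)) for _ in range(n_models)]
    def predict(x):
        return sum(m(x) for m in models) / len(models)
    return predict

def mse(model, xs, truth):
    return sum((model(x) - t) ** 2 for x, t in zip(xs, truth)) / len(xs)

# Hypothetical toy data: a sine curve observed with heavy noise.
rng = random.Random(0)
train_x = [rng.uniform(0, 6) for _ in range(200)]
train_y = [math.sin(x) + rng.gauss(0, 0.5) for x in train_x]
test_x = [i * 0.03 for i in range(200)]
test_truth = [math.sin(x) for x in test_x]  # noise-free target

single = fit_1nn(train_x, train_y)
bagged = fit_bagged(train_x, train_y, n_models=50, rng=rng)
single_mse = mse(single, test_x, test_truth)
bagged_mse = mse(bagged, test_x, test_truth)
print(f"single 1-NN MSE: {single_mse:.3f}   bagged MSE: {bagged_mse:.3f}")
```

On this toy problem the bagged model should land noticeably closer to the true curve than the single model, because each bootstrap copy latches onto different noisy points and the average smooths them out.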

Boosting (sequential · reduces bias)

Train models one after another, where each new model focuses on the mistakes the previous ones made. The committee is built up gradually, chipping away at the error. AdaBoost and XGBoost are the famous ones.
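
The "focus on the previous mistakes" loop can be sketched as a tiny gradient-boosting regressor with squared loss, where each new single-split stump is fitted to the residuals left by the committee so far. This is a simplified illustration under assumed settings (toy data, a made-up learning rate of 0.5), not the AdaBoost or XGBoost algorithm.

```python
import math

def fit_stump(xs, residuals):
    """Best single-split regressor: predicts the mean residual on each side."""
    best = None
    for t in sorted(set(xs))[1:]:  # skipping the smallest x keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

def fit_boosted(xs, ys, n_rounds, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)          # round 0: predict the mean
    stumps, preds, history = [], [base] * len(xs), []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)
    return predict, history

# Hypothetical toy data: a noise-free sine curve.
xs = [i / 10 for i in range(60)]
ys = [math.sin(x) for x in xs]
model, train_mse = fit_boosted(xs, ys, n_rounds=30)
print(f"training MSE after 1 round: {train_mse[0]:.4f}, after 30: {train_mse[-1]:.4f}")
```

No single stump can follow a sine curve, but the training error falls round after round as each stump corrects what the earlier ones left behind. The same loop, run long enough on noisy data, will also start fitting the noise.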

Stacking (a meta-model blends)

Train several different base models (say a tree, a linear model, a KNN), then train a small meta-model that learns the best way to combine their predictions — instead of a fixed average.
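
A small sketch of stacking with two base models, a straight-line fit and a KNN regressor, plus the simplest possible meta-model: a single learned blend weight. It assumes a plain holdout split for fitting the meta-model (practical setups usually use out-of-fold predictions instead), and every name and dataset here is hypothetical.

```python
import math
import random

def fit_linear(xs, ys):
    """Ordinary least-squares line y = a + b*x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

def fit_knn(xs, ys, k=5):
    """k-nearest-neighbour regressor: mean target of the k closest points."""
    points = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

def fit_blend_weight(pa, pb, ys):
    """Meta-model: least-squares w for the blend w*pa + (1 - w)*pb."""
    num = sum((a - b) * (y - b) for a, b, y in zip(pa, pb, ys))
    den = sum((a - b) ** 2 for a, b in zip(pa, pb))
    return num / den if den else 0.5

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(1)

def make_data(n):
    """Hypothetical toy data: a trend plus a wiggle plus noise."""
    xs = [rng.uniform(0, 6) for _ in range(n)]
    return xs, [0.5 * x + math.sin(2 * x) + rng.gauss(0, 0.3) for x in xs]

train_x, train_y = make_data(150)  # used to fit the base models
hold_x, hold_y = make_data(150)    # used only to fit the meta-model
test_x, test_y = make_data(150)    # used only for the final report

linear, knn = fit_linear(train_x, train_y), fit_knn(train_x, train_y)
w = fit_blend_weight([linear(x) for x in hold_x], [knn(x) for x in hold_x], hold_y)

def stacked(x):
    return w * linear(x) + (1 - w) * knn(x)

for name, m in (("linear", linear), ("knn", knn), ("stacked", stacked)):
    print(f"{name:8s} test MSE: {mse(m, test_x, test_y):.3f}   (blend weight w = {w:.2f})")
```

The meta-model is fitted on rows the base models never saw, so that it learns how much each base model can be trusted on new data rather than how well it memorized the training set.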

Voting / Averaging (the simple combine)

The no-frills baseline: run a handful of models and take the majority vote (classification) or the mean (regression). Surprisingly effective when the models are genuinely different.
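
The combining step itself is only a few lines. A minimal sketch, assuming each model's prediction for one input has already been collected into a list (the function names are illustrative):

```python
from collections import Counter
from statistics import mean

def majority_vote(labels):
    """Most common label; ties go to the label seen first in the list."""
    return Counter(labels).most_common(1)[0][0]

def average(predictions):
    """Plain mean of the models' numeric predictions."""
    return mean(predictions)

print(majority_vote(["cat", "dog", "cat"]))  # cat
print(average([2.0, 3.0, 4.0]))              # 3.0
```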

Bagging vs Boosting

Bagging and boosting are the two workhorses, and they pull in opposite directions. Bagging builds independent models in parallel to tame variance; boosting builds dependent models in sequence to tame bias.

Bagging
  • Parallel: every model trains independently, so all of them can train at the same time
  • Each model sees a different bootstrap sample of the data
  • Mainly reduces variance: averages away the noise of any one model
  • Hard to overfit by adding models: more trees rarely hurt, the error just levels off

Boosting
  • Sequential: each model is trained after, and depends on, the ones before it
  • Each model is steered toward the previous mistakes
  • Mainly reduces bias: turns weak learners into a strong one
  • Can overfit if you run too many rounds without regularization

The two flagship algorithms make the difference concrete. Random Forest is bagging applied to decision trees — hundreds of trees, each on a bootstrap sample and a random subset of features, all voting together. XGBoost is gradient boosting taken to its engineering limit — trees added one at a time, each correcting the residual errors of the ensemble so far. Both are decision-tree ensembles, but the recipe for building the committee could hardly be more different.

Watch the crowd beat the expert

The animation starts with two classes scattered on a plane. It then trains a few weak learners one at a time — each is a crude single-split boundary that gets a chunk of points wrong. No single one is any good. The final step overlays the combined boundary the ensemble votes for, and you can see it curve around the data in a way none of the individual splits could — with a lower error than any single learner.

Why diversity matters

This is the single most important idea, and the one beginners most often miss. Each member still has to be better than chance, but the gain from combining them comes entirely from how much their mistakes differ.

Identical models gain you nothing

If every model in the committee is the same and makes the same mistakes, averaging them just gives you back the same model — the errors are perfectly correlated, so nothing cancels. Ensembles need decorrelated errors to work, and that diversity is manufactured on purpose: bootstrap samples (each model sees different rows), random feature subsets (each model sees different columns), and different algorithms entirely (a tree and a linear model fail in different places). No diversity, no benefit.

Here is the payoff made concrete. Imagine a crowd of weak voters who are each only 65% accurate, modestly better than a coin flip, but who make independent mistakes. As the number of voters N grows, the majority-vote accuracy of the whole crowd climbs far past that of any single voter.
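
For independent voters the crowd's accuracy can be computed exactly from the binomial distribution. A short sketch, assuming an odd number of voters so that ties cannot happen (the function name is made up for this example):

```python
from math import comb

def majority_vote_accuracy(n_voters, p):
    """P(more than half of n independent voters are right), for odd n."""
    need = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p**k * (1 - p) ** (n_voters - k)
        for k in range(need, n_voters + 1)
    )

for n in (1, 3, 11, 51, 101):
    print(n, round(majority_vote_accuracy(n, 0.65), 3))
```

One voter is right 65% of the time by construction, three voters are already right about 71.8% of the time, and by 101 voters the majority is right more than 99% of the time.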

[Interactive chart: each voter marked as voted correctly or voted wrong, with the resulting ensemble accuracy]

The fine print: independence is everything

This climb only happens because the voters fail independently. If every voter copied the same mistakes, the crowd would stay stuck at 65% no matter how many you add — adding correlated voters buys you nothing.
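
A quick simulation shows the same thing. The correlation model below is an assumption made up for illustration: each voter copies one shared answer with probability copy_prob and otherwise decides independently, so every voter is still 65% accurate on their own.

```python
import random

def crowd_accuracy(n_voters, p, copy_prob, n_trials, rng):
    """Monte Carlo majority-vote accuracy under partial copying.

    Each voter copies a single shared answer with probability copy_prob,
    and otherwise answers independently; either way they are right with
    probability p.
    """
    wins = 0
    for _ in range(n_trials):
        shared_correct = rng.random() < p
        correct = sum(
            shared_correct if rng.random() < copy_prob else (rng.random() < p)
            for _ in range(n_voters)
        )
        wins += correct > n_voters / 2
    return wins / n_trials

rng = random.Random(42)
independent = crowd_accuracy(51, 0.65, copy_prob=0.0, n_trials=2000, rng=rng)
half_copied = crowd_accuracy(51, 0.65, copy_prob=0.5, n_trials=2000, rng=rng)
all_copied = crowd_accuracy(51, 0.65, copy_prob=1.0, n_trials=2000, rng=rng)
print(f"independent: {independent:.3f}  half copied: {half_copied:.3f}  all copied: {all_copied:.3f}")
```

With no copying, 51 voters are almost always right as a group. With full copying the crowd is just one voter repeated 51 times, and its accuracy sits near 65%.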

When it works — and when it doesn't

Works well when
  • You want the best possible accuracy on tabular data — ensembles routinely win here
  • You need a model that is robust and not thrown off by a few noisy points
  • Your base learners are genuinely different and make uncorrelated errors
  • A small accuracy gain is worth the extra compute

Struggles when
  • You need interpretability — a hundred trees is far harder to explain than one
  • Compute and memory are tight — training and serving many models costs more
  • The base learners are identical or too correlated — averaging adds nothing
  • A single simple model is already good enough, so the extra machinery is wasted

The practitioner's default

On messy real-world tabular problems, a tree ensemble — a Random Forest to start, an XGBoost model to squeeze out the last few points — is very often the strongest baseline you can reach for. Ensembles are why "just throw gradient-boosted trees at it" is such common advice for structured data.