Ensemble Learning
What it is
Ensemble learning is the idea that a committee of mediocre models can beat any single expert — you train many imperfect models and combine their answers into one.
There is a famous county-fair story: a crowd was asked to guess the weight of an ox. As the story is usually told, no single guess was exactly right, but the average of all the guesses landed within a pound of the true weight, closer than the individual guessers managed, cattle experts included. Ensembles work the same way. One model overshoots, another undershoots, a third gets confused on a different set of cases, but pool them and the errors largely cancel, leaving an answer that is steadier and usually more accurate than any single contributor.
It is the same instinct as asking several doctors for a second and third opinion before a big decision. Each may miss something, but it is unlikely they all miss the same thing.
Many imperfect models that make different mistakes can be averaged into one model that is better than any single one of them.
The magic is not "more models" — it is diverse, decorrelated errors. If each model is wrong in its own way, the wrong answers point in different directions and wash out in the average, while the correct signal they all share reinforces. Combine models that fail independently and the combination is more reliable than its parts.
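The claim about decorrelated errors can be checked with a small simulation. This is a minimal sketch under stated assumptions: each "model" is just the true value plus Gaussian noise, and `shared_fraction` is a hypothetical knob (not from the text above) for how much of that noise is common to every model.

```python
import random
import statistics

def error_of_average(n_models, shared_fraction, trials=2000, seed=0):
    """Mean absolute error of the average of n_models noisy estimates.

    shared_fraction (illustrative assumption) is the share of each model's
    error variance that is common to all models:
    0.0 = fully independent errors, 1.0 = identical errors.
    """
    rng = random.Random(seed)
    true_value = 100.0
    noise_scale = 10.0
    abs_errors = []
    for _ in range(trials):
        shared = rng.gauss(0, 1)      # mistake every model makes together
        estimates = []
        for _ in range(n_models):
            own = rng.gauss(0, 1)     # mistake unique to this model
            noise = shared_fraction ** 0.5 * shared + (1 - shared_fraction) ** 0.5 * own
            estimates.append(true_value + noise_scale * noise)
        abs_errors.append(abs(statistics.fmean(estimates) - true_value))
    return statistics.fmean(abs_errors)

single = error_of_average(1, shared_fraction=0.0)
independent = error_of_average(25, shared_fraction=0.0)
identical = error_of_average(25, shared_fraction=1.0)

print(f"one model:                  {single:.2f}")
print(f"25 models, independent:     {independent:.2f}")
print(f"25 models, identical errors: {identical:.2f}")
```

With independent errors the average of 25 models should land several times closer to the truth than one model; with identical errors it should be no better than one model, because there is nothing to cancel.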
The three families
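Almost every ensemble method is a variation on one of three strategies for building the committee and tallying its votes: bagging, boosting, and stacking. Plain voting, the simplest way to combine models, is the baseline they all build on.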
Bagging: Train many copies of the same model in parallel, each on a random bootstrap sample of the data (rows drawn with replacement), then average their predictions or take a majority vote. Because each copy sees different rows, their errors are partly decorrelated, and averaging smooths out the noise. Random Forest is the canonical example.
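Bootstrap sampling plus averaging can be sketched in a few lines. The example below is an illustration, not a reference implementation: it assumes a one-dimensional regression task with a noisy sine curve, and it uses 1-nearest-neighbor regression as a deliberately high-variance base learner. All function names are hypothetical.

```python
import math
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_nearest_neighbor(rows):
    """High-variance base learner: predict the y of the closest training x."""
    def predict(x):
        return min(rows, key=lambda row: abs(row[0] - x))[1]
    return predict

def fit_bagged(rows, n_models, rng):
    """Train n_models base learners, each on its own bootstrap sample."""
    models = [fit_nearest_neighbor(bootstrap_sample(rows, rng)) for _ in range(n_models)]
    def predict(x):
        return sum(model(x) for model in models) / n_models
    return predict

def mse(predict, xs):
    """Mean squared error against the noise-free target sin(x)."""
    return sum((predict(x) - math.sin(x)) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
train = []
for _ in range(200):
    x = rng.uniform(0, 10)
    train.append((x, math.sin(x) + rng.gauss(0, 0.5)))  # noisy observations
test_xs = [rng.uniform(0, 10) for _ in range(200)]

single_mse = mse(fit_nearest_neighbor(train), test_xs)
bagged_mse = mse(fit_bagged(train, 25, rng), test_xs)
print(f"single model MSE: {single_mse:.3f}")
print(f"bagged model MSE: {bagged_mse:.3f}")
```

The single model copies the noise of whichever training point is nearest. The bagged version spreads its prediction over several nearby points, so its error against the true curve should come out lower.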
Boosting: Train models one after another, with each new model focusing on the mistakes the previous ones made. The committee is built up gradually, chipping away at the error. AdaBoost and XGBoost are the best-known examples.
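To make "focusing on the mistakes" concrete, here is a compact AdaBoost-style sketch using single-split rules (decision stumps) as weak learners. It is a teaching sketch on a hypothetical ten-point dataset, not a production implementation; labels are assumed to be +1 or -1 and the function names are illustrative.

```python
import math

def fit_stump(points, labels, weights):
    """Best single-split rule under the current point weights.

    Returns (weighted_error, feature, threshold, sign); the rule predicts
    `sign` when point[feature] <= threshold and `-sign` otherwise.
    """
    best = None
    for feature in range(len(points[0])):
        for threshold in sorted({p[feature] for p in points}):
            for sign in (1, -1):
                preds = [sign if p[feature] <= threshold else -sign for p in points]
                err = sum(w for w, y, yhat in zip(weights, labels, preds) if y != yhat)
                if best is None or err < best[0]:
                    best = (err, feature, threshold, sign)
    return best

def stump_predict(stump, point):
    _, feature, threshold, sign = stump
    return sign if point[feature] <= threshold else -sign

def adaboost(points, labels, rounds):
    n = len(points)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, stump)
    for _ in range(rounds):
        stump = fit_stump(points, labels, weights)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # better stumps get a bigger say
        ensemble.append((alpha, stump))
        # Up-weight the points this stump got wrong, down-weight the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, p))
                   for w, y, p in zip(weights, labels, points)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, point):
    score = sum(alpha * stump_predict(stump, point) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

def count_correct(ensemble, points, labels):
    return sum(predict(ensemble, p) == y for p, y in zip(points, labels))

# Positive class sits in the middle, so no single split can separate it.
points = [(float(x),) for x in range(10)]
labels = [-1, -1, -1, 1, 1, 1, 1, -1, -1, -1]

one_round = count_correct(adaboost(points, labels, rounds=1), points, labels)
three_rounds = count_correct(adaboost(points, labels, rounds=3), points, labels)
print(one_round, "of 10 correct with one stump")
print(three_rounds, "of 10 correct with three boosted stumps")
```

On this toy data a single stump cannot get more than 7 of the 10 points right, while three reweighted stumps voting together classify all 10, because each later stump is fitted to the points its predecessors missed.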
Stacking: Train several different base models (say a tree, a linear model, and a k-nearest-neighbors model), then train a small meta-model that learns how best to combine their predictions instead of using a fixed average.
Voting: The no-frills baseline. Run a handful of models and take the majority vote (classification) or the mean (regression). Surprisingly effective when the models are genuinely different.
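The difference between a fixed vote and a learned combination can be shown with made-up numbers. In this sketch the "meta-model" is the simplest one imaginable, a single least-squares blend weight between two base models. The prediction lists are hypothetical, and in practice the weight would be fitted on held-out predictions rather than on the training data.

```python
from collections import Counter

def majority_vote(labels):
    """Hard voting: the most common class label among the models' predictions."""
    return Counter(labels).most_common(1)[0][0]

def mean_vote(values):
    """Averaging for regression: the plain mean of the models' predictions."""
    return sum(values) / len(values)

def fit_blend_weight(preds_a, preds_b, targets):
    """One-parameter stand-in for a meta-model.

    Returns the w that minimizes the squared error of w * a + (1 - w) * b.
    """
    numerator = sum((y - b) * (a - b) for a, b, y in zip(preds_a, preds_b, targets))
    denominator = sum((a - b) ** 2 for a, b in zip(preds_a, preds_b))
    return numerator / denominator

print(majority_vote(["spam", "ham", "spam"]))   # spam

targets = [1.0, 2.0, 3.0, 4.0, 5.0]
preds_a = [2.0, 3.0, 4.0, 5.0, 6.0]     # hypothetical model A: always 1 too high
preds_b = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical model B: always 3 too low

w = fit_blend_weight(preds_a, preds_b, targets)
fixed_average = [mean_vote([a, b]) for a, b in zip(preds_a, preds_b)]
stacked = [w * a + (1 - w) * b for a, b in zip(preds_a, preds_b)]

print("learned weight on A:", w)          # 0.75
print("fixed average:", fixed_average)    # every value is 1 too low
print("learned blend:", stacked)          # matches the targets
```

The fixed average treats both models as equally trustworthy and inherits a bias. The learned weight leans toward the model with the smaller error, which is the core idea a real stacking meta-model generalizes.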
Bagging vs Boosting
Bagging and boosting are the two workhorses, and they pull in opposite directions. Bagging builds independent models in parallel to tame variance; boosting builds dependent models in sequence to tame bias.
Bagging:
- Parallel: every model trains independently, so all of them can train at the same time
- Each model sees a different bootstrap sample of the data
- Mainly reduces variance — averages away the noise of any one model
- Hard to overfit — adding more trees rarely hurts
Boosting:
- Sequential: each model is trained after, and depends on, the ones before it
- Each model is steered toward the previous mistakes
- Mainly reduces bias — turns weak learners into a strong one
- Can overfit if you run too many rounds without regularization
The two flagship algorithms make the difference concrete. Random Forest is bagging applied to decision trees: hundreds of trees, each grown on a bootstrap sample and restricted to a random subset of features at every split, all voting together. XGBoost is gradient boosting taken to its engineering limit: trees are added one at a time, each correcting the residual errors of the ensemble so far. Both are decision-tree ensembles, but the recipe for building the committee could hardly be more different.
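The "correct the residual errors of the ensemble so far" loop can be sketched with plain gradient boosting on squared error. This is not XGBoost's actual algorithm, which adds regularization and many engineering refinements; it is a minimal illustration with one-split regression stumps, a hypothetical toy dataset, and an assumed learning rate of 0.5.

```python
def fit_stump(xs, residuals):
    """One-split regression stump: (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # keep at least one point on each side
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def fit_boosted(xs, ys, rounds, learning_rate=0.5):
    base = sum(ys) / len(ys)          # round zero: predict the mean everywhere
    current = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        # Each new stump is trained on what the ensemble still gets wrong.
        residuals = [y - c for y, c in zip(ys, current)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        current = [c + learning_rate * (left_mean if x <= threshold else right_mean)
                   for x, c in zip(xs, current)]
    return base, stumps, learning_rate

def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(
        left_mean if x <= threshold else right_mean
        for threshold, left_mean, right_mean in stumps)

def training_mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 2 for i in range(20)]
ys = [x * x for x in xs]              # a curve no single split can fit

mse_by_rounds = {rounds: training_mse(fit_boosted(xs, ys, rounds), xs, ys)
                 for rounds in (0, 1, 10, 50)}
for rounds, value in mse_by_rounds.items():
    print(f"{rounds:>2} rounds: training MSE {value:.4f}")
```

Training error should fall with every added stump. Note that this measures fit to the training data only; on noisy data, held-out error is what tells you when to stop adding rounds.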
Watch the crowd beat the expert
The animation starts with two classes scattered on a plane. It then trains a few weak learners one at a time — each is a crude single-split boundary that gets a chunk of points wrong. No single one is any good. The final step overlays the combined boundary the ensemble votes for, and you can see it curve around the data in a way none of the individual splits could — with a lower error than any single learner.
Why diversity matters
This is the single most important idea, and the one beginners most often miss. An ensemble is only as good as the disagreement between its members.
If every model in the committee is the same and makes the same mistakes, averaging them just gives you back the same model — the errors are perfectly correlated, so nothing cancels. Ensembles need decorrelated errors to work, and that diversity is manufactured on purpose: bootstrap samples (each model sees different rows), random feature subsets (each model sees different columns), and different algorithms entirely (a tree and a linear model fail in different places). No diversity, no benefit.
Here is the payoff made concrete. Imagine a crowd of weak voters who are each only 65% accurate, modestly better than a coin flip, but who make independent mistakes. As the number of voters N grows, the majority-vote accuracy of the whole crowd climbs far past that of any single voter.
This climb only happens because the voters fail independently. If every voter copied the same mistakes, the crowd would stay stuck at 65% no matter how many you add — adding correlated voters buys you nothing.
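For independent voters the crowd's accuracy can be computed exactly: it is the probability that more than half of N voters are right, which is a binomial tail sum. The sketch below assumes an odd N (so ties cannot occur) and fully independent voters; the function name is illustrative.

```python
from math import comb

def majority_vote_accuracy(n_voters, p_correct):
    """Probability that more than half of n independent voters are correct.

    Assumes n_voters is odd and every voter is right with probability
    p_correct, independently of the others.
    """
    needed = n_voters // 2 + 1
    return sum(comb(n_voters, k) * p_correct ** k * (1 - p_correct) ** (n_voters - k)
               for k in range(needed, n_voters + 1))

for n in (1, 3, 11, 51, 101):
    print(f"N = {n:>3}: majority-vote accuracy {majority_vote_accuracy(n, 0.65):.4f}")
```

With one voter the result is 0.65 by definition. It rises steadily with N and passes 0.99 by N = 101. The formula has no counterpart for perfectly correlated voters because none is needed: if everyone copies the same mistakes, the majority is right exactly when one voter is, so accuracy stays at 0.65.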
When it works — and when it doesn't
Ensembles are a good fit when:
- You want the best possible accuracy on tabular data, where ensembles routinely win
- You need a model that is robust and not thrown off by a few noisy points
- Your base learners are genuinely different and make uncorrelated errors
- A small accuracy gain is worth the extra compute
They are a poor fit when:
- You need interpretability: a hundred trees is far harder to explain than one
- Compute and memory are tight — training and serving many models costs more
- The base learners are identical or too correlated — averaging adds nothing
- A single simple model is already good enough — the extra machinery is wasted
On messy real-world tabular problems, a tree ensemble — a Random Forest to start, an XGBoost model to squeeze out the last few points — is very often the strongest baseline you can reach for. Ensembles are why "just throw gradient-boosted trees at it" is such common advice for structured data.