Random Forest
What it is
A Random Forest is a crowd of decision trees: each tree is trained on a random sample of the rows and considers only a random subset of the features at each split, and then they all vote on the final answer.
Imagine you have a hard question and instead of asking one expert, you ask a thousand reasonably-informed people who each saw a slightly different part of the picture. Any single person might be wrong, but their individual mistakes tend to point in different directions and cancel out. The crowd's majority is usually far more reliable than any one voice. A Random Forest works the same way — one decision tree is a single opinionated expert that overfits, but a forest of them, each grown on different data, becomes a steady and accurate predictor.
Train many decision trees on random subsets of the data and features, then average their votes to get one robust prediction.
If you have already met the decision tree, a Random Forest is exactly what it sounds like: a whole forest of them. It is the most popular use of the ensemble technique known as bagging (bootstrap aggregating), here applied to decision trees, with one extra twist: each split also looks at only a random handful of features.
Each member of the forest is a full, deep decision tree — high accuracy on its own data, but high variance.
Every tree is grown on its own bootstrap sample of the rows, so no two trees see exactly the same data.
Classification takes the majority class across trees. Regression averages their numeric predictions.
A single deep tree memorizes its training data and swings wildly when the data changes — it has high variance. Averaging many such trees keeps their shared signal and cancels their independent errors, which is why a forest is so much steadier than any tree inside it.
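The voting and averaging rule described above fits in a few lines. This is a minimal sketch using only the standard library; the function names `forest_classify` and `forest_regress` and the sample votes are made up for illustration.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_votes):
    """Return the class predicted by the most trees."""
    # most_common(1) gives [(label, count)]; ties go to the label seen first.
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_predictions):
    """Return the plain average of the trees' numeric predictions."""
    return mean(tree_predictions)

votes = ["A", "B", "A", "A", "B"]      # hypothetical votes from five trees
estimates = [10.0, 12.0, 11.0, 15.0]   # hypothetical regression outputs

print(forest_classify(votes))     # A
print(forest_regress(estimates))  # 12.0
```

Real libraries may combine trees slightly differently. scikit-learn's `RandomForestClassifier`, for example, averages each tree's predicted class probabilities instead of counting hard votes.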
Where it's used
Random Forest is the go-to first model for almost any table of numbers and categories. It is fast to train, needs little tuning, and rarely embarrasses you.
- Flags suspicious payments by learning the messy, non-linear patterns that separate fraud from normal activity.
- Ranks features by how much they reduce error across the forest, which gives a quick read on what drives the target. For deeper explanations, see SHAP and LIME.
- Combines lab values, symptoms, and history to estimate the likelihood of a condition, and stays robust to noisy inputs.
- Predicts loan default, and serves as the standard strong baseline in tabular competitions before reaching for boosting.
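The feature-ranking use mentioned above can be tried on synthetic data. The sketch below assumes scikit-learn is installed and uses its impurity-based `feature_importances_` attribute. The dataset is generated, not real, and is built so that only the first three columns carry signal.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the 3 informative columns first (indices 0-2);
# the remaining 5 columns are pure noise.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Importances are normalized to sum to 1; print them from largest to smallest.
importances = forest.feature_importances_
for index in importances.argsort()[::-1]:
    print(f"feature {index}: {importances[index]:.3f}")
```

The informative columns should take most of the total importance. Treat the ranking as a first look only, since impurity-based scores can be skewed toward features with many distinct values.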
Two sources of randomness
The word "Random" in Random Forest is doing real work. Each tree is deliberately given an incomplete, randomized view of the problem — and that is the secret to the whole method. There are two independent dice rolls.
Each tree is trained on a sample-with-replacement of the rows — about 63% of the unique rows appear, some more than once, and the rest are left out. Every tree sees a different dataset.
At every split, a tree may only consider a random subset of the features (for classification, often about √p of the p columns). This stops every tree from leaning on the same dominant feature.
Both tricks decorrelate the trees. If every tree saw the same data and the same features, they would all make the same mistakes, and averaging would change nothing. Forcing each tree to look elsewhere makes their errors far less correlated, so when you average, much of the error cancels while the true signal survives.
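Both dice rolls are easy to simulate. The sketch below uses only the standard library, and the row and feature counts are arbitrary choices for illustration. It also shows where the 63% figure comes from: a given row is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows.

```python
import math
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
n_rows, n_features = 10_000, 16

# Dice roll 1: bootstrap sample, n_rows draws with replacement.
bootstrap = [rng.randrange(n_rows) for _ in range(n_rows)]
unique_fraction = len(set(bootstrap)) / n_rows

# Dice roll 2: a random subset of about sqrt(p) features for one split.
k = max(1, round(math.sqrt(n_features)))
split_features = rng.sample(range(n_features), k)

print(f"unique rows in the sample: {unique_fraction:.3f}")
print(f"theoretical limit 1 - 1/e: {1 - 1 / math.e:.3f}")
print(f"features considered at this split: {sorted(split_features)}")
```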
How it works (the recipe)
Building a Random Forest is a loop: make a random dataset, grow a tree on it, and repeat. Prediction is just collecting and combining everyone's answer.
1. Draw a random sample-with-replacement of the training rows. This is the data for one tree.
2. Grow a deep decision tree on that sample. At each split, choose the best split from a random subset of the features.
3. Repeat Steps 1–2 to build N trees (often hundreds). Each gets its own bootstrap sample and its own random splits.
4. To predict, run the input through every tree. Take the majority vote for classification or the average for regression.
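The four steps above can be sketched by hand. This assumes NumPy and scikit-learn are available and borrows scikit-learn's `DecisionTreeClassifier` for the individual trees. The helper names `fit_forest` and `predict_forest` and the synthetic dataset are illustrative choices, not a reference implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=50, seed=0):
    """Steps 1-3: grow n_trees deep trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        # Step 1: draw row indices with replacement.
        rows = rng.integers(0, len(X), size=len(X))
        # Step 2: grow an unpruned tree; max_features="sqrt" makes every
        # split choose from a random subset of about sqrt(p) features.
        tree = DecisionTreeClassifier(
            max_features="sqrt", random_state=int(rng.integers(1 << 31))
        )
        trees.append(tree.fit(X[rows], y[rows]))
    return trees

def predict_forest(trees, X):
    """Step 4: collect every tree's vote and return the majority class."""
    votes = np.stack([tree.predict(X) for tree in trees]).astype(int)
    # Each column of votes holds all trees' labels for one input row.
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Synthetic binary problem (labels 0 and 1), split into train and test rows.
X, y = make_classification(
    n_samples=600, n_features=16, n_informative=5, random_state=0
)
X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]

trees = fit_forest(X_train, y_train)
accuracy = (predict_forest(trees, X_test) == y_test).mean()
print(f"held-out accuracy of the hand-rolled forest: {accuracy:.2f}")
```

In practice you would use `sklearn.ensemble.RandomForestClassifier`, which runs the same loop with parallel training and other refinements. The majority vote here relies on the labels being non-negative integers.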
Because each tree is trained on only ~63% of the rows, the remaining ~37% are "out-of-bag" for that tree. You can test every row against just the trees that never saw it, giving an honest estimate of generalization error without setting aside a separate validation set.
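scikit-learn exposes this out-of-bag estimate directly, assuming the library is installed. The sketch below sets `oob_score=True` and compares the result with accuracy on a separate held-out split. The synthetic dataset and split sizes are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# oob_score=True scores each training row using only the trees
# whose bootstrap sample did not contain that row.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_
test_accuracy = forest.score(X_test, y_test)
print(f"out-of-bag accuracy: {oob_accuracy:.3f}")
print(f"held-out accuracy:   {test_accuracy:.3f}")
```

The two numbers should land close together, which is what makes the out-of-bag score a convenient stand-in when data is too scarce to set aside a validation set.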
Watch the forest vote
The animation starts with a labelled cloud (blue Class A and red Class B), then grows three trees — each on its own bootstrap sample, so each draws a slightly different blocky, axis-aligned boundary and mis-labels a few points. A new query point is then dropped in; each tree casts a vote, the majority wins, and the final frame shows the smooth aggregated boundary the whole forest agrees on.
A forest of little trees
The boundary view above shows where each tree splits the plane. Here is the same idea drawn as the trees themselves: three small decision trees, each grown on a different bootstrap sample so they ask different questions. Watch one query example fall down all three, reach a leaf in each, and have its votes tallied into a majority.
Single tree vs forest
The whole point of the forest is what happens when you go from one tree to many. A lone tree and a forest are built from the same ingredient but behave very differently.
Single tree:
- Jagged boundary: carves the space into sharp axis-aligned boxes around individual points
- Overfits: memorizes noise and quirks of the training set
- High variance: a small change in the data reshapes the whole tree
- Easy to read, but a fragile predictor on new data
Forest:
- Smooth boundary: averaging hundreds of jagged trees rounds off the edges
- Robust: independent errors cancel, so noise has little effect
- Lower variance: the prediction barely moves when the data is perturbed
- Harder to inspect, but far more accurate out of the box
Averaging many noisy-but-unbiased predictors keeps the bias the same while shrinking the variance. That is the entire mathematical reason a forest beats its trees — and why decorrelating them (the two random dice rolls) makes the shrinkage even stronger.
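The variance claim can be made precise with a standard result. The symbols are assumptions introduced here, not taken from the text above: suppose the N trees' predictions T_i(x) are identically distributed, each with variance σ², and any two of them have correlation ρ. Then the variance of their average is:

```latex
\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} T_i(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{N}\,\sigma^2
```

Adding trees shrinks only the second term. The first term, ρσ², is a floor that no number of trees can remove, so lowering the correlation ρ through the two dice rolls is what lowers that floor.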
See it for yourself. Slide the number of trees N from 1 to 25 and watch the shaded decision region. With one tree the boundary is a single hard, blocky edge (overfit); as you add trees — each with its own slightly-shifted split — their votes blend into a soft, stable boundary. The fixed orange query point keeps a live vote tally so you can watch its prediction settle as the forest grows.
When it works — and when it doesn't
Works well when:
- You need a strong tabular default with little tuning
- Features are mixed: numeric and categorical, on different scales
- You want a model that resists overfitting out of the box
- You need a quick read on feature importance
Struggles when:
- Inference must be fast or small: many trees cost memory and latency
- You need a model a human can fully interpret: one tree is far clearer
- Data is high-dimensional and sparse (text): gradient boosting often wins
- You need to extrapolate beyond the training range: trees only predict within it
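The extrapolation limit in the last point is easy to demonstrate. The sketch below assumes NumPy and scikit-learn are installed and uses a made-up straight-line dataset. The forest is trained on x between 0 and 10, then asked about x = 20.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train on a straight line y = 2x, but only for x between 0 and 10.
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * X_train.ravel()

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

inside = forest.predict([[5.0]])[0]    # within the training range
outside = forest.predict([[20.0]])[0]  # far beyond it; the true value is 40

print(f"x = 5:  predicted {inside:.1f}")
print(f"x = 20: predicted {outside:.1f}, capped by max(y_train) = {y_train.max():.1f}")
```

Every tree predicts an average of training targets from one of its leaves, so the forest's output can never exceed the largest target it saw, however far the input moves.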
Random Forest grows trees in parallel and independently, then averages — it reduces variance. Gradient boosting grows trees sequentially, each fixing the previous one's mistakes — it reduces bias. Forests are the safer, lower-tuning default; boosting usually squeezes out higher accuracy when you invest in tuning.