Random Forest · Suman Bhadra Notes

What it is

A Random Forest is a crowd of decision trees: each tree is trained on a random slice of the data and a random subset of features, and then they all vote on the final answer.

Imagine you have a hard question and instead of asking one expert, you ask a thousand reasonably-informed people who each saw a slightly different part of the picture. Any single person might be wrong, but their individual mistakes tend to point in different directions and cancel out. The crowd's majority is usually far more reliable than any one voice. A Random Forest works the same way — one decision tree is a single opinionated expert that overfits, but a forest of them, each grown on different data, becomes a steady and accurate predictor.

In one sentence

Train many decision trees on random subsets of the data and features, then average their votes to get one robust prediction.

If you have already met the decision tree, a Random Forest is exactly what it sounds like: a whole forest of them. It is the best-known example of the ensemble technique called bagging (bootstrap aggregating), applied to decision trees, with one extra twist: each split considers only a random handful of features.

Base learner: a decision tree

Each member of the forest is a full, deep decision tree — high accuracy on its own data, but high variance.

Bagging: train on random samples

Every tree is grown on its own bootstrap sample of the rows, so no two trees see exactly the same data.

Aggregation: vote or average

Classification takes the majority class across trees. Regression averages their numeric predictions.
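As a minimal sketch of the aggregation step, assuming the per-tree predictions are already in hand: the arrays below are made-up numbers, and `majority_vote` is a hypothetical helper written for this example, not a library function.

```python
import numpy as np

# Hypothetical class predictions from 5 trees (rows) for 4 inputs (columns).
class_votes = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 1, 1, 1],
])

def majority_vote(votes: np.ndarray) -> np.ndarray:
    """Return the most common class label in each column."""
    n_classes = votes.max() + 1
    # Count the votes for every class per column, then pick the largest count.
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)

forest_class = majority_vote(class_votes)

# Regression: each tree outputs a number and the forest reports the mean.
tree_values = np.array([
    [2.0, 10.0],
    [3.0, 12.0],
    [4.0, 11.0],
])
forest_value = tree_values.mean(axis=0)

print(forest_class)  # [0 1 1 0]
print(forest_value)  # [ 3. 11.]
```

Library implementations may differ in detail; scikit-learn's classifier, for instance, averages the trees' predicted class probabilities rather than counting hard votes.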

Why a crowd beats one expert

A single deep tree memorizes its training data and swings wildly when the data changes — it has high variance. Averaging many such trees keeps their shared signal and cancels their independent errors, which is why a forest is so much steadier than any tree inside it.

Where it's used

Random Forest is the go-to first model for almost any table of numbers and categories. It is fast to train, needs little tuning, and rarely embarrasses you.

Fraud detection: spotting bad transactions

Flags suspicious payments by learning the messy, non-linear patterns that separate fraud from normal activity.

Feature importance: which inputs matter

Ranks features by how much they reduce error across the forest — a quick read on what drives the target. Deeper tools: SHAP & LIME.

Medical diagnosis: risk from many signals

Combines lab values, symptoms, and history to estimate the likelihood of a condition, robust to noisy inputs.

Default risk: tabular Kaggle baseline

Predicts loan default and is the standard strong baseline on tabular competitions before reaching for boosting.
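To make the feature-importance use case concrete, here is a rough sketch with scikit-learn on synthetic data. The dataset (3 informative columns followed by 5 noise columns) is an assumption made up for the example; `feature_importances_` is scikit-learn's impurity-based ranking, which is only one of several ways to measure importance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 3 informative features followed by 5 pure-noise features
# (shuffle=False keeps the informative columns first).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances, normalised so they sum to 1.
importances = forest.feature_importances_
ranking = importances.argsort()[::-1]  # most important first
for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

On data like this the first three columns should collect most of the importance, which is the "quick read" described above.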

Two sources of randomness

The word "Random" in Random Forest is doing real work. Each tree is deliberately given an incomplete, randomized view of the problem — and that is the secret to the whole method. There are two independent dice rolls.

(a) Bootstrap sampling: random rows

Each tree is trained on a sample drawn with replacement from the rows, the same size as the original data. About 63% of the original rows appear in it, some more than once, and the remaining 37% or so are left out. Every tree sees a different dataset.
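The 63% figure is easy to check. When n rows are drawn with replacement, a given row is missed with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows, so about 1 - 1/e ≈ 0.632 of the rows show up at least once. A small NumPy sketch (the row count of 10,000 is an arbitrary choice for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10_000

# One bootstrap sample: draw n_rows row indices with replacement.
sample = rng.integers(0, n_rows, size=n_rows)

# Fraction of the original rows that appear at least once in the sample.
unique_fraction = np.unique(sample).size / n_rows
print(f"unique rows in the sample: {unique_fraction:.3f}")
print(f"theoretical limit 1 - 1/e: {1 - np.exp(-1):.3f}")
```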

(b) Feature subsampling: random columns

At every split, a tree may only consider a random subset of the features (often √p of the p columns). This stops every tree from leaning on the same dominant feature.
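A small sketch of what "a random subset of the features" looks like in practice, assuming p = 16 columns and the √p rule. The helper `candidate_features` is hypothetical and only illustrates the draw; in scikit-learn the same behaviour is requested with `max_features="sqrt"`.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
p = 16                        # total number of feature columns
k = max(1, math.isqrt(p))     # candidates per split under the sqrt(p) rule

def candidate_features(rng, p, k):
    """Pick k distinct feature indices to evaluate at a single split."""
    return np.sort(rng.choice(p, size=k, replace=False))

# A fresh subset is drawn at every split, not once per tree.
splits = [candidate_features(rng, p, k) for _ in range(3)]
for i, feats in enumerate(splits, start=1):
    print(f"split {i}: consider features {feats.tolist()}")
```

Because the draw is repeated at each split, a dominant feature is simply unavailable at many splits, and the tree has to find the next-best question.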

Why randomness helps

Both tricks decorrelate the trees. If every tree saw the same data and the same features, they would all make the same mistakes, and averaging would change nothing. By forcing each tree to look elsewhere, their errors become much less correlated (though never perfectly independent), so when you average, most of the errors cancel while the true signal survives.
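A toy simulation shows why correlation matters. It does not train any trees: it assumes each tree's error is a unit-variance Gaussian made of a part shared by all trees and a part of its own, mixed so that every pair of errors has correlation `rho`. All names and numbers here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trees, n_trials = 100, 20_000

def variance_of_average(rho: float) -> float:
    """Empirical variance of the mean of n_trees errors with pairwise correlation rho."""
    shared = rng.standard_normal((n_trials, 1))       # mistake every tree makes together
    own = rng.standard_normal((n_trials, n_trees))    # mistake unique to each tree
    errors = np.sqrt(rho) * shared + np.sqrt(1 - rho) * own
    return errors.mean(axis=1).var()

for rho in (1.0, 0.5, 0.0):
    v = variance_of_average(rho)
    print(f"correlation {rho:.1f}: variance of the average = {v:.3f}")
```

With fully correlated errors the average is as noisy as a single tree; with uncorrelated errors the variance falls to roughly 1/100 of that. Real forests sit somewhere in between.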

How it works (the recipe)

Building a Random Forest is a loop: make a random dataset, grow a tree on it, and repeat. Prediction is just collecting and combining everyone's answer.

Step 1: bootstrap

Draw a random sample-with-replacement of the training rows. This is the data for one tree.

Step 2: grow a tree

Grow a deep decision tree on that sample. At each split, choose the best split from a random subset of the features.

Step 3: repeat × N

Repeat Steps 1–2 to build N trees (often hundreds). Each gets its own bootstrap sample and its own random feature subsets at every split.

Step 4: aggregate

To predict, run the input through every tree. Take the majority vote for classification or the average for regression.
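The four steps can be sketched by hand on top of scikit-learn's `DecisionTreeClassifier`. This is a teaching sketch under stated assumptions (synthetic binary data, 51 trees so the vote cannot tie), not how a production library implements it; in practice you would reach for `RandomForestClassifier` directly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 51  # odd, so a binary majority vote cannot tie
trees = []
for _ in range(n_trees):  # Step 3: repeat N times
    # Step 1: bootstrap sample of the training rows (with replacement).
    rows = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: deep tree; max_features="sqrt" draws a random feature subset at each split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X_train[rows], y_train[rows]))

# Step 4: every tree votes; the majority class wins (labels are 0/1 here).
votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

forest_acc = (forest_pred == y_test).mean()
single_acc = (trees[0].predict(X_test) == y_test).mean()
print(f"single tree: {single_acc:.3f}   forest of {n_trees}: {forest_acc:.3f}")
```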

Out-of-bag (OOB) error — free validation

Because each tree is trained on only ~63% of the rows, the remaining ~37% are "out-of-bag" for that tree. You can test every row against just the trees that never saw it, giving an honest estimate of generalization error without setting aside a separate validation set.
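In scikit-learn this estimate is available by passing `oob_score=True`. The sketch below, on synthetic data chosen for illustration, compares it with accuracy on a held-out split to show that the two tend to land close together.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# oob_score=True scores each training row using only the trees that did not sample it.
forest = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

oob_acc = forest.oob_score_              # accuracy estimated from out-of-bag rows
test_acc = forest.score(X_test, y_test)  # accuracy on rows no tree ever saw
print(f"OOB accuracy:      {oob_acc:.3f}")
print(f"held-out accuracy: {test_acc:.3f}")
```

The held-out split is here only for comparison; the appeal of OOB is that you could skip it and train on all the rows.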

Watch the forest vote

The animation starts with a labelled cloud (blue Class A and red Class B), then grows three trees — each on its own bootstrap sample, so each draws a slightly different blocky, axis-aligned boundary and mis-labels a point or two. A new query point is then dropped in; each tree casts a vote, the majority wins, and the final frame shows the smooth aggregated boundary the whole forest agrees on.

A forest of little trees

The boundary view above shows where each tree splits the plane. Here is the same idea drawn as the trees themselves: three small decision trees, each grown on a different bootstrap sample so they ask different questions. Watch one query example fall down all three, reach a leaf in each, and have its votes tallied into a majority.

Single tree vs forest

The whole point of the forest is what happens when you go from one tree to many. A lone tree and a forest are built from the same ingredient but behave very differently.

A single deep tree

Jagged boundary — carves the space into sharp axis-aligned boxes around individual points
Overfits — memorizes noise and quirks of the training set
High variance — a small change in the data reshapes the whole tree
Easy to read, but a fragile predictor on new data

A forest of trees

Smooth boundary — averaging hundreds of jagged trees rounds off the edges
Robust — independent errors cancel, so noise has little effect
Lower variance — the prediction barely moves when the data is perturbed
Harder to inspect, but far more accurate out of the box
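The "high variance vs lower variance" contrast can be measured directly. The sketch below refits a single tree and a forest on 20 resampled versions of the same dataset and checks how much their predicted probabilities move at fixed query points. The dataset (`make_moons`), the number of repeats, and the helper `prediction_spread` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_query, _ = make_moons(n_samples=200, noise=0.3, random_state=1)

rng = np.random.default_rng(0)

def prediction_spread(make_model, n_repeats=20):
    """Refit on perturbed training sets; return the mean std of P(class 1) at the query points."""
    probs = []
    for seed in range(n_repeats):
        rows = rng.integers(0, len(X), size=len(X))  # perturb the data by resampling rows
        model = make_model(seed).fit(X[rows], y[rows])
        probs.append(model.predict_proba(X_query)[:, 1])
    return np.std(probs, axis=0).mean()

tree_spread = prediction_spread(lambda s: DecisionTreeClassifier(random_state=s))
forest_spread = prediction_spread(
    lambda s: RandomForestClassifier(n_estimators=100, random_state=s)
)
print(f"single tree spread: {tree_spread:.3f}")
print(f"forest spread:      {forest_spread:.3f}")
```

A smaller spread means the prediction barely moves when the training data is perturbed, which is what lower variance means here.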

Variance reduction in one line

Averaging many noisy-but-unbiased predictors keeps the bias the same while shrinking the variance. That is the entire mathematical reason a forest beats its trees — and why decorrelating them (the two random dice rolls) makes the shrinkage even stronger.
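One way to make this concrete, under the simplifying assumption that every tree's prediction $T_i(x)$ has the same variance $\sigma^2$ and every pair of trees has the same correlation $\rho$ (these symbols are introduced here for illustration and do not appear elsewhere in these notes):

```latex
\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} T_i(x)\right)
  = \frac{1}{N^{2}}\Bigl[N\sigma^{2} + N(N-1)\,\rho\,\sigma^{2}\Bigr]
  = \rho\,\sigma^{2} + \frac{1-\rho}{N}\,\sigma^{2}
```

The second term shrinks toward zero as N grows, but the first term stays put no matter how many trees you add. Only lowering the correlation $\rho$ can reduce it, which is the job of the two sources of randomness.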

See it for yourself. Slide the number of trees N from 1 to 25 and watch the shaded decision region. With one tree the boundary is a single hard, blocky edge (overfit); as you add trees — each with its own slightly-shifted split — their votes blend into a soft, stable boundary. The fixed orange query point keeps a live vote tally so you can watch its prediction settle as the forest grows.


When it works — and when it doesn't

Works well when

You need a strong tabular default with little tuning
Features are mixed — numeric and categorical, on different scales
You want a model that resists overfitting out of the box
You need a quick read on feature importance

Struggles when

Inference must be fast or small — many trees cost memory and latency
You need a model a human can fully interpret — one tree is far clearer
Data is high-dimensional and sparse (such as bag-of-words text): linear models or gradient boosting often win
You need to extrapolate beyond the training range: a regression forest can only predict values inside the range of targets it saw in training
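The extrapolation limit is easy to see on a toy problem. The sketch below assumes a noiseless line y = 2x observed only for x between 0 and 10, then asks a regression forest about x = 20.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train on a clean linear trend y = 2x for x in [0, 10].
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * X_train.ravel()

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Inside the range the forest tracks the line; outside it flattens out,
# because every leaf predicts an average of training targets.
inside = forest.predict([[5.0]])[0]
outside = forest.predict([[20.0]])[0]  # the true value on the line would be 40
print(f"x=5  -> {inside:.2f}")
print(f"x=20 -> {outside:.2f} (largest training target: {y_train.max():.2f})")
```

The prediction at x = 20 cannot exceed the largest target seen in training, so the forest reports a value near 20 instead of 40.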

Forest vs boosting

Random Forest grows trees in parallel and independently, then averages — it reduces variance. Gradient boosting grows trees sequentially, each fixing the previous one's mistakes — it reduces bias. Forests are the safer, lower-tuning default; boosting usually squeezes out higher accuracy when you invest in tuning.
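A side-by-side sketch with scikit-learn, using synthetic data and untuned settings chosen only for illustration. Which model scores higher depends on the dataset and the tuning, so the printed numbers should not be read as a general result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=0)

models = {
    # Trees are built independently of each other, then averaged.
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    # Trees are built one after another, each fitted to the current ensemble's errors.
    "gradient boosting": GradientBoostingClassifier(n_estimators=200, random_state=0),
}

# 5-fold cross-validated accuracy for each model.
scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Because the forest's trees do not depend on each other, scikit-learn can fit them in parallel through the `n_jobs` parameter; boosting's trees have to be fitted in order.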