Decision Trees

What it is

A decision tree is a flowchart of yes/no questions. You start at the top, answer one question at a time, and follow the matching branch down until you reach an answer.

It works like the game of 20 Questions. "Is it bigger than a breadbox? Does it have wings? Does it live in water?" Each answer rules out some of the possibilities and sends you down a narrower path, until only one sensible guess remains. A decision tree learns those questions automatically from data: it looks for the single question that best separates the examples, asks it, then repeats the process on each resulting group until every group is as pure (single-class) as it can reasonably get.

In one sentence

A decision tree repeatedly splits the data on the most informative question, building a flowchart whose leaves give the prediction.

Every tree is built from four kinds of part:

Root node: the first question

The top of the tree, where all the data enters. It holds the single best split over the entire dataset.

Internal node: a question / split

A test on one feature, such as "age > 30?". It sends each example to one of its child branches.

Branch: an answer path

The edge that carries examples from a question to the next node, typically one "yes" side and one "no" side.

Leaf: the answer

A terminal node with no more questions. It outputs the prediction: the majority class label for classification, or the average target value for regression.
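As a rough sketch, the four parts map onto a small data structure. The `Node` class, its field names, and the age/income thresholds below are all made up for illustration; they are not taken from any particular library.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union


@dataclass
class Node:
    """One node of a binary decision tree (hypothetical layout for illustration)."""
    feature: Optional[int] = None       # index of the feature tested (None on a leaf)
    threshold: Optional[float] = None   # the question is: x[feature] > threshold?
    yes: Optional["Node"] = None        # branch followed when the answer is yes
    no: Optional["Node"] = None         # branch followed when the answer is no
    prediction: Union[str, float, None] = None  # set only on leaves

    def is_leaf(self) -> bool:
        return self.feature is None


def predict(node: Node, x: Sequence[float]):
    """Walk from the root to a leaf, answering one question per internal node."""
    while not node.is_leaf():
        node = node.yes if x[node.feature] > node.threshold else node.no
    return node.prediction


# Root asks "age > 30?"; its yes-branch asks a second question about income.
# Features are ordered as [age, income]; the thresholds are invented.
root = Node(
    feature=0, threshold=30,
    yes=Node(feature=1, threshold=50_000,
             yes=Node(prediction="approve"),
             no=Node(prediction="deny")),
    no=Node(prediction="deny"),
)

print(predict(root, [42, 65_000]))  # approve
print(predict(root, [25, 90_000]))  # deny
```

The root and the income test are internal nodes, the `yes`/`no` references are the branches, and the three nodes that carry only a `prediction` are the leaves.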

Why people love them

A trained tree is just a stack of if/else rules you can read out loud. That transparency — knowing exactly why a prediction was made — is why trees are the go-to model when a decision has to be explained to a human.
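To make the "stack of if/else rules" concrete, here is a hand-written stand-in for a small trained tree. The function name, the features, and every threshold are hypothetical; a real tree would learn them from data.

```python
def credit_decision(income, debt_ratio, years_of_history):
    """Return (decision, rules_followed). All thresholds are invented for illustration."""
    if income <= 30_000:
        return "deny", ["income <= 30000"]
    if debt_ratio > 0.40:
        return "deny", ["income > 30000", "debt_ratio > 0.40"]
    if years_of_history <= 2:
        return "deny", ["income > 30000", "debt_ratio <= 0.40", "years_of_history <= 2"]
    return "approve", ["income > 30000", "debt_ratio <= 0.40", "years_of_history > 2"]


decision, rules = credit_decision(income=52_000, debt_ratio=0.55, years_of_history=6)
print(decision)              # deny
print(" AND ".join(rules))   # income > 30000 AND debt_ratio > 0.40
```

The list of rules is the explanation: it is exactly the path the example took from the root to its leaf.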

Where it's used

Anywhere a decision needs to be both accurate and explainable, or where a clear rule-of-thumb beats a black box.

Credit approval: "approve or deny"

Income, debt, and history split applicants into risk buckets, and the lender can show exactly which rule triggered a denial.

Medical triage: symptom flowcharts

A chain of yes/no checks routes a patient toward likely diagnoses, the same shape clinicians already reason with.

Churn prediction: "will they leave?"

Usage, tenure, and support tickets split customers into stay/leave groups so retention teams can act early.

Feature importance: what matters most

Features used for the top splits drive the most separation, giving a quick read on which inputs actually matter.

How it grows (the recipe)

A tree is built greedily, top-down. Starting from all the data at the root, it repeats the same four-step routine on every group it creates.

Step 1: pick the best split

Try every feature and every threshold. For each candidate, imagine cutting the group in two and score how good the cut is.

Step 2: measure purity

Score each candidate by how pure the two resulting groups are, using Gini impurity or entropy (information gain is the drop in entropy a split achieves). Keep the best.

Step 3: split and recurse

Apply the winning split, creating two child groups, then run the exact same procedure on each child.

Step 4: stop

Halt a branch when its group is pure, has too few examples (min_samples), or the tree hits max_depth. That node becomes a leaf.
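The four steps fit in a short recursive function. This is a minimal sketch for classification using Gini impurity, with midpoints between consecutive values as candidate thresholds (one common choice, assumed here). The function names, the nested-dict tree format, and the toy data are all invented for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(X, y):
    """Steps 1-2: try every feature and threshold, keep the lowest weighted Gini."""
    best = None  # (score, feature, threshold)
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best  # None when no feature has two distinct values


def build(X, y, depth=0, max_depth=3, min_samples=2):
    """Steps 3-4: split and recurse, or stop and return a leaf."""
    majority = Counter(y).most_common(1)[0][0]
    # Step 4: stop when the group is pure, too small, or too deep.
    if len(set(y)) == 1 or len(y) < min_samples or depth >= max_depth:
        return {"leaf": majority}
    split = best_split(X, y)
    if split is None:
        return {"leaf": majority}
    _, f, t = split
    left = [(row, lab) for row, lab in zip(X, y) if row[f] <= t]
    right = [(row, lab) for row, lab in zip(X, y) if row[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build([r for r, _ in left], [l for _, l in left],
                      depth + 1, max_depth, min_samples),
        "right": build([r for r, _ in right], [l for _, l in right],
                       depth + 1, max_depth, min_samples),
    }


# Toy 2-D data (made up): class B sits in the lower-right corner.
X = [[1, 1], [2, 3], [1, 4], [2, 2], [4, 1], [5, 2], [5, 3], [4, 4], [5, 5]]
y = ["A", "A", "A", "A", "B", "B", "B", "A", "A"]

tree = build(X, y)
print(tree["feature"], tree["threshold"])                    # 0 3.0
print(tree["right"]["feature"], tree["right"]["threshold"])  # 1 3.5
```

On this data the root cuts feature 0 at 3.0, which leaves a pure class-A group on the left. The right group is still mixed, so the recursion cuts it on feature 1 at 3.5, and both of its children come out pure.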

Gini vs entropy, in plain words

Both ask the same thing — "how mixed is this group?" Gini impurity is the chance you'd be wrong if you guessed a label at random from the group's mix; it is 0 when a group is all one class. Entropy measures the same messiness in bits of "surprise," and information gain is simply how much that entropy drops after a split. A split that cleanly separates the classes scores best on either one; in practice they pick almost the same trees, and Gini is just a touch cheaper to compute.
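In symbols, with p_k the share of class k in a group, Gini impurity is 1 - sum(p_k^2) and entropy is sum(p_k * log2(1 / p_k)). A small sketch of both, plus information gain, follows; the function names are chosen here for illustration.

```python
from collections import Counter
from math import log2


def class_shares(labels):
    """Fraction of the group that belongs to each class present in it."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    """Chance of a wrong guess when labelling at random from the group's mix."""
    return 1.0 - sum(p * p for p in class_shares(labels))


def entropy(labels):
    """Average surprise in bits; log2(1 / p) is the surprise of one class."""
    return sum(p * log2(1 / p) for p in class_shares(labels))


def information_gain(parent, left, right):
    """How much entropy drops when parent is split into left and right."""
    n = len(parent)
    after = (len(left) * entropy(left) + len(right) * entropy(right)) / n
    return entropy(parent) - after


mixed = ["A", "A", "B", "B"]
pure = ["A", "A", "A", "A"]

print(gini(mixed), entropy(mixed))   # 0.5 1.0
print(gini(pure), entropy(pure))     # 0.0 0.0
print(information_gain(mixed, ["A", "A"], ["B", "B"]))  # 1.0
```

A 50/50 group is the worst case for two classes on both measures (0.5 for Gini, 1 bit for entropy), and a split that separates it perfectly recovers the full bit.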

Watch the tree grow

The animation starts with a labelled 2D cloud — blue Class A and red Class B — that no single straight line separates cleanly. The tree finds the best axis-aligned cut, draws it, then recurses into the messy region and cuts again. Each split shades the regions by their majority class, and the description names the chosen feature, threshold, and resulting purity. By the end the plane is carved into a handful of pure regions — those are the leaves.

From splits to a tree

Those region cuts are a tree of yes/no questions. Each axis-aligned split above becomes an internal node; each pure region becomes a leaf. Watch the same two splits grow into the flowchart everyone pictures when they hear "decision tree".

Find the best split yourself

Here is the heart of step 2 — choosing where to cut. Below are twelve labelled points spread along a single feature axis. Drag the slider to move the dashed threshold and watch the counts, the per-side Gini impurity, and the weighted Gini of the whole split update live. The tree's goal is simply to find the threshold that makes that weighted Gini as small as possible — the orange marker shows where that best split lands.

Legend: class A, class B, best split (min Gini)
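The slider search can also be done exhaustively in code. The twelve points below are invented stand-ins (the page's actual points are not reproduced here), and the function names are hypothetical; the sketch just scores every midpoint threshold by weighted Gini and keeps the smallest.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one side; an empty side counts as pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(points, threshold):
    """Per-side Gini, weighted by how many points land on each side."""
    left = [label for x, label in points if x <= threshold]
    right = [label for x, label in points if x > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(points)


# Twelve made-up points along one feature axis: mostly A on the left, B on the right.
points = [(1, "A"), (2, "A"), (3, "A"), (4, "B"), (5, "A"), (6, "A"),
          (7, "B"), (8, "B"), (9, "A"), (10, "B"), (11, "B"), (12, "B")]

# Candidate thresholds: midpoints between neighbouring points.
xs = sorted(x for x, _ in points)
candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

best_threshold = min(candidates, key=lambda t: weighted_gini(points, t))
best_score = weighted_gini(points, best_threshold)

print(best_threshold)          # 6.5
print(round(best_score, 3))    # 0.278
```

At 6.5 each side holds five points of its majority class and one stray, so each side has Gini 10/36 and so does the weighted total. Every other threshold leaves at least one side more mixed.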

Overfitting & pruning

Left to grow freely, a tree keeps splitting until every leaf is pure, often with a single point per leaf. That tree scores 100% on the training data (unless identical points carry conflicting labels) and then does much worse on new data, because it has memorized noise instead of learning the real boundary. Controlling the tree's size is the main defense.

Shallow tree (constrained)
  • Few splits: captures the broad structure
  • Generalizes to new data instead of memorizing
  • Easy to read and explain end to end
  • Risk: too shallow can underfit and miss real detail

Deep tree (unconstrained)
  • Many splits: fits every quirk and outlier
  • Overfits: near-perfect on training, poor on test data
  • A maze of rules no human can follow
  • Highly unstable: small data changes reshape it

Three ways to keep a tree honest

max_depth caps how many questions a path can ask, so the tree can't spiral into endless detail. min_samples_leaf refuses to create a leaf that covers too few examples, killing splits that only fit noise. Pruning grows the full tree first, then cuts back branches whose splits don't earn their keep on validation data — trading a little training accuracy for much better generalization.
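Pruning against validation data can be sketched as follows. This uses reduced-error pruning, one simple way to do it: working bottom-up, a split is collapsed into a leaf whenever the leaf makes no more validation errors than the subtree. Libraries may prune differently; scikit-learn, for instance, uses cost-complexity pruning via `ccp_alpha`. The nested-dict tree format, the hand-built tree, and the validation points are all hypothetical.

```python
def predict(node, x):
    """Follow splits until reaching a node with no question, then return its class."""
    while "feature" in node:
        side = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[side]
    return node["majority"]


def errors(node, rows):
    """Number of (x, label) rows the subtree rooted at node gets wrong."""
    return sum(predict(node, x) != label for x, label in rows)


def count_leaves(node):
    if "feature" not in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])


def prune(node, val_rows):
    """Reduced-error pruning: prune the children first, then test this split."""
    if "feature" not in node:
        return node
    f, t = node["feature"], node["threshold"]
    left_rows = [(x, lab) for x, lab in val_rows if x[f] <= t]
    right_rows = [(x, lab) for x, lab in val_rows if x[f] > t]
    node = dict(node,
                left=prune(node["left"], left_rows),
                right=prune(node["right"], right_rows))
    leaf = {"majority": node["majority"]}
    # Collapse the split if a plain leaf does at least as well on validation data.
    if errors(leaf, val_rows) <= errors(node, val_rows):
        return leaf
    return node


# Hand-built overgrown tree; "majority" is the training majority class at each node.
# The deepest split exists only to fit one noisy class-A training point near x = 9.
tree = {
    "feature": 0, "threshold": 5, "majority": "A",
    "left": {"majority": "A"},
    "right": {
        "feature": 0, "threshold": 8.5, "majority": "B",
        "left": {"majority": "B"},
        "right": {
            "feature": 0, "threshold": 9.5, "majority": "B",
            "left": {"majority": "A"},
            "right": {"majority": "B"},
        },
    },
}

validation = [([2], "A"), ([4], "A"), ([6], "B"), ([7], "B"), ([9], "B"), ([11], "B")]

pruned = prune(tree, validation)
print(count_leaves(tree), count_leaves(pruned))   # 4 2
print(errors(tree, validation), errors(pruned, validation))  # 1 0
```

The noisy split at 9.5 costs one validation error, so it is collapsed. The split at 8.5 then separates two leaves that predict the same class, so it goes too. Only the root split, which removes four errors on the validation set, survives.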

When it works — and when it doesn't

Works well when
  • You need a model you can read and explain
  • Features mix numbers and categories with little prep
  • The boundary is naturally rule-like (axis-aligned thresholds)
  • You want a fast baseline that needs no feature scaling and little tuning

Struggles when
  • The tree is left unpruned: a single tree overfits easily
  • The true boundary is a smooth diagonal: staircase splits approximate it poorly
  • Stability matters: small data changes can reshape the whole tree
  • You need top accuracy: an ensemble (random forest, gradient boosting) usually beats a single tree

One tree is rarely the final answer

A lone decision tree is interpretable but fragile. The fix is to grow many of them and combine their predictions: random forests average many de-correlated trees, while gradient boosting (XGBoost, LightGBM) builds trees in sequence, each one correcting the errors of the trees before it. These ensembles trade some interpretability for accuracy that is hard to beat on tabular data.
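The voting idea can be shown with a heavily simplified random forest. The assumptions are loud ones: each "tree" is only a one-question stump, each stump sees a bootstrap sample and a single randomly chosen feature, and the function names and toy data are invented. Real random forests grow full trees and sample features at every split, and gradient boosting is not covered by this sketch.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, feature):
    """Best single threshold on one feature; returns (feature, threshold, left, right)."""
    values = sorted({x[feature] for x, _ in rows})
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [lab for x, lab in rows if x[feature] <= t]
        right = [lab for x, lab in rows if x[feature] > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[0]:
            best = (score, t, majority(left), majority(right))
    if best is None:
        # Every sampled value is identical: fall back to a constant prediction.
        label = majority([lab for _, lab in rows])
        return (feature, 0.0, label, label)
    _, t, left_label, right_label = best
    return (feature, t, left_label, right_label)


def fit_forest(rows, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]   # bootstrap: draw with replacement
        feature = rng.randrange(n_features)         # random feature de-correlates stumps
        forest.append(fit_stump(sample, feature))
    return forest


def predict(forest, x):
    """Each stump votes for a class; the most common vote wins."""
    votes = Counter(left if x[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Toy data (made up): class A in the low corner, class B in the high corner.
rows = [([1, 1], "A"), ([2, 1], "A"), ([1, 2], "A"), ([2, 2], "A"),
        ([8, 8], "B"), ([9, 8], "B"), ([8, 9], "B"), ([9, 9], "B")]

forest = fit_forest(rows)
print(predict(forest, [1.5, 1.5]))  # A
print(predict(forest, [8.5, 8.5]))  # B
```

No single stump sees all the data or both features, yet the vote is stable, because the individual stumps' quirks differ from one bootstrap sample to the next and largely cancel out.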