Train–Test Split

ML evaluation generalization holdout

Don't grade the open-book answers

If you test a model on the very examples it trained on, you learn almost nothing about how it handles new data, and performance on new data is what matters.

So before training, set aside a slice of the data the model never sees. Train on the rest, then measure performance on the untouched slice. That held-out score is an honest estimate of real-world performance, provided the held-out slice resembles the data the model will meet in use.

The split

Training set — the model learns from this. Test set — locked away until the very end, used once to report the final score.

Watch the split in action

The full dataset is shuffled, carved into 80% train and 20% test, the model fits the training points, and only then do we reveal how it does on the held-out test points.
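The shuffle-then-cut step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `shuffle_split`, the fixed seed, and the toy dataset are all assumptions made for the example.

```python
import random


def shuffle_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows`, then cut it into (train, test).

    `shuffle_split` is a hypothetical helper written for this example.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                  # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(100))                    # toy dataset of 100 rows
train, test = shuffle_split(data)
print(len(train), len(test))               # 80 20
```

Fixing the seed is a design choice: it makes the split repeatable, so a later score change comes from the model and not from a different draw of test rows.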

Why a third set — validation

If you keep tweaking the model to improve the test score, the test set quietly leaks into your decisions, and its score becomes an optimistic estimate instead of an honest one. The fix is a three-way split:

Train ~60–80%

The model learns its parameters here.

Validation ~10–20%

Tune hyperparameters and compare models here — you can look at it often.

Test ~10–20%

Touched exactly once, at the end, for the final unbiased number.
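A three-way split can be sketched the same way as a two-way one: shuffle once, then cut twice. The helper name `three_way_split` and the 70/15/15 proportions are assumptions chosen to fall inside the ranges above, not a prescribed recipe.

```python
import random


def three_way_split(rows, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Return (train, validation, test) from one shuffled copy of `rows`.

    Hypothetical helper for illustration only.
    """
    if val_fraction <= 0 or test_fraction <= 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    n_test = round(len(shuffled) * test_fraction)
    test = shuffled[:n_test]                 # set aside first, used once at the end
    val = shuffled[n_test:n_test + n_val]    # used repeatedly while tuning
    train = shuffled[n_test + n_val:]        # everything else fits the parameters
    return train, val, test


train, val, test = three_way_split(range(200))
print(len(train), len(val), len(test))       # 140 30 30
```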

Short on data?

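k-fold cross-validation uses the data more efficiently: split it into k folds, train on k-1 of them, validate on the remaining one, and rotate so that every row serves as validation exactly once.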
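The rotation can be made concrete with a small index generator. This is a sketch of plain k-fold cross-validation under assumed names (`k_fold_indices`, `k=5`); it only produces the row indices for each round and leaves model fitting out.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.

    Hypothetical helper: each row index lands in exactly one validation fold.
    """
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal-sized folds
    for i, val_idx in enumerate(folds):
        # Training indices are every fold except the one held out this round.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx


rounds = list(k_fold_indices(10, k=5))
for train_idx, val_idx in rounds:
    print(len(train_idx), len(val_idx))         # 8 2 on every round
```

In practice the per-round validation scores are averaged to get one estimate.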

Getting it right

Do
  • Shuffle before splitting (unless the data is a time series)
  • Stratify so each split keeps the class balance
  • Fit scalers/encoders on train only, then use them to transform train, validation, and test
  • For time-series, train on the past, test on the future
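Two of the splitting rules in the list above, stratifying and splitting a time series by order, can be sketched as follows. Both helpers (`stratified_split`, `chronological_split`) and the spam/ham toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class separately.

    Hypothetical helper: shuffling and cutting per class keeps the class
    balance roughly the same in both sets.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


def chronological_split(rows, test_fraction=0.2):
    """Split rows that are already sorted oldest-first; no shuffling."""
    n_test = round(len(rows) * test_fraction)
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]                # past, future


labels = ["spam"] * 10 + ["ham"] * 90            # imbalanced toy labels
train_idx, test_idx = stratified_split(labels)
test_labels = [labels[i] for i in test_idx]
print(test_labels.count("spam"), test_labels.count("ham"))   # 2 18

past, future = chronological_split(list(range(10)))
print(past, future)   # [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```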
Avoid (data leakage)
  • Fitting a scaler on the whole dataset before splitting
  • Duplicates that land in both sets
  • Peeking at the test set to pick a model
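Two of the leaks listed above are easy to show in code. The sketch below uses a hand-rolled standardizer (`fit_scaler`, `transform`) and a duplicate check (`shared_rows`); all three names and the toy numbers are assumptions for illustration.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and standard deviation from the given values only."""
    sigma = pstdev(values) or 1.0   # fall back to 1.0 for a constant column
    return mean(values), sigma


def transform(values, scaler):
    """Standardize values with statistics that were learned elsewhere."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


def shared_rows(train_rows, test_rows):
    """Return the rows that appear in both sets (rows must be hashable)."""
    return set(train_rows) & set(test_rows)


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [6.0, 100.0]

scaler = fit_scaler(train)                 # correct: statistics from train only
train_scaled = transform(train, scaler)
test_scaled = transform(test, scaler)      # test is transformed, never fitted

leaky_scaler = fit_scaler(train + test)    # leak: test values shift the mean
print(scaler[0], leaky_scaler[0] > scaler[0])   # 3.0 True

# Duplicate check on toy rows represented as tuples.
overlap = shared_rows([(1, "a"), (2, "b")], [(2, "b"), (3, "c")])
print(overlap)                             # {(2, 'b')}
```

In the leaky version the test values have already influenced the numbers the model is trained on, which is exactly what the test set is supposed to prevent.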