Train–Test Split · Suman Bhadra Notes

Don't grade the open-book answers

If you test a model on the very examples it trained on, you learn nothing about how it handles new data — and that's the only thing that matters.

So before training, set aside a slice of the data the model never sees. Train on the rest, then measure performance on the untouched slice. That held-out score is your honest estimate of real-world performance, as long as the held-out slice resembles the data the model will meet in practice.
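A small sketch of why the training score flatters the model. Everything here is made up for illustration: the data is synthetic (random features, coin-flip labels), and the "model" is a 1-nearest-neighbour memorizer, chosen only because it makes the effect obvious.

```python
import random

rng = random.Random(0)

# Synthetic data: a random feature and a coin-flip label, so there is
# no real pattern to learn.
data = [(rng.random(), rng.randint(0, 1)) for _ in range(200)]
train, test = data[:160], data[160:]

def predict(x):
    """1-nearest-neighbour: return the label of the closest training point."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]

def accuracy(rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)

train_acc = accuracy(train)  # every training point is its own nearest neighbour
test_acc = accuracy(test)    # unseen points: expect something near chance
print(train_acc, test_acc)
```

The training score looks perfect because the model has simply memorized its answers. Only the held-out score shows that it learned nothing it can reuse.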

The split

Training set — the model learns from this. Test set — locked away until the very end, used once to report the final score.

The split in action

The full dataset is shuffled and carved into 80% train and 20% test. The model is fit on the training points only, and only then do we reveal how it does on the held-out test points.
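The shuffle-then-cut step can be sketched in plain Python. The helper name `shuffle_split`, the seed, and the 100-row stand-in dataset are assumptions made for this example; in practice a library routine such as scikit-learn's `train_test_split` does the same job.

```python
import random

def shuffle_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows, then cut it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in dataset: 100 row ids
train, test = shuffle_split(rows)
print(len(train), len(test))  # 80 20
```

Fixing the seed means the same rows land in the test set on every run, so scores from different experiments stay comparable.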

Why a third set — validation

If you keep tweaking the model to improve the test score, the test set quietly leaks into your decisions and stops being honest. The fix is a three-way split:

Train ~60–80%

The model learns its parameters here.

Validation ~10–20%

Tune hyperparameters and compare models here — you can look at it often.

Test ~10–20%

Touched exactly once, at the end, for the final unbiased number.
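A minimal sketch of the three-way split, assuming a 70/15/15 ratio picked from the ranges above. The helper name `three_way_split` and the 200-row stand-in dataset are hypothetical.

```python
import random

def three_way_split(rows, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle a copy of rows, then cut it into (train, val, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                # opened once, at the very end
    val = shuffled[n_test:n_test + n_val]   # consulted freely while tuning
    train = shuffled[n_test + n_val:]       # the model fits its parameters here
    return train, val, test

train, val, test = three_way_split(range(200))
print(len(train), len(val), len(test))  # 140 30 30
```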

Short on data?

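k-fold cross-validation uses the data more efficiently: split it into k folds, train on k−1 of them, validate on the one left out, and rotate until every row has been in the validation fold exactly once.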
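A sketch of how the folds could be generated, assuming the plain k-fold variant with k = 5. The generator name `k_fold_indices` is hypothetical, and the rows are assumed to be shuffled already if their order matters.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, val_idx) pairs; each row lands in validation once."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, k=5))
for train_idx, val_idx in folds:
    print(val_idx, "validates; the other", len(train_idx), "rows train")
```

The model is trained and scored once per fold, and the k validation scores are typically averaged. The test set still stays outside this loop.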

Getting it right

Do

Shuffle before splitting (unless it's time-series)
Stratify so each split keeps the class balance
Fit scalers/encoders on train only, then apply (transform) to both train and test
For time-series, train on the past, test on the future
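The "fit on train only" rule can be sketched with a tiny standardizer. `SimpleScaler` is a hypothetical stand-in written for this example, not a library class, and the numbers are made up.

```python
from statistics import mean, pstdev

class SimpleScaler:
    """Minimal standardizer: learns mean and std in fit, reuses them in transform."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # avoid dividing by zero on constant data
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [6.0, 10.0]

scaler = SimpleScaler().fit(train)       # statistics come from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)     # test reuses the train statistics
print(scaler.mean_, test_scaled)
```

Because the test rows never influence the mean or standard deviation, the scaled test values can fall outside the range seen in training. That is expected, and it mirrors what happens with genuinely new data.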

Avoid (data leakage)

Scaling on the whole dataset before splitting
Duplicates that land in both sets
Peeking at the test set to pick a model
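One of these leaks is easy to check mechanically. The sketch below looks for exact duplicate rows shared by both sets. The helper name `shared_rows` and the sample rows are invented for illustration, and near-duplicates would need a fuzzier comparison.

```python
def shared_rows(train, test):
    """Return the rows that appear in both splits (exact duplicates only)."""
    return set(map(tuple, train)) & set(map(tuple, test))

train = [(5.1, "a"), (4.9, "b"), (6.2, "a")]
test = [(4.9, "b"), (7.0, "c")]

leaked = shared_rows(train, test)
print(leaked)  # {(4.9, 'b')}
```

A simple remedy is to drop duplicates before splitting, so that no row can land on both sides.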