Train–Test Split
Don't grade the open-book answers
If you test a model on the very examples it trained on, you learn nothing about how it handles new data — and that's the only thing that matters.
So before training, set aside a slice of the data the model never sees. Train on the rest, then measure performance on the untouched slice. That held-out score is your honest estimate of real-world performance.
Training set: the model learns from this.
Test set: locked away until the very end, used once to report the final score.
The split in action
The full dataset is shuffled and carved into 80% train and 20% test. The model fits the training points, and only then do we reveal how it does on the held-out test points.
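The sequence above (shuffle, cut 80/20, fit on train, score on test) can be sketched with the standard library alone. The toy data, the straight-line model, and the helper names `train_test_split`, `fit_line`, and `mse` are illustrative assumptions, not something the text prescribes.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows, then cut off test_fraction of it as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(points):
    """Least-squares fit of y = a*x + b on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def mse(model, points):
    """Mean squared error of the fitted line on the given points."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in points) / len(points)

# Hypothetical toy data: y = 2x + 1 plus a small deterministic wobble.
data = [(x, 2 * x + 1 + ((x * 7) % 5 - 2) * 0.1) for x in range(50)]

train, test = train_test_split(data)   # 40 train rows, 10 test rows
model = fit_line(train)                # the model only ever sees train
train_mse = mse(model, train)
test_mse = mse(model, test)            # revealed only after fitting
```

In practice a library routine such as scikit-learn's `train_test_split` does the same job; the hand-written version is here only to make each step visible.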
Why a third set — validation
If you keep tweaking the model to improve the test score, the test set quietly leaks into your decisions and stops being honest. The fix is a three-way split:
Training set: the model learns its parameters here.
Validation set: tune hyperparameters and compare models here. You can look at it often.
Test set: touched exactly once, at the end, for the final unbiased number.
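A minimal sketch of the three-way workflow follows. The 60/20/20 proportions, the toy data, and the k-nearest-neighbour model are assumptions chosen for illustration; the text does not fix any of them. The point is the order of operations: pick the hyperparameter on validation, then score on test once.

```python
import random

def three_way_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and cut it into train, validation, and test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, points, k):
    """Mean squared error of k-NN predictions on the given points."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in points) / len(points)

# Hypothetical toy data: a curve plus seeded noise.
rng = random.Random(1)
data = [(x / 10, (x / 10) ** 2 + rng.gauss(0, 0.5)) for x in range(100)]

train, val, test = three_way_split(data)

# Tune on validation: this set can be consulted as often as needed.
candidate_ks = [1, 3, 5, 9]
best_k = min(candidate_ks, key=lambda k: mse(train, val, k))

# Touch the test set exactly once, with the already-chosen k.
final_test_mse = mse(train, test, best_k)
```

Because `best_k` was chosen without ever reading `test`, `final_test_mse` is not biased by the tuning loop.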
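Cross-validation uses the data more efficiently: in k-fold cross-validation, the rows are divided into k folds, and each fold takes one turn as the validation set while the model trains on the rest, so every row serves as validation data exactly once.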
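To make "every row is validation exactly once" concrete, here is a small sketch of k-fold index generation. The function name `k_fold_indices` is hypothetical, and it assumes the rows have already been shuffled where shuffling is appropriate.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, val_idx) pairs; each row index lands in val exactly once."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_rows=10, k=3))
# Validation folds: [0, 1, 2, 3], then [4, 5, 6], then [7, 8, 9]
```

A typical use is to train one model per fold, score it on that fold's validation indices, and average the k scores. The test set stays outside this loop entirely.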
Getting it right
- Shuffle before splitting (unless it's time-series)
- Stratify so each split keeps the class balance
- Fit scalers/encoders on train only, then apply (transform) to both train and test
- For time-series, train on the past, test on the future
- Don't fit scalers on the whole dataset before splitting
- Don't let duplicates land in both sets
- Don't peek at the test set to pick a model
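Several items in the checklist above can be sketched together: removing duplicates before splitting, stratifying by class, fitting a scaler on train only, and splitting a time series by date. All helper names and the toy rows are hypothetical, and the scaler is a plain mean/standard-deviation standardizer standing in for whatever preprocessing is actually used.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, test_fraction=0.2, seed=0):
    """Split each class separately so both sides keep the class balance."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

def fit_standard_scaler(values):
    """Learn mean and standard deviation from the given values only."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return mean, (std if std > 0 else 1.0)  # guard against constant features

def transform(values, scaler):
    """Apply an already-fitted scaler without refitting it."""
    mean, std = scaler
    return [(v - mean) / std for v in values]

def chronological_split(rows, time_of, test_fraction=0.2):
    """Time series: no shuffle; train on the past, test on the future."""
    ordered = sorted(rows, key=time_of)
    cut = len(ordered) - round(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]

# Hypothetical rows: (feature, label), 90 of class 0 and 10 of class 1.
rows = [(float(i), 0) for i in range(90)] + [(float(100 + i), 1) for i in range(10)]

# Drop exact duplicates first so a copy cannot land on both sides.
rows = list(dict.fromkeys(rows))

train, test = stratified_split(rows, label_of=lambda r: r[1])

# Fit the scaler on train only, then apply it to both train and test.
scaler = fit_standard_scaler([x for x, _ in train])
train_x = transform([x for x, _ in train], scaler)
test_x = transform([x for x, _ in test], scaler)

# Time-series case: rows are (timestamp, value); the latest rows become test.
events = [(3, "c"), (1, "a"), (2, "b"), (5, "e"), (4, "d")]
past, future = chronological_split(events, time_of=lambda r: r[0])
```

Fitting the scaler on `train` alone matters because the mean and standard deviation are themselves learned quantities; computing them over all rows would let test-set information shape the training features.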