Data Leakage & Pipelines · Suman Bhadra Notes

The model that aced the test by peeking

Data leakage is information from outside the training data — usually from the test set, or from the future — sneaking into training. The model isn't smarter; it's cheating, and you graded the cheating as skill.

The cruel part is that everything looks fine. Your code runs, your train–test split exists, your validation score is fantastic. Then the model meets real data, where the leaked information isn't available, and the score falls off a cliff.

The symptom

Suspiciously great validation scores, terrible production performance. If a model looks too good to be true on day one, suspect leakage before you suspect genius.

The classic ways it sneaks in

Leakage rarely announces itself. It hides in four familiar disguises:

Preprocessing leakage: fit before split

Fitting a scaler, encoder or imputer on all the data before splitting. The test set's mean, categories and gaps contaminate the training transform.

Target leakage: proxy for the answer

A feature that quietly is the label — like antibiotic_given when predicting infection. It's only filled in after the diagnosis, so it won't exist at prediction time.
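One way to act on this is to keep an explicit list of columns that are only recorded after the outcome is known, and drop them before modelling. The sketch below is illustrative only: the toy rows, the extra column names and the POST_OUTCOME_COLUMNS list are assumptions, not part of the original notes.

```python
import pandas as pd

# Hypothetical toy data; only antibiotic_given and infection come from the text.
df = pd.DataFrame({
    "age": [34, 61, 47, 29],
    "temperature_c": [38.9, 37.0, 39.4, 36.8],
    "antibiotic_given": [1, 0, 1, 0],  # filled in after the diagnosis
    "infection": [1, 0, 1, 0],         # the label
})

# Assumed audit list: columns that would not exist at prediction time.
POST_OUTCOME_COLUMNS = ["antibiotic_given"]

# Drop both the label and every post-outcome column from the features.
X = df.drop(columns=POST_OUTCOME_COLUMNS + ["infection"])
y = df["infection"]

# In this toy frame the leaky column is an exact copy of the label.
leak_matches_label = bool((df["antibiotic_given"] == df["infection"]).all())
```

The list has to come from domain knowledge about when each field is recorded; no library call can infer it.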

Temporal leakage: future → past

Random splits on time-series data let future-dated rows leak into training, so the model learns from information it would not have at prediction time. Always split by time: train on the past, test on the future.
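A minimal sketch of a chronological split, assuming a pandas DataFrame with a hypothetical timestamp column and an 80/20 cut chosen purely for illustration:

```python
import pandas as pd

# Hypothetical toy frame: one row per day.
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=10, freq="D"),
    "value": range(10),
})

# Sort chronologically, then cut at a fixed point instead of shuffling.
df = df.sort_values("timestamp").reset_index(drop=True)
cutoff = int(len(df) * 0.8)  # assumed 80/20 split

train = df.iloc[:cutoff]  # the past
test = df.iloc[cutoff:]   # the future
```

For cross-validation on ordered data, scikit-learn's TimeSeriesSplit follows the same idea of training only on earlier rows.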

Duplicate / group leakage: same entity twice

The same patient, user or document lands in both train and test. The model recognizes the individual instead of learning the pattern. Split by group, not by row.
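Splitting by group can be sketched with scikit-learn's GroupShuffleSplit. The patient_id array and the toy features below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 12 rows belonging to 5 hypothetical patients.
X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)
patient_id = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5])

# Whole groups go to one side or the other, never both.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_id))

train_groups = set(patient_id[train_idx])
test_groups = set(patient_id[test_idx])
```

GroupKFold plays the same role when you need several folds rather than a single split.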

The golden rule: fit on train, transform everything else

Every defense against preprocessing leakage is the same three moves, in the same order:

1 · Split first: before anything

Do the train–test split before any statistic is computed. The test set goes in the vault immediately.

2 · Fit on train only: learn the stats

Means, scales, categories, imputation values — every preprocessing step learns its numbers from the training fold alone.

3 · Freeze & apply: transform test

Apply the already-fitted transforms to validation and test. They get transformed with train's numbers, never fitted on their own.
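The three moves can be sketched by hand with scikit-learn. The synthetic data, the 25% test size and the choice of a mean imputer plus a standard scaler are assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features with roughly 10% missing values (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.integers(0, 2, size=100)

# 1. Split first, before any statistic is computed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Fit on train only: imputation means and scaling stats come from X_train.
imputer = SimpleImputer(strategy="mean").fit(X_train)
scaler = StandardScaler().fit(imputer.transform(X_train))

# 3. Freeze and apply: the test set is transformed, never fitted.
X_train_ready = scaler.transform(imputer.transform(X_train))
X_test_ready = scaler.transform(imputer.transform(X_test))
```

Note that fit is only ever called with X_train; the test set appears exclusively in transform calls.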

Why this works

At prediction time, production data will be transformed with statistics learned from training data — because that's all you'll have. Treating the test set the same way is what makes its score an honest preview of production.

Watch leakage happen — then watch the fix

First the wrong order: a scaler fits on the full dataset and quietly absorbs test statistics. Then the right order, a target-leakage trap, and finally a pipeline keeping every cross-validation fold clean.
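The first step of that sequence, the wrong order next to the right one, can be reproduced in a few lines. This is a toy sketch with hand-picked numbers, chosen so the test rows visibly drag the scaler's mean.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: the last two rows play the test set and hold much larger values.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0], [200.0]])
X_train, X_test = X[:4], X[4:]

# Wrong order: the scaler sees every row, test rows included.
leaky_scaler = StandardScaler().fit(X)

# Right order: the scaler sees the training rows only.
clean_scaler = StandardScaler().fit(X_train)

leaky_mean = leaky_scaler.mean_[0]  # mean over all six rows
clean_mean = clean_scaler.mean_[0]  # mean over the four training rows
print(leaky_mean, clean_mean)
```

The two means differ because the leaky scaler has absorbed the test rows; any model trained on its output has indirectly seen them too.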

Pipelines make the rule automatic

You can follow the golden rule by hand, but discipline doesn't survive a 2 a.m. refactor. The structural fix is scikit-learn's Pipeline: it chains your preprocessing and your model (say scaler → encoder → classifier) into one object with a single fit and predict. Calling fit fits each step, in order, on whatever data you pass it, so pass it the training set only; calling predict applies the frozen transforms and then the model. As long as the test set is only ever handed to predict or score, there is no moment where it can touch a fitting step.
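A minimal sketch of such a pipeline, assuming synthetic data from make_classification and a scaler plus logistic regression as stand-ins for whatever steps you actually use:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit sees training rows only; predict reuses the frozen scaler.
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)

# The scaler inside the pipeline learned its mean from X_train alone.
scaler_uses_train_stats = np.allclose(
    pipe.named_steps["scaler"].mean_, X_train.mean(axis=0)
)
```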

The real payoff shows up in cross-validation. Pass the pipeline itself to cross_val_score and the entire chain is cloned and refitted from scratch inside every fold: the scaler relearns its mean on each fold's training portion, never on that fold's validation rows. Scale once up front and then cross-validate, and every fold is contaminated; wrap the scaler in a pipeline and preprocessing leakage across folds is ruled out by construction, not just discouraged.
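Passing the whole pipeline to cross_val_score might look like the sketch below. The synthetic data, the five folds and the model choice are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Unscaled X goes in; scaling happens inside each fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Each fold clones the pipeline and refits it on that fold's training rows.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Because cross_val_score works on clones, the pipe object itself is left unfitted afterwards; call pipe.fit on the full training set when you want the final model.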

Rule of thumb

If a step has a fit method, it belongs inside the pipeline. Anything fitted outside the pipeline is a leak waiting to happen.
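When columns need different treatment, the rule still holds: every step with a fit method can live inside one pipeline via ColumnTransformer. The sketch below uses a made-up frame with hypothetical age and city columns. The unseen category "D" shows the encoder relying only on what it learned from training.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training rows and new rows arriving at prediction time.
train = pd.DataFrame({
    "age": [25, 32, None, 51, 46, 38],
    "city": ["A", "B", "A", "C", "B", "A"],
    "label": [0, 0, 1, 1, 1, 0],
})
new_rows = pd.DataFrame({"age": [29, None], "city": ["B", "D"]})

# Imputer, scaler and encoder all have fit, so all go inside the pipeline.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])

model.fit(train[["age", "city"]], train["label"])
preds = model.predict(new_rows)

# Categories come from the training rows only.
encoder = model.named_steps["prep"].named_transformers_["cat"]
learned_categories = list(encoder.categories_[0])
```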

Smell tests

Pipelines stop preprocessing leakage, but target and group leakage live in the data itself — you have to sniff those out:

Warning signs of leakage

Too good to be true: validation jumps to near-perfect on a problem that should be hard.

One dominant feature: a single feature towers over all others in importance; ask whether it would exist before the label does.

Fragile brilliance: remove one feature and the score collapses from 0.99 to 0.65.

The production gap: the deployed model performs far worse than validation ever did.

Any one of these means: stop tuning, start auditing.
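The "fragile brilliance" check can be automated as a drop-one-feature audit. The sketch below runs on synthetic data in which one column is deliberately built as a near copy of the label. The feature names, noise levels and model are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)

# Three weakly informative features plus one planted leak.
honest = rng.normal(size=(n, 3)) + 0.3 * y[:, None]
leaky = y + rng.normal(scale=0.05, size=n)  # near copy of the label
X = np.column_stack([honest, leaky])
names = ["f0", "f1", "f2", "antibiotic_given"]

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline = cross_val_score(model, X, y, cv=5).mean()

# Remove one feature at a time and record how far the score falls.
drops = {}
for i, name in enumerate(names):
    X_without = np.delete(X, i, axis=1)
    score = cross_val_score(model, X_without, y, cv=5).mean()
    drops[name] = baseline - score

suspect = max(drops, key=drops.get)
print(suspect, round(drops[suspect], 2))
```

A large drop does not prove leakage on its own, since a legitimately strong feature behaves the same way. It tells you which column to audit first.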