Handling Missing Values
Every real dataset has holes
A sensor drops out, a form field is left blank, a record is corrupted — and now a cell is empty. Most models can't train on a blank, so you must decide what to do with it.
There's no single right answer. The best choice depends on why the data is missing and how much of it is gone.
Data missing at random (a glitch) is generally safe to impute. Data missing not at random (for example, high earners skipping the income field) means the blank itself carries information, so handle it with care.
See the strategies on one table
Take a small table with two missing cells in the age column. Dropping rows, mean imputation, and median imputation each handle the gaps differently.
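The comparison can be sketched in a few lines of standard-library Python. The table, the names, and the `fill_age` helper are hypothetical, chosen only to illustrate the three strategies; one deliberately large age (70) is included so that the mean and the median differ.

```python
from statistics import mean, median

# Hypothetical table: two of the six ages are missing (None).
rows = [
    {"name": "Ana", "age": 22},
    {"name": "Ben", "age": None},
    {"name": "Chloe", "age": 25},
    {"name": "Dev", "age": 31},
    {"name": "Eli", "age": None},
    {"name": "Fay", "age": 70},
]

# Only the observed ages are used to compute the fill values.
observed = [r["age"] for r in rows if r["age"] is not None]


def fill_age(table, value):
    """Return a copy of the table with missing ages replaced by `value`."""
    return [{**r, "age": value if r["age"] is None else r["age"]} for r in table]


# Strategy 1: drop every row that has a missing age (6 rows become 4).
dropped = [r for r in rows if r["age"] is not None]

# Strategy 2: fill with the mean, which the outlier (70) pulls upward.
mean_filled = fill_age(rows, mean(observed))

# Strategy 3: fill with the median, which the outlier barely affects.
median_filled = fill_age(rows, median(observed))

print(len(dropped), mean_filled[1]["age"], median_filled[1]["age"])
```

With these made-up numbers the mean fill is 37 and the median fill is 28, so the single 70-year-old shifts the mean-imputed rows by nine years.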
The toolbox
Drop rows: Simple, and unbiased when values are missing at random, but it wastes data. Fine when only a few rows are affected.
Drop the column: If a feature is more than 50–70% missing, it may be more noise than signal.
Mean or median imputation: Fill with the column's centre. The median resists outliers; the mean is the default for symmetric data.
Mode imputation: Fill a missing category with the most common value.
Model-based imputation: Predict the missing value from the other columns. More accurate, but more work.
Missing indicator: Add an "is_missing" column so the model can learn from the absence itself.
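Two of these tools, filling a category with its most common value and adding a missing flag, might look like the sketch below. The columns (`city`, `income`) and the `impute_with_flag` helper are assumptions made up for illustration, and the median is used for the numeric column only as one reasonable choice.

```python
from statistics import median, mode

# Hypothetical records: one missing category and one missing number.
rows = [
    {"city": "Lima", "income": 40.0},
    {"city": None, "income": 55.0},
    {"city": "Lima", "income": None},
    {"city": "Oslo", "income": 62.0},
]

# Fill values are computed from the observed cells only.
most_common_city = mode(r["city"] for r in rows if r["city"] is not None)
income_median = median(r["income"] for r in rows if r["income"] is not None)


def impute_with_flag(table):
    """Fill blanks and record which incomes were originally missing."""
    out = []
    for r in table:
        out.append({
            "city": most_common_city if r["city"] is None else r["city"],
            "income": income_median if r["income"] is None else r["income"],
            # 1 if the income was blank before imputation, else 0.
            "income_is_missing": int(r["income"] is None),
        })
    return out


cleaned = impute_with_flag(rows)
for r in cleaned:
    print(r)
```

The flag column keeps the information that a value was absent even after the blank has been filled, which matters when the absence is itself a signal.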
The cardinal rule
Compute the mean/median from the training data, then use that same value to fill blanks in validation and test. Computing it over the whole dataset leaks information — see Train–Test Split.
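A minimal sketch of the rule, using only the standard library. The split, the values, and the `fit_median` and `fill` helpers are hypothetical; the point is that the fill value is learned once from the training split and then reused unchanged.

```python
from statistics import median

# Hypothetical splits of an age column; None marks a missing value.
train_ages = [22, None, 25, 31, None, 70]
test_ages = [None, 64, 90]


def fit_median(values):
    """Learn the fill value from the observed entries of one split."""
    return median(v for v in values if v is not None)


def fill(values, fill_value):
    """Replace every missing entry with a previously learned value."""
    return [fill_value if v is None else v for v in values]


# Correct: the statistic comes from the training data only...
train_median = fit_median(train_ages)
train_filled = fill(train_ages, train_median)
# ...and the same value is reused for the test split.
test_filled = fill(test_ages, train_median)

# Leaky: the statistic is computed over train and test together,
# so information about the test set reaches the training pipeline.
leaky_median = fit_median(train_ages + test_ages)

print(train_median, leaky_median)
```

With these numbers the training-only median is 28.0, while the leaky version is 47.5 because it has already seen the test ages. Library imputers such as scikit-learn's `SimpleImputer` follow the same pattern: fit on the training data, then transform every split.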
Do
- Investigate why values are missing
- Use median for skewed numeric columns
- Consider a missing flag when absence is meaningful
Don't
- Blindly fill with 0, which distorts the distribution
- Impute using test data statistics (leakage)
- Ignore that imputation shrinks variance
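Two of these warnings are easy to check numerically. The sketch below uses a hypothetical age column and a made-up `fill` helper to compare the sample variance before and after mean imputation, and the column mean before and after filling with 0.

```python
from statistics import mean, variance

# Hypothetical age column; None marks a missing value.
ages = [22, None, 25, 31, None, 70]
observed = [a for a in ages if a is not None]


def fill(values, fill_value):
    """Replace every missing entry with `fill_value`."""
    return [fill_value if v is None else v for v in values]


mean_filled = fill(ages, mean(observed))
zero_filled = fill(ages, 0)

# Mean imputation adds points exactly at the centre, so the spread shrinks.
var_before = variance(observed)
var_after_mean = variance(mean_filled)

# Filling with 0 adds implausible ages and drags the mean downward.
mean_before = mean(observed)
mean_after_zero = mean(zero_filled)

print(var_before, var_after_mean)
print(mean_before, mean_after_zero)
```

Here the sample variance falls from 498 to 298.8 after mean imputation, and the zero fill pulls the average age from 37 down to roughly 24.7, even though nobody in the table is close to zero years old.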