Handling Missing Values

Every real dataset has holes

A sensor drops out, a form field is left blank, a record is corrupted — and now a cell is empty. Most models can't train on a blank, so you must decide what to do with it.

There's no single right answer. The best choice depends on why the data is missing and how much of it is gone.

First ask: why is it missing?

Values missing at random (a sensor glitch, say) are generally safe to impute. Values missing not at random (high earners skipping the income field, for example) mean the blank itself carries information, so handle them with care.
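
One rough way to probe the question is to compare another column between rows where the value is present and rows where it is blank. The sketch below assumes pandas and uses a made-up survey table (the column names and numbers are hypothetical). A large gap hints that missingness depends on something observed; it cannot prove that the blanks depend on the unseen values themselves.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: income is blank for some respondents.
df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 38, 60, 29, 55],
    "income": [30_000, 42_000, np.nan, np.nan, 50_000, np.nan, 35_000, 61_000],
})

# Boolean mask: True where income is missing.
income_missing = df["income"].isna()

# Compare an observed column (age) across the two groups.
mean_age_missing = df.loc[income_missing, "age"].mean()
mean_age_present = df.loc[~income_missing, "age"].mean()

print(f"mean age where income is missing: {mean_age_missing:.1f}")
print(f"mean age where income is present: {mean_age_present:.1f}")
```

Here the respondents with a blank income are noticeably older on average, which would be a reason to doubt that the blanks are a random glitch.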

See the strategies on one table

Take a small table with two missing cells in the age column. Dropping the rows, mean imputation, and median imputation each handle the gaps differently.
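
The comparison can be reproduced in a few lines. This is a minimal sketch assuming pandas; the names and ages are invented for illustration, with one deliberately large age so that the mean and the median disagree.

```python
import numpy as np
import pandas as pd

# Hypothetical table: two of the six ages are missing.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Dee", "Eli", "Fay"],
    "age":  [22, np.nan, 25, 28, np.nan, 65],
})

# Strategy 1: drop every row whose age is missing (6 rows become 4).
dropped = df.dropna(subset=["age"])

# Strategy 2: fill with the mean of the observed ages (22, 25, 28, 65).
mean_filled = df["age"].fillna(df["age"].mean())

# Strategy 3: fill with the median of the observed ages.
median_filled = df["age"].fillna(df["age"].median())

print(len(dropped))             # 4
print(mean_filled.tolist())     # gaps become 35.0
print(median_filled.tolist())   # gaps become 26.5
```

The single age of 65 pulls the mean up to 35.0, while the median stays at 26.5, closer to the bulk of the values.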

The toolbox

Drop rows: dropna()

Simple, but it wastes data, and it is unbiased only when values are missing completely at random. Fine when only a few rows are affected.

Drop columns: mostly empty

If more than roughly 50–70% of a feature is missing, it may contribute more noise than signal.
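
A sketch of this filter, assuming pandas. The table is hypothetical and the 0.6 cut-off is just one value picked from the range above, not a recommended setting.

```python
import numpy as np
import pandas as pd

# Hypothetical table: "notes" is empty in 4 of 5 rows.
df = pd.DataFrame({
    "age":   [22, 31, np.nan, 45, 38],
    "city":  ["Oslo", "Lima", "Pune", None, "Rome"],
    "notes": [np.nan, np.nan, "vip", np.nan, np.nan],
})

# Fraction of missing cells per column (mean of a boolean mask).
missing_fraction = df.isna().mean()

# Assumed threshold: drop columns that are more than 60% empty.
threshold = 0.6
cols_to_drop = missing_fraction[missing_fraction > threshold].index
reduced = df.drop(columns=cols_to_drop)

print(missing_fraction)
print(list(reduced.columns))   # ['age', 'city']
```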

Mean / median: numeric

Fill with the column's centre. Median resists outliers; mean is the default for symmetric data.

Mode: categorical

Fill a missing category with the most common value.
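
In pandas this might look like the sketch below; the colour values are made up. Note that `mode()` can return several values when there is a tie, so the example simply takes the first one.

```python
import pandas as pd

# Hypothetical categorical column with two blanks.
colours = pd.Series(["red", "blue", None, "red", "green", None], name="colour")

# mode() ignores missing values; take the first entry in case of ties.
most_common = colours.mode().iloc[0]

filled = colours.fillna(most_common)
print(most_common)       # red
print(filled.tolist())
```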

Model-based: KNN / regression

Predict the missing value from the other columns — more accurate, more work.
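
As one possible illustration of the KNN variant, the sketch below assumes scikit-learn's `KNNImputer`. The toy array and the choice of `n_neighbors=2` are assumptions made for the example, not tuned settings.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: the second feature is missing in the middle row.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [4.0, 40.0],
    [5.0, 50.0],
])

# Each gap is filled with the average of that feature over the
# 2 nearest rows, with distance measured on the observed features.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled[2])   # the rows with 2.0 and 4.0 are nearest, so the gap becomes 30.0
```

Compared with a column mean (which would also be 30.0 here only by coincidence of the symmetric toy data), the neighbour-based fill adapts to where the row sits relative to the others.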

Missing indicator: add a flag column

Add an "is_missing" column so the model can learn from the absence itself.
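
A minimal pandas sketch of the idea follows. The column name `income_is_missing` and the data are hypothetical. The order matters: the flag has to be recorded before the blanks are filled, otherwise the information is already gone.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two blanks.
df = pd.DataFrame({"income": [30_000, np.nan, 52_000, np.nan, 41_000]})

# 1. Record where the value was missing (1 = missing, 0 = present).
df["income_is_missing"] = df["income"].isna().astype(int)

# 2. Only then fill the blanks, here with the median of the observed values.
df["income"] = df["income"].fillna(df["income"].median())

print(df)
```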

The cardinal rule

Fit imputation on the training set only

Compute the mean/median from the training data, then use that same value to fill blanks in validation and test. Computing it over the whole dataset leaks information — see Train–Test Split.
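
The rule can be sketched with scikit-learn's `SimpleImputer`, assuming the split has already been made. The tiny arrays are invented; the test set contains an extreme value on purpose to show that it does not influence the fill value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature data, already split.
X_train = np.array([[20.0], [30.0], [np.nan], [40.0]])
X_test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")

# fit_transform learns the median from the training rows only (20, 30, 40).
X_train_filled = imputer.fit_transform(X_train)

# transform reuses that stored median; the 100.0 in the test set plays no part.
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_)   # [30.]
print(X_test_filled)
```

Calling `fit` or `fit_transform` on the test set, or on the full dataset before splitting, would be the leaky version of the same code.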

Do
  • Investigate why values are missing
  • Use median for skewed numeric columns
  • Consider a missing flag when absence is meaningful
Don't
  • Blindly fill with 0 — it distorts the distribution
  • Impute using test data statistics (leakage)
  • Ignore that filling every gap with one value (mean, median, mode) shrinks variance
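
The variance point in the last item can be checked on synthetic data. The sketch below assumes NumPy and pandas; the distribution, the seed, and the 30% missing rate are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic column: 1000 values, then blank out 300 of them at random.
full = pd.Series(rng.normal(loc=50, scale=10, size=1000))
with_gaps = full.copy()
missing_positions = rng.choice(len(full), size=300, replace=False)
with_gaps.iloc[missing_positions] = np.nan

# Spread of the values that were actually observed.
observed_std = with_gaps.std()

# Spread after every gap is filled with the same mean.
imputed_std = with_gaps.fillna(with_gaps.mean()).std()

print(f"std of observed values: {observed_std:.2f}")
print(f"std after mean filling: {imputed_std:.2f}")
```

The filled values all sit exactly at the centre, so they add rows without adding spread, and the standard deviation after filling comes out smaller than that of the observed values.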