Feature Scaling

ML preprocessing normalization standardization

When units hijack the model

Suppose you predict from age (0–80) and salary (0–100,000). To a distance-based model, a €10,000 gap dwarfs a 40-year gap — salary drowns out age purely because of its scale.

Feature scaling fixes this by rescaling every feature to a comparable range, so each one contributes on its merits, not its units.

Who needs it

Distance-based (KNN, K-Means, SVM) and gradient-based (linear/logistic regression, neural nets) models. Tree-based models (trees, forests) don't care — they split one feature at a time.

Two features, two scalings

The same points plotted three ways: raw (salary axis dominates), min-max normalized to [0, 1], and standardized to mean 0 / std 1.

The two methods

Normalization (Min-Max) (x − min) / (max − min)

Squeezes values into [0, 1]. Preserves the shape, but sensitive to outliers (min/max are extreme points).

Standardization (Z-score) (x − μ) / σ

Centres at 0 with unit variance. No fixed range; far more robust to outliers. The common default.

Which to use

Min-Max when
  • You need a bounded range (e.g. image pixels, neural net inputs)
  • The data has few outliers
  • The distribution isn't assumed to be Gaussian
Z-score when
  • There are outliers (min-max would crush everything else)
  • The model assumes roughly Gaussian features
  • You want the safe, general-purpose default
Don't leak

Fit the scaler (its min/max or μ/σ) on the training set only, then transform train, validation, and test with those same numbers. Re-fitting on test leaks information — see Train–Test Split.