Feature Engineering

Tags: ML, features, preprocessing, domain knowledge

Garbage in, garbage out

Feature engineering is reshaping your raw data into inputs that expose the pattern a model is trying to learn.

A famous truism: "applied machine learning is basically feature engineering." The same algorithm can flop or shine depending largely on what you feed it, and a well-crafted feature often beats a fancier model trained on raw columns.

In one sentence

Use what you know about the problem to build columns that make the right answer easy to read off.

See a feature unlock the pattern

Picture two classes that no straight line can separate in the raw (x, y) space. Engineer one new feature, the distance from the centre, and the classes pull apart cleanly.
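To make this concrete, here is a minimal sketch on made-up data. It assumes one class sits near the centre and the other forms a ring around it. The sample sizes, radii, and the threshold of 1.5 are illustrative choices, not values from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy data: class 0 lies inside radius 1,
# class 1 lies in a ring between radius 2 and 3.
n = 200
angle = rng.uniform(0, 2 * np.pi, size=2 * n)
radius = np.concatenate([rng.uniform(0.0, 1.0, n), rng.uniform(2.0, 3.0, n)])
x = radius * np.cos(angle)
y = radius * np.sin(angle)
label = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

# Engineered feature: distance from the centre (0, 0).
dist = np.hypot(x, y)

# One threshold on the new feature is enough to split the classes.
threshold = 1.5
pred = (dist > threshold).astype(int)
accuracy = (pred == label).mean()
print(accuracy)
```

No single line through the (x, y) plane can do this, because the ring surrounds the inner cluster on every side. On the one-dimensional `dist` axis the same data is separated by a single cut.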

Common moves

Ratios & differences: price ÷ sqft

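A ratio often carries the real signal: price per square foot is usually more informative than price and size as separate columns. Differences work the same way when the gap between two values is what matters.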
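A short pandas sketch of the ratio idea follows. The `homes` table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical listings; column names are illustrative.
homes = pd.DataFrame({
    "price": [300_000, 450_000, 450_000],
    "sqft": [1_500, 1_500, 3_000],
})

# Ratio feature: price per square foot.
homes["price_per_sqft"] = homes["price"] / homes["sqft"]
print(homes)
```

The second and third homes have the same price, yet the ratio shows one costs twice as much per square foot as the other. In real data, check for zero or missing denominators before dividing.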

Interactions: a × b

Multiply features when their combination matters, not each alone.
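As a rough illustration, assume a hypothetical table of land plots where the outcome depends on area rather than on width or depth individually.

```python
import pandas as pd

# Hypothetical plots of land; column names are illustrative.
plots = pd.DataFrame({
    "width_m": [10, 20, 5],
    "depth_m": [30, 15, 20],
})

# Interaction feature: the product of the two columns.
plots["area_m2"] = plots["width_m"] * plots["depth_m"]
print(plots)
```

The first two plots look quite different column by column but have the same area, which is something a linear model cannot read off from `width_m` and `depth_m` alone.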

Date parts: timestamp → day, month, is_weekend

Explode a timestamp into the pieces a model can use.
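One way to do this in pandas is through the `.dt` accessor. The `events` table below is made up for the example.

```python
import pandas as pd

# Hypothetical event log with one timestamp column.
events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-03-15 09:30",  # a Friday
        "2024-03-16 22:05",  # a Saturday
        "2024-12-01 13:00",  # a Sunday
    ])
})

ts = events["timestamp"].dt
events["day"] = ts.day
events["month"] = ts.month
events["hour"] = ts.hour
events["day_of_week"] = ts.dayofweek      # Monday = 0 ... Sunday = 6
events["is_weekend"] = ts.dayofweek >= 5  # Saturday or Sunday
print(events)
```

Which parts are worth keeping depends on the problem: hour of day may matter for traffic, month for seasonal sales.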

Binning: age → bracket

Group a continuous value into ranges when the relationship is steppy, not smooth.
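A sketch using `pandas.cut` is shown here. The bin edges and bracket names are assumptions picked for illustration, so choose ones that match where the relationship actually steps in your data.

```python
import pandas as pd

people = pd.DataFrame({"age": [5, 17, 18, 34, 65, 80]})

# Bin edges and labels are illustrative assumptions.
edges = [0, 18, 35, 65, 120]
labels = ["child", "young_adult", "adult", "senior"]

# right=False makes each bin include its left edge: [0, 18), [18, 35), ...
people["age_bracket"] = pd.cut(
    people["age"], bins=edges, labels=labels, right=False
)
print(people)
```

With `right=False`, an 18-year-old lands in `young_adult` and a 65-year-old in `senior`. Values outside the outermost edges become missing, so make the edges wide enough.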

Aggregations: avg, count, max

Summarise related rows — a user's average order, total visits, last-seen gap.
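The sketch below builds per-user features from a hypothetical `orders` table with `groupby`. The column names and the `as_of` reference date are assumptions made for the example.

```python
import pandas as pd

# Hypothetical order history; one row per order.
orders = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "amount": [20.0, 40.0, 5.0, 10.0, 15.0],
    "ordered_at": pd.to_datetime([
        "2024-01-01", "2024-01-10", "2024-01-03", "2024-01-04", "2024-01-08",
    ]),
})

# One row per user, summarising that user's orders.
user_features = orders.groupby("user_id").agg(
    avg_order=("amount", "mean"),
    order_count=("amount", "count"),
    max_order=("amount", "max"),
    last_order=("ordered_at", "max"),
).reset_index()

# Last-seen gap, measured from an assumed prediction date.
as_of = pd.Timestamp("2024-01-15")
user_features["days_since_last_order"] = (
    as_of - user_features["last_order"]
).dt.days
print(user_features)
```

When these features feed a model, compute them only from rows dated before the moment of prediction. Otherwise the aggregate quietly includes future information.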

Transforms: log, sqrt

Tame skewed values so big outliers don't dominate.
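A small NumPy sketch of the effect, using invented income values with one extreme outlier:

```python
import numpy as np

# Hypothetical incomes: four typical values and one extreme outlier.
incomes = np.array([20_000, 35_000, 50_000, 80_000, 5_000_000], dtype=float)

log_income = np.log1p(incomes)  # log(1 + x), so x = 0 is still defined
sqrt_income = np.sqrt(incomes)

# How far the largest value sits from the median, before and after.
raw_ratio = incomes.max() / np.median(incomes)
log_ratio = log_income.max() / np.median(log_income)
sqrt_ratio = sqrt_income.max() / np.median(sqrt_income)
print(raw_ratio, sqrt_ratio, log_ratio)
```

On the raw scale the outlier is 100 times the median. After the square root it is about 10 times, and after the log it is well under 2 times. Both transforms assume non-negative inputs, so negative values need a different treatment.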

Related preprocessing steps

Feature engineering overlaps with a few cousin steps, each with its own article:

Encoding: categories → numbers

Turn text categories into numeric form — see Encoding Categorical Variables.

Scaling: put on one scale

Normalise ranges so that no feature dominates simply because of its units (see Feature Scaling).

Missing values: fill the gaps

Decide what to do with blanks — see Handling Missing Values.

Do's and don'ts

Do
  • Lean on domain knowledge — what actually drives the outcome?
  • Fit transforms on training data only, then apply to test
  • Check that a new feature truly helps on validation
Don't
  • Leak the target into a feature (data leakage)
  • Use future information unavailable at prediction time
  • Add hundreds of features blindly — invites overfitting
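The "fit on training data only" rule can be sketched with a hand-rolled standardisation. The data here is random and the 80/20 split is an arbitrary assumption. The same pattern applies to any transform that learns statistics from data.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=50.0, scale=10.0, size=100)  # hypothetical feature

# Split first, so the test rows never influence the fitted statistics.
train, test = values[:80], values[80:]

# "Fit": learn the mean and standard deviation from training rows only.
mean, std = train.mean(), train.std()

# "Apply": reuse those same statistics on both splits.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(), train_scaled.std())
print(test_scaled.mean(), test_scaled.std())
```

The training split ends up with mean 0 and standard deviation 1 by construction, while the test split will usually be slightly off. That is expected: computing `mean` and `std` on the full dataset instead would leak test information into training.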

Note on deep learning

Neural networks can learn features automatically from raw signals (pixels, audio, text), which is why manual feature engineering matters less there. For classic tabular ML it still matters enormously.