Encoding Categorical Variables · Suman Bhadra Notes

Models can't read "red"

Almost every algorithm needs numbers as input — yet so much real data is categorical: colours, cities, brands, days of the week. Encoding turns those labels into numbers a model can use.

The catch: how you encode can accidentally tell the model something false. The two main schemes differ exactly on that point.

Two ways to encode a colour column

The same colour column becomes numbers in two different ways: label encoding assigns one integer per category, while one-hot encoding creates one binary column per category.

The schemes

Label / Ordinal: red=0, green=1, blue=2

One integer per category. Compact, but the integers imply an order and a spacing that may not exist, which makes plain label encoding a poor fit for unordered categories like colours. Reserve ordinal encoding for genuinely ranked categories (small < medium < large).
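A minimal plain-Python sketch of both variants. The helper name `fit_label_mapping` and the sample columns are made up for illustration; the label mapping uses order of first appearance, which is an arbitrary choice, while the size ranking is written out by hand.

```python
def fit_label_mapping(values):
    """Assign integers in order of first appearance (an arbitrary choice)."""
    return {category: code for code, category in enumerate(dict.fromkeys(values))}

# Label encoding: the integers are just identifiers, the order carries no meaning.
colours = ["red", "green", "blue", "green", "red"]
colour_codes = fit_label_mapping(colours)
encoded_colours = [colour_codes[c] for c in colours]
print(colour_codes)     # {'red': 0, 'green': 1, 'blue': 2}
print(encoded_colours)  # [0, 1, 2, 1, 0]

# Ordinal encoding: the ranking is stated explicitly, so the order is real.
SIZE_ORDER = ["small", "medium", "large"]
size_codes = {size: rank for rank, size in enumerate(SIZE_ORDER)}
sizes = ["medium", "small", "large", "small"]
encoded_sizes = [size_codes[s] for s in sizes]
print(encoded_sizes)    # [1, 0, 2, 0]
```

The only difference between the two is where the order comes from: the data's accident of appearance, or a ranking someone deliberately wrote down.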

One-Hot: 3 binary columns

A separate 0/1 column per category. No fake order, but widens the table.
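As a rough sketch, one-hot encoding can be written in a few lines of plain Python. The function name `one_hot_encode` and the sample column are illustrative assumptions, not part of any library.

```python
def one_hot_encode(values, categories):
    """Return one row per value, with a 1 in the column of its category."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]

categories = ["red", "green", "blue"]   # column order of the output
colours = ["red", "blue", "green", "red"]
rows = one_hot_encode(colours, categories)

for colour, row in zip(colours, rows):
    print(colour, row)
# red [1, 0, 0]
# blue [0, 0, 1]
# green [0, 1, 0]
# red [1, 0, 0]
```

One input column became three output columns here, which is the widening the text refers to: a column with 500 categories would become 500 columns.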

Target / frequency: high cardinality

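Replace each category with a statistic, such as the mean of the target for that category or how often the category occurs. Handy when a column has hundreds of categories. Guard against leakage: compute the statistic from training rows only, since a target mean that includes the row being encoded (or any test row) hands the model part of the answer.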
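A small sketch of both statistics, assuming a made-up `city` column and a binary target. The helper names are hypothetical, and falling back to the global training mean for an unseen category is one possible choice, not the only one.

```python
from collections import Counter, defaultdict

def fit_frequency(train_values):
    """Share of training rows that fall in each category."""
    counts = Counter(train_values)
    total = len(train_values)
    return {category: n / total for category, n in counts.items()}

def fit_target_mean(train_values, train_targets):
    """Mean target per category, computed from training rows only."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, target in zip(train_values, train_targets):
        sums[value] += target
        counts[value] += 1
    return {category: sums[category] / counts[category] for category in sums}

train_city = ["paris", "paris", "rome", "oslo"]
train_y = [1, 0, 1, 0]

frequency = fit_frequency(train_city)
target_mean = fit_target_mean(train_city, train_y)
global_mean = sum(train_y) / len(train_y)

# Apply the learned statistics to new rows; unseen categories get the global mean.
test_city = ["rome", "lima"]
encoded_test = [target_mean.get(city, global_mean) for city in test_city]
print(frequency)     # {'paris': 0.5, 'rome': 0.25, 'oslo': 0.25}
print(encoded_test)  # [1.0, 0.5]
```

Note that "rome" is encoded as 1.0 from a single training row. With few rows per category the statistic is noisy, which is why practical implementations often smooth it or compute it out-of-fold.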

The trap of label encoding

Fake order

Encoding red=0, green=1, blue=2 tells a linear model that blue (2) is "twice" green (1) and that green sits exactly between red and blue. For unordered categories that's nonsense, and a model that relies on numeric magnitude can learn a spurious trend from it or miss the real pattern entirely.
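To make the problem concrete, here is a sketch that fits a straight line to label-encoded colours. The prices are invented for illustration, and `fit_line` is a hand-written ordinary least squares helper for a single feature.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Label codes: red=0, green=1, blue=2. Made-up prices where green is the expensive one.
codes = [0, 1, 2]
prices = [10.0, 30.0, 10.0]

slope, intercept = fit_line(codes, prices)
line_predictions = [intercept + slope * code for code in codes]
print(slope)             # 0.0
print(line_predictions)  # the same value (about 16.67) for every colour
```

Because green sits "between" red and blue on the number line, the best straight line is flat and predicts the same price for all three colours. With one column per colour, a linear model could instead give each colour its own coefficient and recover 10, 30, 10.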

Use label/ordinal when

The categories have a real order (small < medium < large)
You use a tree-based model (splits don't assume spacing)
Cardinality is high and one-hot would explode

Use one-hot when

Categories are unordered (colour, city)
You use a linear model, SVM, or neural net
The number of categories is small

Practical notes

Dummy trap: drop one column

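For linear models with an intercept, drop one one-hot column. The full set of columns always sums to 1, which duplicates the intercept and makes the features perfectly collinear. The dropped category becomes the baseline that the other coefficients are measured against.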
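The collinearity is easy to see in a small sketch. The sample column is made up, and dropping the first column ("red") is an arbitrary choice of baseline.

```python
categories = ["red", "green", "blue"]
colours = ["red", "blue", "green", "red"]

# Full one-hot encoding: every row sums to 1, the same as an intercept column.
full = [[1 if colour == category else 0 for category in categories]
        for colour in colours]
row_sums = [sum(row) for row in full]
print(row_sums)  # [1, 1, 1, 1]

# Drop the first column: "red" becomes the baseline, encoded as all zeros.
reduced = [row[1:] for row in full]
print(reduced)   # [[0, 0], [0, 1], [1, 0], [0, 0]]
```

No information is lost: a row of all zeros can only mean the baseline category. Libraries expose this as an option, for example `drop_first=True` in `pandas.get_dummies` or `drop="first"` in scikit-learn's `OneHotEncoder`.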

Unseen categories at predict time

Decide in advance how to handle a category that never appeared in training: map it to a dedicated "unknown" code, or ignore it (for one-hot, an all-zero row).
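Both options can be sketched in plain Python. The names `UNKNOWN_CODE`, `encode_label`, and `encode_one_hot` are hypothetical, and reserving the next free integer for unknowns is just one convention.

```python
train_colours = ["red", "green", "blue", "red"]

# Mapping learned from training data only, in order of first appearance.
mapping = {c: i for i, c in enumerate(dict.fromkeys(train_colours))}
UNKNOWN_CODE = len(mapping)  # one extra code reserved for unseen categories

def encode_label(values):
    """Option 1: send anything unseen to a dedicated 'unknown' code."""
    return [mapping.get(value, UNKNOWN_CODE) for value in values]

def encode_one_hot(values):
    """Option 2: ignore unseen categories, leaving an all-zero row."""
    return [[1 if value == category else 0 for category in mapping]
            for value in values]

new_colours = ["green", "purple"]   # "purple" never appeared in training
label_result = encode_label(new_colours)
one_hot_result = encode_one_hot(new_colours)
print(label_result)    # [1, 3]
print(one_hot_result)  # [[0, 1, 0], [0, 0, 0]]
```

Without such a rule, a plain dictionary lookup would raise a `KeyError` on "purple" at predict time. Note that the all-zero row collides with the baseline category if one one-hot column has been dropped, so the two choices need to be made together.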

Fit on train: no leakage

Learn the category→number mapping on training data only, then apply that same mapping unchanged to validation, test, and production data.
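A short sketch of why the mapping has to be learned once and reused. The helper `fit_label_mapping` and both sample columns are illustrative assumptions.

```python
def fit_label_mapping(values):
    """Assign integers in order of first appearance."""
    return {category: code for code, category in enumerate(dict.fromkeys(values))}

train_colours = ["red", "green", "blue", "green"]
test_colours = ["blue", "red", "blue"]

# Correct: fit on train, then reuse the same mapping on test.
train_mapping = fit_label_mapping(train_colours)
test_encoded = [train_mapping[c] for c in test_colours]
print(test_encoded)   # [2, 0, 2]

# Mistake: refitting on test gives the same colours different codes.
refit_mapping = fit_label_mapping(test_colours)
test_refit = [refit_mapping[c] for c in test_colours]
print(test_refit)     # [0, 1, 0]
```

In the refit version "blue" is 0 at predict time but was 2 during training, so the model would silently read every row as a different colour. For target or frequency encoding, refitting on test data additionally leaks information the model should never have seen.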