Encoding Categorical Variables
Models can't read "red"
Most machine learning algorithms expect numeric input, yet much real data is categorical: colours, cities, brands, days of the week. Encoding turns those labels into numbers a model can use.
The catch: how you encode can accidentally tell the model something false. The two main schemes differ exactly on that point.
Two ways to encode a colour column
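The same colour column becomes numbers in two different ways: label encoding assigns one integer per category, and one-hot encoding creates one binary column per category. With red, green and blue, label encoding gives 0, 1, 2, while one-hot encoding gives the rows (1, 0, 0), (0, 1, 0) and (0, 0, 1).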
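A minimal sketch of both schemes in plain Python, using a made-up colour column. The variable names are illustrative, and a real project would more likely use a library encoder than hand-written comprehensions.

```python
# Illustrative colour column (made-up data).
colours = ["red", "green", "blue", "green", "red"]

# Fix the category order explicitly so the mapping is reproducible.
categories = ["red", "green", "blue"]

# Label encoding: one integer per category.
label_map = {cat: i for i, cat in enumerate(categories)}
label_encoded = [label_map[c] for c in colours]

# One-hot encoding: one 0/1 column per category, exactly one 1 per row.
one_hot_encoded = [[int(c == cat) for cat in categories] for c in colours]

print(label_encoded)
for colour, row in zip(colours, one_hot_encoded):
    print(colour, row)
```

The label-encoded column stays one column wide, while the one-hot version needs as many columns as there are categories.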
The schemes
Label / ordinal encoding: one integer per category. Compact, but the integers imply an order that may not exist, which is bad for unordered categories like colours. Reserve ordinal encoding for genuinely ranked categories (small < medium < large).
One-hot encoding: a separate 0/1 column per category. No fake order, but it widens the table by one column per category.
Target / frequency encoding: replace a category with a statistic (mean target, frequency). Handy for hundreds of categories, but guard against leakage by computing the statistic from training data only.
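The statistic-based scheme can be sketched as follows. The city column, the 0/1 target and the helper name encode_target are all hypothetical, and falling back to the global training mean for unseen categories is one common choice, not the only one.

```python
from collections import Counter, defaultdict

# Hypothetical training rows: (city, target). Made-up values.
train = [("paris", 1), ("paris", 0), ("rome", 1), ("rome", 1), ("oslo", 0)]

# Frequency encoding: share of training rows that fall in each category.
counts = Counter(city for city, _ in train)
freq_map = {city: n / len(train) for city, n in counts.items()}

# Target (mean) encoding: average target per category, training rows only.
sums = defaultdict(float)
for city, y in train:
    sums[city] += y
target_map = {city: sums[city] / counts[city] for city in counts}

# Assumed fallback for categories absent from training: the global mean.
global_mean = sum(y for _, y in train) / len(train)


def encode_target(city):
    """Return the learned mean target, or the global mean if unseen."""
    return target_map.get(city, global_mean)


print(freq_map)
print(target_map)
print(encode_target("lima"))
```

Both maps are built from the training rows alone, which is the basic guard against leakage. Note that "oslo" has a single row, so its mean is a noisy estimate; smoothing towards the global mean or out-of-fold estimates are often used to reduce that effect.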
The trap of label encoding
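Encoding red=0, green=1, blue=2 tells a linear model that blue (2) is "twice" green (1) and that green sits exactly between red and blue. For unordered categories that is nonsense, and it can hurt the model: a single coefficient must explain all three colours, so the model cannot give red and blue similar effects while treating green differently.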
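A small worked example makes the trap concrete. The targets below are made up so that red and blue are both high and green is low; the fit uses the closed-form least-squares line for one feature. This is a sketch under those assumptions, not a benchmark.

```python
# Made-up targets: red and blue are both high, green is low.
label = {"red": 0, "green": 1, "blue": 2}
rows = [("red", 10.0), ("green", 0.0), ("blue", 10.0)] * 2

xs = [label[c] for c, _ in rows]
ys = [y for _, y in rows]

# Ordinary least squares for y = a + b * x (closed form, one feature).
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
var_x = sum((x - mean_x) ** 2 for x in xs)
b = cov_xy / var_x
a = mean_y - b * mean_x
label_pred = {c: a + b * code for c, code in label.items()}

# With one-hot columns and no intercept, least squares gives each category
# its own coefficient, which equals that category's mean target.
one_hot_pred = {}
for c in label:
    targets = [y for cc, y in rows if cc == c]
    one_hot_pred[c] = sum(targets) / len(targets)

print(label_pred)
print(one_hot_pred)
```

On this data the label-encoded line has zero slope and predicts the overall mean for every colour, because no straight line through 0, 1, 2 can be high at both ends and low in the middle. The one-hot fit recovers each colour's own mean.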
Label encoding is fine when:
- The categories have a real order (small < medium < large)
- You use a tree-based model (splits don't assume spacing)
- Cardinality is high and one-hot would explode
Prefer one-hot encoding when:
- Categories are unordered (colour, city)
- You use a linear model, SVM, or neural net
- The number of categories is small
Practical notes
For linear models, drop one one-hot column to avoid perfect collinearity.
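Why dropping a column helps can be seen in a few lines of plain Python. The data is made up, and choosing the first category as the one to drop is an arbitrary assumption.

```python
categories = ["red", "green", "blue"]
colours = ["red", "green", "blue", "green"]

# Full one-hot encoding: one column per category.
full = [[int(c == cat) for cat in categories] for c in colours]

# Every row sums to 1, so the columns add up to the intercept column.
# That exact linear dependence is the perfect collinearity.
row_sums = [sum(row) for row in full]

# Drop the first category ("red"); it becomes the baseline, encoded as all zeros.
reduced = [row[1:] for row in full]

print(row_sums)
for colour, row in zip(colours, reduced):
    print(colour, row)
```

Libraries expose this as an option, for example drop_first=True in pandas get_dummies or drop="first" in scikit-learn's OneHotEncoder. The coefficients of the remaining columns are then read relative to the dropped baseline category.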
Decide how to handle a category not seen in training: map it to a dedicated "unknown" value, or ignore it (for one-hot, encode it as all zeros).
Learn the category→number mapping on training data, then apply it everywhere.
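The last two notes can be sketched together. LabelMapper is a hypothetical helper written for this example, not a library class, and reserving -1 for unseen categories is an assumption.

```python
class LabelMapper:
    """Hypothetical helper: learns category -> integer on training data."""

    UNKNOWN = -1  # assumed code reserved for categories unseen in training

    def fit(self, values):
        # Sort so the mapping does not depend on row order.
        self.mapping_ = {cat: i for i, cat in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        # Reuse the learned mapping; never re-learn it on new data.
        return [self.mapping_.get(v, self.UNKNOWN) for v in values]


train = ["red", "green", "blue", "green"]
test = ["blue", "purple", "red"]

mapper = LabelMapper().fit(train)   # learn on training data only
train_codes = mapper.transform(train)
test_codes = mapper.transform(test)  # "purple" was never seen

print(mapper.mapping_)
print(train_codes)
print(test_codes)
```

Fitting once and reusing the same mapping keeps the codes consistent between training and prediction. Refitting on the test set could silently assign different integers to the same category.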
This is one piece of Feature Engineering; numeric columns usually still need Feature Scaling.