Feature Selection
More features, more problems
More features ≠ better model. Every irrelevant column is noise the model will happily fit — inviting overfitting — plus slower training and a model nobody can explain. Feature selection keeps only the subset that earns its place.
Feature engineering creates candidate columns; feature selection decides which of them stay. With 200 features and 2,000 rows, a model has plenty of freedom to memorise coincidences — a junk column that happens to correlate with the target in this sample looks like signal until test day.
Selection keeps a subset of your original columns; PCA extracts brand-new combined ones. If you need to point at a column and say "this is income", you want selection.
Three families of methods
Filter methods: rank each feature on its own, with no model involved. Cheap and fast, so they run first.
Wrapper methods: try subsets and keep whichever scores best. Powerful, but every try retrains the model.
Embedded methods: the model prunes features while it fits. Lasso zeroes them; trees rank them.
Watch the cut happen
Ten candidate features, half of them noise. See a filter gray out the weak ones, forward selection climb-then-dip as junk features join, and the lasso path squeeze coefficients to exactly zero.
Filter methods — score each feature alone
Filters look at one feature at a time, independent of any model, and keep the top of the ranking. They're the quick first pass when you have hundreds of columns.
Correlation: how strongly the feature tracks the target. Fast, but it only sees linear relationships.
Mutual information: how much knowing the feature tells you about the target. It catches non-linear signal that correlation misses.
Variance threshold: a column that barely changes can't predict anything. Drop near-constant features for free.
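As a rough illustration of a filter pass, the sketch below drops near-constant columns and then ranks the rest by absolute Pearson correlation with the target. The data, the column names, and the `min_variance` cutoff are all made up for the example. The column `x` fully determines the target (the target is `x` squared) yet scores zero, which is the linear blind spot of correlation.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def filter_rank(columns, target, min_variance=1e-3):
    """Drop near-constant columns, then rank the rest by |correlation| with the target."""
    kept = {name: col for name, col in columns.items()
            if statistics.pvariance(col) > min_variance}
    scores = {name: abs(pearson(col, target)) for name, col in kept.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy data, invented for illustration.
x = [-3, -2, -1, 0, 1, 2, 3]
target = [v * v for v in x]
columns = {
    "x": x,                        # determines the target, but not linearly
    "abs_x": [abs(v) for v in x],  # tracks the target almost linearly
    "constant": [5.0] * len(x),    # zero variance, removed before scoring
}

ranking = filter_rank(columns, target)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

A mutual-information score would be one way to give `x` the credit that correlation withholds here.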
Filters score solo auditions, never the duet. Two features can be useless alone but golden together — each looks like noise on its own, yet their combination nails the target. A filter drops both.
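The classic instance of this is XOR. The sketch below builds a small made-up dataset where the target is `a XOR b`, then measures the best accuracy any rule could reach using one feature alone versus both together. The helper `best_lookup_accuracy` is a hypothetical name for the example, not a library function.

```python
from collections import Counter, defaultdict
from itertools import product

def best_lookup_accuracy(keys, target):
    """Accuracy of the best rule that predicts the majority target for each distinct key."""
    groups = defaultdict(Counter)
    for key, label in zip(keys, target):
        groups[key][label] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return correct / len(target)

# 100 balanced rows over two binary features; the target is their XOR.
rows = list(product([0, 1], repeat=2)) * 25
target = [a ^ b for a, b in rows]

solo_a = best_lookup_accuracy([a for a, _ in rows], target)
solo_b = best_lookup_accuracy([b for _, b in rows], target)
duet = best_lookup_accuracy(rows, target)

print(solo_a, solo_b, duet)  # 0.5 0.5 1.0
```

Each feature alone is no better than a coin flip, so any one-at-a-time score ranks both at the bottom, even though the pair predicts the target perfectly.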
Wrapper methods — let the model judge
Wrappers search over subsets of features using the model itself as the scorer. They catch interactions filters miss — at a price: every candidate subset means another round of training.
Forward selection: start with no features and repeatedly add whichever one improves the validation score most. Stop when adding stops helping.
Backward elimination: train on everything, remove the least useful feature, and repeat until the score suffers.
Recursive feature elimination (RFE): fit, rank features by the model's own weights, drop the weakest, and recurse.
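Each step retrains the model once per remaining candidate. With 100 features, running forward selection to the end takes 100 + 99 + ... + 1 = 5,050 fits, and cross-validation multiplies that by the number of folds. Wrappers are for when you've already filtered down to a shortlist.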
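To make the cost concrete, here is a minimal sketch of greedy forward selection that also counts model fits. It assumes a `score(subset)` callable that trains a model on the subset and returns a validation score. The `toy_score` function and the feature names below are invented stand-ins for that, so no real training happens.

```python
def forward_select(candidates, score):
    """Greedy forward selection.

    `score(subset)` is assumed to train on `subset` and return a validation
    score (higher is better). Returns the chosen features and the number of
    calls to `score`, i.e. the number of model fits.
    """
    chosen, best, fits = [], float("-inf"), 0
    remaining = list(candidates)
    while remaining:
        trials = []
        for feature in remaining:
            trials.append((score(chosen + [feature]), feature))
            fits += 1
        top_score, top_feature = max(trials)
        if top_score <= best:  # adding anything no longer helps: stop
            break
        chosen.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return chosen, fits

# Made-up scoring: three features carry signal, every junk feature costs a little.
SIGNAL = {"age": 0.30, "income": 0.25, "tenure": 0.10}

def toy_score(subset):
    return sum(SIGNAL.get(feature, -0.01) for feature in subset)

features = ["age", "income", "tenure"] + [f"junk_{i}" for i in range(7)]
chosen, fits = forward_select(features, toy_score)
print(chosen, fits)  # ['age', 'income', 'tenure'] 34

# With no early stop, n candidates cost n + (n - 1) + ... + 1 fits.
n = 100
worst_case_fits = n * (n + 1) // 2
print(worst_case_fits)  # 5050
```

Even on ten candidates with an early stop, the search costs 34 fits (10 + 9 + 8 + 7), which is why a filter pass beforehand pays off.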
Embedded methods — selection during training
Some models do the pruning as a side effect of fitting — no separate search needed.
Lasso: the L1 penalty pushes weak coefficients to exactly zero, so the model drops features itself. See Ridge vs. Lasso.
Tree importance: random forests and boosted trees report how much each feature contributed to splits, giving a ready-made ranking.
Correlated features share importance: two near-duplicates can each look about half as useful as either would on its own. Weak-looking ≠ useless.
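Why the L1 penalty gives exact zeros can be seen in a simplified setting. Assuming an orthonormal design and a penalty scaled so the threshold equals `lam`, the lasso coefficient is the least-squares coefficient passed through a soft-threshold, while the matching ridge coefficient is only rescaled. The coefficient values and feature names below are made up for illustration.

```python
def soft_threshold(value, lam):
    """Shrink `value` toward zero by `lam`; anything inside [-lam, lam] becomes exactly 0.0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

# Hypothetical least-squares coefficients (orthonormal-design assumption).
ols = {"income": 2.0, "age": -0.8, "junk_a": 0.15, "junk_b": -0.05}
lam = 0.2

lasso = {name: soft_threshold(coef, lam) for name, coef in ols.items()}
ridge = {name: coef / (1 + lam) for name, coef in ols.items()}  # shrinks, never zeroes

survivors = [name for name, coef in lasso.items() if coef != 0.0]
print(survivors)  # ['income', 'age']
print(ridge)
```

Real lasso solvers handle correlated columns iteratively, so this is only the intuition, not a general implementation.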
Pitfalls
Selecting features using all the data before splitting lets the test set vote on which features survive — that's data leakage, and it inflates your scores. Do selection inside the cross-validation loop, so each fold picks features from its own training data only.
Do:
- Select inside the CV loop (a pipeline makes this automatic)
- Check the chosen set is stable across folds; wildly different picks each fold is a warning sign
- Filter first, then spend wrapper budget on the shortlist
Don't:
- Score features on the full dataset before splitting
- Trust rankings blindly when features are correlated; correlation makes rankings unstable
- Re-select over and over to chase the validation score; you'll overfit the validation set itself
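The "select inside the loop" pattern can be sketched by hand as follows. This is a minimal illustration under stated assumptions: the data is made up, `select_top_k` is a simple correlation filter, and `threshold_accuracy` is a tiny stand-in model. In practice a library pipeline would normally do this for you.

```python
import math
from collections import Counter

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def select_top_k(rows, y, k=1):
    """Filter step: indices of the k columns most correlated with y."""
    n_cols = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], y)) for j in range(n_cols)]
    return sorted(range(n_cols), key=lambda j: scores[j], reverse=True)[:k]

def threshold_accuracy(train_x, train_y, test_x, test_y):
    """Stand-in model: split one feature at the midpoint of the two class means."""
    mean1 = sum(x for x, t in zip(train_x, train_y) if t == 1) / train_y.count(1)
    mean0 = sum(x for x, t in zip(train_x, train_y) if t == 0) / train_y.count(0)
    cut = (mean0 + mean1) / 2
    preds = [int((x > cut) == (mean1 > mean0)) for x in test_x]
    return sum(p == t for p, t in zip(preds, test_y)) / len(test_y)

def cross_validate(rows, y, n_folds=3, k=1):
    scores, picks = [], Counter()
    for fold in range(n_folds):
        test = [i for i in range(len(rows)) if i % n_folds == fold]
        train = [i for i in range(len(rows)) if i % n_folds != fold]
        train_rows, train_y = [rows[i] for i in train], [y[i] for i in train]
        # Selection runs on training rows only: the held-out rows get no vote.
        cols = select_top_k(train_rows, train_y, k)
        picks.update(cols)
        j = cols[0]
        scores.append(threshold_accuracy(
            [r[j] for r in train_rows], train_y,
            [rows[i][j] for i in test], [y[i] for i in test]))
    return scores, picks

# Made-up data: column 0 carries the signal, columns 1 and 2 are arbitrary patterns.
y = [i % 2 for i in range(12)]
rows = [[y[i] + 0.1 * (i % 3), (i * 7) % 5, (i // 2) % 3] for i in range(12)]

scores, picks = cross_validate(rows, y)
print(scores)  # [1.0, 1.0, 1.0]
print(picks)   # Counter({0: 3}): the same column wins in every fold
```

The `picks` counter doubles as the stability check from the list above: if the counts were spread thinly across many columns, the selection would be tracking noise in each fold rather than a real signal.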