Feature Selection · Suman Bhadra Notes

More features, more problems

More features ≠ better model. Every irrelevant column is noise the model will happily fit — inviting overfitting — plus slower training and a model nobody can explain. Feature selection keeps only the subset that earns its place.

Feature engineering creates candidate columns; feature selection decides which of them stay. With 200 features and 2,000 rows, a model has plenty of freedom to memorise coincidences — a junk column that happens to correlate with the target in this sample looks like signal until test day.
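A rough way to see the "junk column that looks like signal" effect is to generate nothing but noise and look for the best correlation anyway. This is a minimal sketch on synthetic data (the 200 x 2,000 shape comes from the paragraph above; the random seed and everything else are assumptions made for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features = 2000, 200

# Every column is pure noise, and so is the target: there is no real signal here.
X = rng.standard_normal((n_rows, n_features))
y = rng.standard_normal(n_rows)

# Pearson correlation of each column with the target.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = (Xc.T @ yc) / np.sqrt((Xc**2).sum(axis=0) * (yc**2).sum())

best = int(np.argmax(np.abs(corr)))
print(f"strongest junk column: {best}, |corr| = {abs(corr[best]):.3f}")
print(f"average |corr| over all columns: {np.abs(corr).mean():.3f}")
```

The strongest column stands well clear of the average even though none of them carries information, which is exactly the coincidence a flexible model can memorise.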

Selection vs. extraction

Selection keeps a subset of your original columns; PCA extracts brand-new combined ones. If you need to point at a column and say "this is income", you want selection.
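The difference is easiest to see in the output columns. The sketch below assumes scikit-learn and uses made-up column names (income, age, tenure and two noise columns are hypothetical); SelectKBest stands in for "selection" and PCA for "extraction".

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
names = np.array(["income", "age", "tenure", "noise_a", "noise_b"])  # hypothetical names
X = rng.standard_normal((500, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + 0.5 * rng.standard_normal(500)

# Selection: the output columns are untouched copies of original columns.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
kept = selector.get_support(indices=True)
X_selected = selector.transform(X)
print("selection keeps:", names[kept])

# Extraction: each output column is a weighted mix of all the original columns.
pca = PCA(n_components=2).fit(X)
X_extracted = pca.transform(X)
print("loadings of first PCA component:", np.round(pca.components_[0], 2))
```

After selection you can still point at a column and name it; after PCA each new column has a loading on every original feature.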

Three families of methods

Filter: score, then cut

Rank each feature on its own — no model involved. Cheap, fast, runs first.

Wrapper: search with the model

Try subsets, keep whichever scores best. Powerful, but every try retrains the model.

Embedded: built into training

The model prunes features while it fits — lasso zeroes them, trees rank them.

Watch the cut happen

Interactive demo: ten candidate features, half of them noise. A filter grays out the weak ones, forward selection's score climbs and then dips as junk features join, and the lasso path squeezes coefficients to exactly zero.

Filter methods — score each feature alone

Filters look at one feature at a time, independent of any model, and keep the top of the ranking. They're the quick first pass when you have hundreds of columns.

Correlation: |corr(x, target)|

How strongly the feature tracks the target. Fast — but only sees linear relationships.

Mutual information: any dependency

Measures how much knowing the feature tells you about the target — catches non-linear signal correlation misses.
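To compare the two scores side by side, here is a minimal sketch on synthetic data, assuming scikit-learn's mutual_info_regression as the mutual-information estimator. The target depends on one feature only through its square, so the relationship is strong but not linear.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 2000
x_curved = rng.uniform(-1.0, 1.0, n)   # drives the target through x**2
x_noise = rng.uniform(-1.0, 1.0, n)    # unrelated to the target
y = x_curved**2 + 0.05 * rng.standard_normal(n)

X = np.column_stack([x_curved, x_noise])

# Filter score 1: absolute Pearson correlation with the target.
abs_corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# Filter score 2: estimated mutual information with the target (in nats).
mi = mutual_info_regression(X, y, random_state=0)

for name, c, m in zip(["x_curved", "x_noise"], abs_corr, mi):
    print(f"{name}: |corr| = {c:.3f}, MI = {m:.3f}")
```

Correlation rates both columns as near-useless, while mutual information clearly separates the curved feature from the noise one.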

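Variance threshold: drop the flat

A column that barely changes gives the model almost nothing to fit on. Dropping near-constant features is nearly free, though variance depends on the column's units, so compare columns on a common scale.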
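A variance cut can be sketched with scikit-learn's VarianceThreshold. The three columns and the 0.01 threshold below are assumptions chosen for illustration; in practice the threshold only makes sense relative to each column's scale.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 1000

rare_flag = np.zeros(n)
rare_flag[:3] = 1.0                      # non-zero in only 3 of 1,000 rows

X = np.column_stack([
    np.ones(n),                          # constant column
    rare_flag,                           # near-constant column
    rng.standard_normal(n),              # column with ordinary spread
])

# Keep only columns whose variance exceeds the threshold.
selector = VarianceThreshold(threshold=0.01).fit(X)
print("variances:", np.round(selector.variances_, 4))
print("kept:", selector.get_support())
```

Note that this filter never looks at the target, so it cannot tell a flat junk column from a rare but meaningful flag; it is only a cheap first pass.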

Blind spot: interactions

Filters score solo auditions, never the duet. Two features can be useless alone but golden together — each looks like noise on its own, yet their combination nails the target. A filter drops both.
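The textbook case of "useless alone, golden together" is XOR. The sketch below builds that case on synthetic data (an assumption, not something from a real dataset) and compares each feature's solo score with what a small decision tree achieves when it sees both.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000
a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)
y = a ^ b                      # target is the XOR of the two features
X = np.column_stack([a, b])

# Solo auditions: mutual information of each feature with the target.
solo_mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)

# The duet: a model that sees one feature versus both.
tree = DecisionTreeClassifier(random_state=0)
acc_a_only = cross_val_score(tree, X[:, [0]], y, cv=5).mean()
acc_both = cross_val_score(tree, X, y, cv=5).mean()

print("solo MI:", np.round(solo_mi, 4))
print(f"accuracy with a only: {acc_a_only:.2f}, with a and b: {acc_both:.2f}")
```

Any one-feature-at-a-time score ranks both columns at the bottom, yet the pair predicts the target almost perfectly.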

Wrapper methods — let the model judge

Wrappers search over subsets of features using the model itself as the scorer. They catch interactions filters miss — at a price: every candidate subset means another round of training.

Forward selection: start empty, add

Repeatedly add whichever feature improves validation most. Stop when adding stops helping.
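The add-then-stop loop can be written out by hand in a few lines. This is a sketch under stated assumptions: forward_select and its min_gain stopping threshold are hypothetical names invented here, the model is plain linear regression, and "validation" is taken to mean 5-fold cross-validated R².

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, min_gain=1e-3, cv=5):
    """Greedy forward selection; returns chosen columns and the score after each step."""
    remaining = list(range(X.shape[1]))
    chosen, history = [], []
    best_score = -np.inf
    while remaining:
        # Score every subset of the form "chosen so far + one more column".
        scores = {
            j: cross_val_score(LinearRegression(), X[:, chosen + [j]], y, cv=cv).mean()
            for j in remaining
        }
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best_score < min_gain:
            break  # adding another feature stopped helping
        best_score = scores[j_best]
        chosen.append(j_best)
        remaining.remove(j_best)
        history.append(best_score)
    return chosen, history

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 10))
# Only columns 0, 3 and 7 carry signal; the other seven are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 1.0 * X[:, 7] + 0.5 * rng.standard_normal(300)

chosen, history = forward_select(X, y)
print("chosen columns:", chosen)
print("score after each step:", np.round(history, 3))
```

Every pass through the loop fits one model per remaining candidate per fold, which is where the cost of wrappers comes from.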

Backward elimination: start full, drop

Train on everything, remove the least useful feature, repeat until the score suffers.

RFE: recursive feature elimination

Fit, rank features by the model's own weights, drop the weakest, and recurse.
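Both "start full, drop" strategies have ready-made versions in scikit-learn, which is what this sketch assumes: SequentialFeatureSelector with direction="backward" for backward elimination, and RFE for the weight-ranked variant. The data is synthetic and the target of three kept features is chosen by hand here rather than by a stopping rule.

```python
import numpy as np
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 10))
# Only columns 0, 3 and 7 carry signal; the other seven are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 1.0 * X[:, 7] + 0.5 * rng.standard_normal(300)

# Backward elimination: drop whichever column hurts the CV score least, repeat.
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="backward", cv=5
).fit(X, y)
backward_kept = backward.get_support(indices=True)

# RFE: fit once per round, drop the column with the smallest coefficient magnitude.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3, step=1).fit(X, y)
rfe_kept = np.flatnonzero(rfe.support_)

print("backward elimination kept:", backward_kept)
print("RFE kept:", rfe_kept)
print("RFE ranking (1 = kept):", rfe.ranking_)
```

RFE is much cheaper per round because it fits one model instead of one per candidate, but it trusts the coefficients as a ranking, so the features should be on comparable scales.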

The bill

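Each step retrains the model once per remaining candidate. With 100 features, running forward selection all the way to the full set is 100 + 99 + … + 1 = 5,050 fits, before multiplying by the number of cross-validation folds. Wrappers are for when you've already filtered down to a shortlist.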
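The arithmetic behind the bill is short enough to write down. The helper below is a hypothetical function written for this note; it assumes plain greedy forward selection where each step tries every remaining feature once.

```python
def forward_selection_fits(n_features, n_steps=None, cv_folds=1):
    """Count model fits for greedy forward selection run for n_steps additions."""
    if n_steps is None:
        n_steps = n_features                     # run all the way to the full set
    # Step k (counting from 0) tries each of the n_features - k remaining columns.
    candidates_per_step = [n_features - k for k in range(n_steps)]
    return sum(candidates_per_step) * cv_folds

print(forward_selection_fits(100))               # 5050
print(forward_selection_fits(100, cv_folds=5))   # 25250
print(forward_selection_fits(100, n_steps=10))   # 955
```

Stopping early helps, but the first few steps are the expensive ones, so shrinking the candidate pool with a filter first saves the most.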

Embedded methods — selection during training

Some models do the pruning as a side effect of fitting — no separate search needed.

L1 / lasso: coefficients → 0

The L1 penalty pushes weak coefficients to exactly zero — the model drops features itself. See Ridge vs. Lasso.
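A quick way to see the "exactly zero" behaviour is to fit ordinary least squares and lasso on the same data. This sketch assumes scikit-learn's Lasso and synthetic data with two real features and eight noise columns; alpha=0.2 is an arbitrary penalty strength picked for the illustration, not a recommended value.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n = 500
X = rng.standard_normal((n, 10))
# Only columns 0 and 1 carry signal; columns 2 to 9 are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.standard_normal(n)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)

# Columns whose lasso coefficient is non-zero are the ones the model kept.
kept = np.flatnonzero(lasso.coef_)

print("OLS coefficients:  ", np.round(ols.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
print("columns kept by lasso:", kept)
```

OLS gives every noise column a small non-zero weight, while lasso sets them to exactly 0.0 and shrinks the surviving coefficients toward zero as the price.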

Tree importances: ranking for free

Random forests and boosted trees report how much each feature contributed to splits — a ready-made ranking.

One caveat: split credit

Correlated features share importance: two near-duplicates can each look roughly half as useful as either would on its own. Weak-looking ≠ useless.
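The split-credit effect shows up if you fit the same forest with and without a near-duplicate column. This is a sketch on synthetic data, assuming scikit-learn's RandomForestRegressor and its impurity-based feature_importances_; the "twin" column is a hypothetical near-copy of the signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
signal = rng.standard_normal(n)
noise_cols = rng.standard_normal((n, 3))
y = 2.0 * signal + 0.5 * rng.standard_normal(n)

# Hypothetical near-duplicate, e.g. the same quantity recorded twice.
twin = signal + 0.01 * rng.standard_normal(n)

X_alone = np.column_stack([signal, noise_cols])        # signal is column 0
X_twins = np.column_stack([signal, twin, noise_cols])  # signal is 0, twin is 1

def importances(X):
    forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    return forest.feature_importances_

imp_alone = importances(X_alone)
imp_twins = importances(X_twins)

print("signal alone:   ", np.round(imp_alone, 3))
print("signal and twin:", np.round(imp_twins, 3))
```

The signal's importance drops sharply once the twin is present, even though the column itself has not changed, so a low score next to a correlated neighbour says little about usefulness.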

Pitfalls

The classic leakage trap

Selecting features using all the data before splitting lets the test set vote on which features survive — that's data leakage, and it inflates your scores. Do selection inside the cross-validation loop, so each fold picks features from its own training data only.
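The leakage is easiest to demonstrate on data with no signal at all, where any score above chance has to come from the procedure. The sketch below assumes scikit-learn and uses pure-noise features with arbitrary labels; the sizes (100 rows, 2,000 columns, top 20 kept) are choices made for the illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2000))   # pure noise features
y = np.repeat([0, 1], 50)              # labels assigned without looking at X

# Leaky: the selector sees every row, including rows later used for testing.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Honest: selection is a pipeline step, refit on each fold's training rows only.
honest = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest_acc = cross_val_score(honest, X, y, cv=5).mean()

print(f"select-then-split accuracy: {leaky_acc:.2f}")
print(f"select-inside-CV accuracy:  {honest_acc:.2f}")
```

The only difference between the two runs is where the selector is fitted; the true accuracy on this data is 50%, and only the pipeline version stays near it.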

Do

Select inside the CV loop (a pipeline makes this automatic)
Check the chosen set is stable across folds — wildly different picks each fold is a warning sign
Filter first, then spend wrapper budget on the shortlist
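The stability check from the list above can be sketched by refitting the selector on each fold's training rows and counting how often each column is picked. The data is synthetic, SelectKBest with k=4 is an assumed stand-in for whatever selector you use, and only columns 0 and 1 carry signal.

```python
from collections import Counter

import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 20))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.standard_normal(300)

# Refit the selector on each fold's training rows and record what it keeps.
picks = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    selector = SelectKBest(f_regression, k=4).fit(X[train_idx], y[train_idx])
    picks.append(set(selector.get_support(indices=True).tolist()))

# How many folds picked each column.
counts = Counter(j for fold in picks for j in fold)
always = sorted(j for j, c in counts.items() if c == len(picks))

for i, fold in enumerate(picks):
    print(f"fold {i}: {sorted(fold)}")
print("picked in every fold:", always)
```

Columns chosen in every fold are the ones worth trusting; slots that change from fold to fold are probably being filled by noise.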

Don't

Score features on the full dataset before splitting
Trust rankings blindly when features are correlated: correlation makes the rankings unstable
Re-select over and over to chase the validation score — you'll overfit the validation set itself