Handling Class Imbalance · Suman Bhadra Notes

The 99% accuracy trap

Train a fraud detector where only 1 transaction in 100 is fraud, and a model that answers "not fraud" for everything scores 99% accuracy while catching exactly zero fraud. Accuracy lies whenever the classes are imbalanced, because the majority class drowns out the one you actually care about. The confusion matrix exposes the trick instantly: every entry in the actual-positive row is a miss (a false negative).

The lazy model: predict the majority

Always answer "not fraud". No learning, no features — just bet on the bigger class.

Its accuracy: 99%

Looks brilliant on a leaderboard. The number is real — it's just answering the wrong question.

Its recall: 0%

Every single fraud slips through. Recall is the metric that screams.
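The lazy model's scorecard can be reproduced in a few lines. This is a minimal sketch on made-up labels (10 fraud cases in 1,000 transactions, matching the 1-in-100 ratio above); the variable names are illustrative only.

```python
# Toy labels: 1 = fraud, 0 = legit. 10 fraud cases in 1,000 transactions (1%).
y_true = [1] * 10 + [0] * 990

# The lazy model: predict "not fraud" for every transaction.
y_pred = [0] * len(y_true)

# Tally the four confusion-matrix cells by hand.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)  # share of actual fraud that was caught

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")               # TP=0 FN=10 FP=0 TN=990
print(f"accuracy={accuracy:.2%} recall={recall:.2%}")   # accuracy=99.00% recall=0.00%
```

All ten fraud cases land in the FN cell, which is the "positive row of misses" the confusion matrix exposes.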

Fix the metric first

Before touching the data or the model, stop trusting accuracy. If your scorecard rewards the lazy model, every fix downstream is flying blind.

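Precision, recall, F1: per-class truth

Precision and recall grade the minority class directly: precision is the share of flagged cases that really are positive, recall is the share of real positives that get flagged, and F1 is their harmonic mean. The lazy model's recall of 0 can't hide.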
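As a rough illustration, the three minority-class metrics can be computed straight from confusion-matrix counts. The helper name `precision_recall_f1` and the example counts are made up for this sketch, and returning 0.0 when a denominator is empty is one common convention, not the only one.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Minority-class precision, recall and F1 from confusion-matrix counts."""
    # Guard against empty denominators (e.g. a model that never predicts positive).
    # Returning 0.0 in that case is an assumption of this sketch.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# The lazy model: no positives predicted, 10 fraud cases missed.
lazy = precision_recall_f1(tp=0, fp=0, fn=10)

# A hypothetical model that flags 10 cases and gets 6 of them right.
modest = precision_recall_f1(tp=6, fp=4, fn=4)

print("lazy:  ", lazy)
print("modest:", modest)
```

None of the three numbers uses the true-negative count, which is why a huge majority class cannot inflate them.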

PR curve over ROC: rare positives

When positives are rare, the ROC curve can look rosy: its false positive rate divides by the huge pool of negatives, so even a large number of false alarms barely moves it. The precision–recall curve doesn't count true negatives at all, so it stays honest.
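A single operating point shows the gap. The counts below are invented for illustration: 10 fraud cases among 1,000 transactions, with a model that catches 8 of them while raising 50 false alarms.

```python
positives, negatives = 10, 990   # hypothetical class sizes
tp, fp = 8, 50                   # hypothetical model output
fn = positives - tp
tn = negatives - fp

tpr = tp / (tp + fn)         # recall: y-axis of the ROC curve
fpr = fp / (fp + tn)         # x-axis of the ROC curve, divides by all negatives
precision = tp / (tp + fp)   # y-axis of the PR curve, ignores TN entirely

print(f"ROC point: TPR={tpr:.2f}, FPR={fpr:.3f}")
print(f"PR point:  recall={tpr:.2f}, precision={precision:.3f}")
```

On the ROC plot this point sits near the top-left corner (FPR around 5%), yet roughly six out of every seven alerts are false, which only the precision axis reveals.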

Stratified split: keep the ratio

Always use a stratified train/test split — a random split can leave your test set with almost no positives to evaluate on.
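The idea behind a stratified split can be sketched with the standard library alone: split each class separately at the same fraction, then combine. The function name `stratified_split` and the toy labels are assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) with each class split at the same fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)  # shuffle within the class only
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

# Toy labels: 20 fraud (1) among 2,000 transactions, i.e. 1% positives.
labels = [1] * 20 + [0] * 1980
train_idx, test_idx = stratified_split(labels, test_fraction=0.25)

test_pos = sum(labels[i] for i in test_idx)
print(f"test size={len(test_idx)}, test positives={test_pos}")  # 500 and 5
```

In practice, scikit-learn's `train_test_split` takes a `stratify=y` argument that does this for you.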

One dataset, four fixes

[Interactive demo: a fraud-style dataset with a sea of legit transactions (blue) and a handful of fraud (orange). It steps through the lazy baseline, then class weights, SMOTE, and threshold tuning, with accuracy, precision and recall recomputed at each step.]

Fix the data: resampling

If the model rarely sees fraud, show it more fraud (or less of everything else). All of these rebalance the training set the model learns from.

Undersample: shrink majority

Throw away majority rows until the classes are closer. Fast and simple, but you're discarding real data.

Oversample: repeat minority

Duplicate minority rows. Cheap, but the model can memorize the exact copies and overfit.
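Both random resamplers fit in a few lines. This sketch assumes binary labels (1 = minority, 0 = majority) and rebalances all the way to 1:1, which is a choice made for the example rather than a requirement; the function names are hypothetical.

```python
import random

def split_by_class(rows, labels, minority=1):
    """Separate rows into (minority_rows, majority_rows)."""
    mino = [r for r, y in zip(rows, labels) if y == minority]
    majo = [r for r, y in zip(rows, labels) if y != minority]
    return mino, majo

def random_undersample(rows, labels, seed=0):
    """Keep all minority rows and a random, equally sized subset of majority rows."""
    rng = random.Random(seed)
    mino, majo = split_by_class(rows, labels)
    kept = rng.sample(majo, k=len(mino))  # sampled without replacement
    return mino + kept, [1] * len(mino) + [0] * len(kept)

def random_oversample(rows, labels, seed=0):
    """Keep everything and add random duplicates of minority rows."""
    rng = random.Random(seed)
    mino, majo = split_by_class(rows, labels)
    extra = rng.choices(mino, k=len(majo) - len(mino))  # sampled with replacement
    boosted = mino + extra
    return boosted + majo, [1] * len(boosted) + [0] * len(majo)

rows = list(range(100))            # stand-ins for feature rows
labels = [1] * 5 + [0] * 95        # 5 minority, 95 majority

under_rows, under_labels = random_undersample(rows, labels)
over_rows, over_labels = random_oversample(rows, labels)
print(len(under_rows), sum(under_labels))  # 10 rows, 5 minority
print(len(over_rows), sum(over_labels))    # 190 rows, 95 minority
```

The sizes make the trade-off concrete: undersampling keeps only 10 of 100 rows, while oversampling reaches balance with 95 minority rows drawn from just 5 distinct originals.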

SMOTE: synthesize

Instead of copying, SMOTE draws a line between a minority point and one of its nearest minority neighbors and drops a new synthetic point partway along it.
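The interpolation step can be sketched in plain Python. This is a simplified illustration of the idea, not a reference implementation: the function name `smote_sketch`, the choice of `k=3`, and the sample points are all assumptions. For real work, the imbalanced-learn library ships a SMOTE implementation.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Pick a random position along the segment from base to neighbor.
        t = rng.random()
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical minority (fraud) points in a 2-D feature space.
fraud = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.3), (1.1, 1.4)]
new_points = smote_sketch(fraud, n_new=6)
for point in new_points:
    print(point)
```

Every synthetic point lies on a segment between two real minority points, so it never leaves the region they span, but nothing checks whether majority points also live there.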

Resample the training set only — after splitting

Split first, then resample only the training fold. If you oversample or SMOTE before the split, copies (or near-copies) of the same minority points land in both train and test — the model is graded on data it has effectively already seen, and your scores are fiction.
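The leak is easy to demonstrate by giving every row an id and checking which fraud ids end up on both sides of the split. Everything here is a toy setup: the ids, the 20x duplication factor, and the plain (non-stratified) random split are assumptions made to keep the sketch short.

```python
import random

rng = random.Random(0)

# Hypothetical toy data: (row_id, label). 5 fraud rows, 95 legit rows.
rows = [(i, 1) for i in range(5)] + [(i, 0) for i in range(5, 100)]

def oversample(data, copies=20):
    """Repeat every minority row `copies` times (plain duplication)."""
    minority = [r for r in data if r[1] == 1]
    majority = [r for r in data if r[1] == 0]
    return minority * copies + majority

def random_split(data, test_fraction=0.25):
    """Shuffle a copy of the data and cut it into (train, test)."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def leaked_fraud_ids(train, test):
    """Fraud row ids that appear on both sides of the split."""
    train_ids = {i for i, y in train if y == 1}
    test_ids = {i for i, y in test if y == 1}
    return train_ids & test_ids

# Wrong order: oversample, then split. Copies of one row land on both sides.
train_bad, test_bad = random_split(oversample(rows))

# Right order: split, then oversample the training fold only.
train_ok, test_ok = random_split(rows)
train_ok = oversample(train_ok)

print("leaked ids (resample first):", len(leaked_fraud_ids(train_bad, test_bad)))
print("leaked ids (split first):   ", len(leaked_fraud_ids(train_ok, test_ok)))
```

Splitting first guarantees zero shared ids by construction, because each original row is assigned to exactly one side before any copies are made.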

Fix the algorithm

Often you don't need to touch the data at all — tell the model the minority matters more, or move the cutoff after training.

Class weights: cost the misses

Many scikit-learn classifiers accept class_weight="balanced", which weights each class inversely to its frequency: misclassifying a minority point costs more in the loss, so the decision boundary shifts to protect the minority class.
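To see what "balanced" means numerically, the weights can be computed by hand. scikit-learn documents the heuristic as n_samples / (n_classes * count of each class); the helper below reimplements that formula in plain Python and is a sketch for illustration, not the library's code.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency:
    n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

# Toy labels: 10 fraud (1) among 1,000 transactions.
labels = [1] * 10 + [0] * 990
weights = balanced_class_weights(labels)

print(f"fraud weight={weights[1]:.3f}, legit weight={weights[0]:.3f}")
# fraud weight=50.000, legit weight=0.505
```

With these weights a missed fraud case costs about 99 times as much as a misclassified legit one, and each class contributes the same total weight (10 × 50 = 990 × 0.505 ≈ 500).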

Threshold tuning: 0.5 isn't sacred

The default 0.5 cutoff on predicted probability is a convention, not a law. Slide it down to catch more positives (higher recall, lower precision), or up to flag fewer (higher precision, lower recall), and use the PR curve to pick the trade-off your costs demand.
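A small sweep makes the trade-off visible. The scores and labels below are invented for illustration (they stand in for a trained model's predicted fraud probabilities), and `precision_recall_at` is a hypothetical helper name.

```python
# Hypothetical predicted fraud probabilities with their true labels (1 = fraud).
scored = [
    (0.92, 1), (0.81, 1), (0.64, 0), (0.55, 1), (0.47, 0),
    (0.41, 1), (0.33, 0), (0.28, 1), (0.15, 0), (0.05, 0),
]

def precision_recall_at(threshold, scored):
    """Precision and recall when every score >= threshold is flagged as fraud."""
    tp = sum(1 for p, y in scored if p >= threshold and y == 1)
    fp = sum(1 for p, y in scored if p >= threshold and y == 0)
    fn = sum(1 for p, y in scored if p < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.7, 0.5, 0.3):
    precision, recall = precision_recall_at(threshold, scored)
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
# threshold=0.7 precision=1.00 recall=0.40
# threshold=0.5 precision=0.75 recall=0.60
# threshold=0.3 precision=0.57 recall=0.80
```

Each row is one point on the PR curve. Which one to pick depends on what a missed fraud costs relative to a false alarm.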

More minority data: the real fix

No technique beats genuinely more examples of the rare class. If you can collect or label more, do that first.

SMOTE isn't magic

Synthetic points interpolated between neighbors can blur the class boundary or manufacture fraud inside legit territory, and SMOTE cannot help if your minority points are just inseparable noise. In practice, class weights + threshold tuning + the right metric beat fancy resampling more often than the tutorials admit. Try the boring fix first.