Handling Class Imbalance
The 99% accuracy trap
Train a fraud detector where only 1 transaction in 100 is fraud, and a model that answers "not fraud" for everything scores 99% accuracy — while catching exactly zero fraud. Accuracy lies whenever the classes are imbalanced, because the majority class drowns out the one you actually care about. The confusion matrix exposes the trick instantly: the entire positive row is misses.
The lazy model: always answer "not fraud". No learning, no features, just bet on the bigger class.
Its 99% accuracy looks brilliant on a leaderboard. The number is real; it's just answering the wrong question.
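The trap is easy to reproduce. The sketch below assumes scikit-learn and a hypothetical toy label vector with exactly 1% fraud; DummyClassifier stands in for the lazy model, and the feature matrix is a placeholder because that model never looks at features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical toy data: 1,000 transactions, exactly 1% fraud (label 1).
y = np.zeros(1000, dtype=int)
y[:10] = 1
X = np.zeros((1000, 1))  # placeholder features; the lazy model ignores them

# strategy="most_frequent" always predicts the majority class ("not fraud").
lazy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = lazy.predict(X)

accuracy = accuracy_score(y, y_pred)   # 990 correct out of 1000
recall = recall_score(y, y_pred)       # 0 of the 10 fraud cases caught
cm = confusion_matrix(y, y_pred)       # rows: true class, columns: predicted

print(f"accuracy = {accuracy:.2f}")    # accuracy = 0.99
print(f"recall   = {recall:.2f}")      # recall   = 0.00
print(cm)                              # [[990 0], [10 0]]: positive row is all misses
```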
Fix the metric first
Before touching the data or the model, stop trusting accuracy. If your scorecard rewards the lazy model, every fix downstream is flying blind.
Precision & recall grade the minority class directly — the lazy model's recall of 0 can't hide.
When positives are rare, the ROC curve can look rosy: the false positive rate divides by the huge pool of negatives, so even a large number of false alarms barely moves it. The precision–recall curve doesn't count TNs at all, so it stays honest.
Always use a stratified train/test split. An unstratified random split can leave your test set with almost no positives to evaluate on.
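The three points above can be combined in one short sketch. It assumes scikit-learn and a hypothetical synthetic dataset with roughly 1% positives; the logistic regression is only a stand-in model, and the sizes and seeds are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: about 1% of rows belong to the positive class.
X, y = make_classification(n_samples=20000, n_features=10, n_informative=5,
                           weights=[0.99], flip_y=0, random_state=0)

# stratify=y keeps the positive rate (almost) identical in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
train_rate, test_rate = y_train.mean(), y_test.mean()

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # predicted P(positive)
y_pred = model.predict(X_test)              # uses the default 0.5 cutoff

# Metrics that grade the minority class directly.
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred)

# Ranking metrics: ROC AUC uses true negatives, average precision does not.
roc_auc = roc_auc_score(y_test, scores)
pr_auc = average_precision_score(y_test, scores)

print(f"positive rate  train={train_rate:.4f}  test={test_rate:.4f}")
print(f"precision={precision:.3f}  recall={recall:.3f}")
print(f"ROC AUC={roc_auc:.3f}  average precision={pr_auc:.3f}")
```

A useful reference point when reading the output: a model that ranks at random scores about 0.5 on ROC AUC but only about the positive rate (here roughly 0.01) on average precision.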
One dataset, four fixes
Below is a fraud-style dataset: a sea of legit transactions (blue) and a handful of fraud (orange). Step through the lazy baseline, then watch class weights, SMOTE, and threshold tuning each attack the problem — with accuracy, precision and recall recomputed live.
Fix the data: resampling
If the model rarely sees fraud, show it more fraud (or less of everything else). All of these rebalance the training set the model learns from.
Undersampling: throw away majority rows until the classes are closer. Fast and simple, but you're discarding real data.
Oversampling: duplicate minority rows. Cheap, but the model can memorize the exact copies and overfit.
Instead of copying, SMOTE draws a line between a minority point and one of its nearest minority neighbors and drops a new synthetic point partway along it.
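The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not a full SMOTE implementation: the function name smote_sketch and its parameters are hypothetical, and in practice a library version such as the SMOTE class in imbalanced-learn would be the usual choice.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points between minority points and their neighbors.

    X_min holds minority-class rows only and needs at least two of them.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise squared distances among minority points only.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)                # a point is not its own neighbor
    neighbors = np.argsort(d2, axis=1)[:, :k]   # k nearest minority neighbors

    base = rng.integers(0, n, size=n_new)                    # starting points
    nb = neighbors[base, rng.integers(0, k, size=n_new)]     # one neighbor each
    gap = rng.random((n_new, 1))                             # position in [0, 1)
    # New point lies on the segment from the base point toward its neighbor.
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Hypothetical minority class: 20 points in 2D.
X_min = np.random.default_rng(1).normal(size=(20, 2))
synthetic = smote_sketch(X_min, n_new=50)
print(synthetic.shape)  # (50, 2)
```

Because every synthetic point is a blend of two real minority points, it always falls inside the region those points already span; SMOTE fills in the minority cloud rather than extending it.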
Split first, then resample only the training fold. If you oversample or SMOTE before the split, copies (or near-copies) of the same minority points land in both train and test — the model is graded on data it has effectively already seen, and your scores are fiction.
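A minimal sketch of the safe ordering, assuming scikit-learn and hypothetical random data with 2% positives. It splits row indices first and then oversamples by duplication inside the training indices only, so it is easy to check that no test row ever appears in the training set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical data: 2,000 rows, 2% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = np.zeros(2000, dtype=int)
y[:40] = 1

# 1) Split first (on row indices, stratified by label).
idx = np.arange(len(y))
train_idx, test_idx = train_test_split(
    idx, test_size=0.25, stratify=y, random_state=0)

# 2) Resample only inside the training fold: duplicate minority rows
#    until both classes have the same number of training rows.
min_idx = train_idx[y[train_idx] == 1]
maj_idx = train_idx[y[train_idx] == 0]
extra = resample(min_idx, replace=True,
                 n_samples=len(maj_idx) - len(min_idx), random_state=0)
train_bal_idx = np.concatenate([train_idx, extra])

X_train, y_train = X[train_bal_idx], y[train_bal_idx]
X_test, y_test = X[test_idx], y[test_idx]   # untouched, original class ratio

leaked = set(train_bal_idx) & set(test_idx)
print(len(leaked))      # 0: no test row was copied into training
print(y_train.mean())   # 0.5: training fold is balanced
print(y_test.mean())    # 0.02: test fold keeps the real-world ratio
```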
Fix the algorithm
Often you don't need to touch the data at all — tell the model the minority matters more, or move the cutoff after training.
Many scikit-learn classifiers accept class_weight="balanced": misclassifying a minority point costs more in the loss, so the boundary shifts to protect the minority class.
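To make the weighting concrete, the sketch below shows the weights that "balanced" assigns and compares a plain and a weighted logistic regression. It assumes scikit-learn and hypothetical overlapping Gaussian data with 1% positives; recall is measured on the training data purely to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical data: 1,980 negatives around (0, 0), 20 positives around (1.5, 1.5).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(1980, 2)),
               rng.normal(1.5, 1.0, size=(20, 2))])
y = np.array([0] * 1980 + [1] * 20)

# "balanced" weight per class = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
# class 0: 2000 / (2 * 1980) ~= 0.505    class 1: 2000 / (2 * 20) = 50.0
print(weights)

plain = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# Illustration only: scored on the training data, not a held-out set.
recall_plain = recall_score(y, plain.predict(X))
recall_weighted = recall_score(y, weighted.predict(X))
print(f"recall plain={recall_plain:.2f}  weighted={recall_weighted:.2f}")
```

The weighted model usually trades some precision for that extra recall, which is why the metric choice above matters.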
The default 0.5 cutoff on predicted probability is a convention, not a law. Slide it down to catch more positives, up to flag fewer — and use the PR curve to pick the trade-off your costs demand.
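One way to pick the cutoff from the PR curve is sketched below, assuming scikit-learn. The helper pick_threshold and its min_recall parameter are hypothetical: it returns the highest-precision threshold that still meets a recall target, which is only one of many possible cost rules.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, scores, min_recall=0.8):
    """Return (threshold, precision, recall) with the best precision
    among thresholds whose recall is at least min_recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall carry one extra final point (1, 0) with no threshold.
    precision, recall = precision[:-1], recall[:-1]
    ok = recall >= min_recall
    if not ok.any():
        raise ValueError("no threshold reaches the requested recall")
    best = np.argmax(np.where(ok, precision, -1.0))
    return thresholds[best], precision[best], recall[best]

# Hypothetical labels and predicted probabilities for ten transactions.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.90])

# At the default 0.5 cutoff, the positive scored 0.40 is missed (recall 2/3).
default_recall = (scores[y_true == 1] >= 0.5).mean()

threshold, precision, recall = pick_threshold(y_true, scores, min_recall=0.8)
print(threshold, precision, recall)  # 0.4 0.75 1.0
```

Predictions are then made with scores >= threshold instead of the default cutoff. The threshold should be chosen on validation data and confirmed on a separate test set, for the same leakage reasons that apply to resampling.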
No technique beats genuinely more examples of the rare class. If you can collect or label more, do that first.
Synthetic points interpolated between neighbors can blur the class boundary or manufacture fraud inside legit territory, and SMOTE does nothing if your minority points are just inseparable noise. In practice, class weights + threshold tuning + the right metric beat fancy resampling more often than the tutorials admit. Try the boring fix first.