Precision, Recall & F1

ML classification evaluation threshold F1

Two questions, two metrics

Once you have a confusion matrix, precision and recall ask two different things about the positive class.

Precision = TP / (TP + FP)

"When the model says positive, how often is it right?" Punishes false alarms.

Recall = TP / (TP + FN)

"Of all the real positives, how many did we catch?" Punishes misses.

F1 score (harmonic mean)

F1 = 2·P·R / (P + R), where P is precision and R is recall: one number that's only high when both are high.
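
The three formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name, the example counts, and the choice to report 0.0 when a denominator is zero are all assumptions made here.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    # Assumed convention: a metric whose denominator is zero is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

True negatives never appear in any of the three formulas, which is why these metrics describe the positive class only.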

Drag the threshold

Each dot is an item with a model score from 0 to 1; blue dots are truly positive, grey are truly negative. Everything to the right of the threshold is predicted positive. Slide it and watch precision and recall trade off in real time.

Move it left to catch more positives (higher recall, more false alarms); move it right to be stricter (higher precision, more misses).
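
The same trade-off can be reproduced without the slider. The sketch below uses made-up scores and labels, treats a score at or above the threshold as a positive prediction, and reports precision as 1.0 when nothing is predicted positive; all three choices are assumptions for illustration.

```python
def metrics_at_threshold(scores, labels, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    # Assumed convention: with no positive predictions there are no false alarms.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (invented): 1 = truly positive, 0 = truly negative.
scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.80, 0.90, 0.95]
labels = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

# Moving the threshold right raises precision and lowers recall on this data.
for t in (0.3, 0.5, 0.7, 0.85):
    p, r = metrics_at_threshold(scores, labels, t)
    print(f"threshold={t:.2f} precision={p:.3f} recall={r:.3f}")
```

On this toy data precision climbs from 0.625 to 1.0 as the threshold moves from 0.3 to 0.85, while recall falls from 1.0 to 0.4.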

Which to prioritize?

It depends entirely on the cost of each mistake.

Maximize recall when misses are deadly
  • Cancer screening — missing a sick patient (FN) is far worse than a false alarm
  • Fraud detection — better to review a few extra than let fraud through

Maximize precision when false alarms are costly
  • Spam filter — junking a real email (FP) annoys users more than one spam slipping through
  • Recommendations — a bad suggestion erodes trust

Why harmonic mean for F1?

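The harmonic mean stays low if either precision or recall is low, so you can't game F1 by maxing one and ignoring the other. A plain (arithmetic) average could be fooled: a model that predicts positive for everything gets perfect recall, and the average would still sit near 0.5 even if precision were close to zero.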
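
To make the difference concrete, here is a small comparison of the two means on a few hypothetical precision/recall pairs. The helper names and the example values are invented for illustration.

```python
def arithmetic_mean(p, r):
    return (p + r) / 2


def harmonic_mean(p, r):
    # Defined as 0.0 when both inputs are zero (assumed convention).
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical (precision, recall) pairs, from balanced to badly lopsided.
pairs = [(0.8, 0.8), (0.9, 0.5), (0.01, 1.0)]

for p, r in pairs:
    print(f"P={p:.2f} R={r:.2f} "
          f"arithmetic={arithmetic_mean(p, r):.3f} "
          f"harmonic={harmonic_mean(p, r):.3f}")
```

When the two inputs are equal, both means agree; the further apart they are, the more the harmonic mean is pulled toward the smaller one.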

Beyond F1

Fβ score (weighted)

Fβ = (1 + β²)·P·R / (β²·P + R). β > 1 favours recall, β < 1 favours precision; tune β to your cost ratio.
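
A minimal sketch of the Fβ score, using the standard formula (1 + β²)·P·R / (β²·P + R). The function name and the example precision and recall values are assumptions chosen to show the effect of β.

```python
def f_beta(precision, recall, beta):
    """F-beta score; beta=1 reduces to the ordinary F1."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    # Assumed convention: report 0.0 when the denominator is zero.
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Hypothetical model with modest precision but high recall.
p, r = 0.5, 0.9
for beta in (0.5, 1, 2):
    print(f"beta={beta}: F={f_beta(p, r, beta):.3f}")
```

Because recall is the stronger of the two numbers here, the score rises as β grows and falls as β shrinks.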

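Macro / micro average (multi-class)

With more than two classes, either average the per-class F1 scores (macro), which weights every class equally, or pool the TP, FP and FN counts across classes before computing F1 (micro), which weights every item equally.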
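
The two averaging schemes can be sketched as follows. The class names and per-class counts are hypothetical, and F1 is computed directly from counts as 2·TP / (2·TP + FP + FN), which is algebraically the same as 2·P·R / (P + R).

```python
def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    # Assumed convention: report 0.0 when there is nothing to score.
    return 2 * tp / denom if denom else 0.0


# Hypothetical per-class (TP, FP, FN) counts; "bird" is rare and poorly handled.
counts = {
    "cat": (90, 10, 10),
    "dog": (40, 15, 10),
    "bird": (2, 3, 8),
}

# Macro: compute F1 per class, then take the unweighted mean.
macro_f1 = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)

# Micro: sum the counts over all classes first, then compute one F1.
total_tp = sum(c[0] for c in counts.values())
total_fp = sum(c[1] for c in counts.values())
total_fn = sum(c[2] for c in counts.values())
micro_f1 = f1_from_counts(total_tp, total_fp, total_fn)

print(f"macro F1={macro_f1:.3f} micro F1={micro_f1:.3f}")
```

In this example the macro average comes out lower than the micro average, because the rare class counts as much as the common ones in the macro score but contributes few items to the pooled counts.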

PR & ROC curves (all thresholds)

Sweep every threshold at once — see ROC Curve & AUC.