Precision, Recall & F1
Two questions, two metrics
Once you have a confusion matrix, precision and recall ask two different things about the positive class.
Precision = TP / (TP + FP): "When the model says positive, how often is it right?" Punishes false alarms.
Recall = TP / (TP + FN): "Of all the real positives, how many did we catch?" Punishes misses.
F1 = 2·P·R / (P + R): one number that's only high when both precision (P) and recall (R) are high.
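As a minimal sketch, the three metrics can be computed straight from confusion-matrix counts. The function name `precision_recall_f1` and the counts are made up for illustration, and returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    # Precision: of everything predicted positive, the fraction that was right.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything truly positive, the fraction that was caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of the two (0.0 by convention if both are zero).
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented counts: 8 true positives, 2 false alarms, 8 misses.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: precision=0.80 recall=0.50 f1=0.62
```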
Drag the threshold
Each dot is an item with a model score from 0 to 1; blue dots are truly positive, grey are truly negative. Everything to the right of the threshold is predicted positive. Slide it and watch precision and recall trade off in real time.
Move it left to catch more positives (higher recall, more false alarms); move it right to be stricter (higher precision, more misses).
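The trade-off can also be reproduced without the slider. The sketch below uses ten invented (score, label) pairs and a hypothetical helper `precision_recall_at`; it treats a score strictly above the threshold as "predicted positive", which is an assumption about how ties are handled.

```python
# Invented data: (model score, is the item truly positive?)
items = [
    (0.05, False), (0.15, False), (0.30, False), (0.35, True),
    (0.45, False), (0.55, True), (0.65, False), (0.70, True),
    (0.85, True), (0.95, True),
]


def precision_recall_at(threshold, items):
    """Precision and recall when scores above `threshold` are predicted positive."""
    tp = sum(1 for score, positive in items if score > threshold and positive)
    fp = sum(1 for score, positive in items if score > threshold and not positive)
    fn = sum(1 for score, positive in items if score <= threshold and positive)
    # Fall back to 0.0 when nothing is predicted positive (a convention).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.2, 0.5, 0.8):
    precision, recall = precision_recall_at(threshold, items)
    print(f"threshold={threshold:.1f} precision={precision:.3f} recall={recall:.3f}")
# prints:
# threshold=0.2 precision=0.625 recall=1.000
# threshold=0.5 precision=0.800 recall=0.800
# threshold=0.8 precision=1.000 recall=0.400
```

On this toy data, moving the threshold from 0.2 to 0.8 raises precision from 0.625 to 1.0 while recall falls from 1.0 to 0.4.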
Which to prioritize?
It depends entirely on the cost of each mistake.
Favour recall when a miss is the costlier mistake:
- Cancer screening: missing a sick patient (FN) is far worse than a false alarm (FP)
- Fraud detection: better to review a few extra cases (FP) than let fraud through (FN)
Favour precision when a false alarm is the costlier mistake:
- Spam filter: junking a real email (FP) annoys users more than one spam slipping through (FN)
- Recommendations: a bad suggestion (FP) erodes trust
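F1 is the harmonic mean of precision and recall, and a harmonic mean stays low if either input is low, so you can't game F1 by maxing one and ignoring the other. A plain (arithmetic) average could be fooled: a model that labels everything positive has recall 1.0, so its average is above 0.5 however poor its precision.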
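A quick numeric check makes the difference concrete. This is only an illustrative sketch; the helper names `f1_score` and `arithmetic_mean` and the two precision/recall pairs are invented.

```python
def f1_score(p, r):
    """Harmonic mean of precision p and recall r (0.0 if both are zero)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def arithmetic_mean(p, r):
    """Plain average of precision and recall."""
    return (p + r) / 2


# A lopsided model and a balanced model with the same plain average.
for p, r in [(1.0, 0.1), (0.55, 0.55)]:
    print(f"P={p:.2f} R={r:.2f} "
          f"average={arithmetic_mean(p, r):.2f} F1={f1_score(p, r):.2f}")
# prints:
# P=1.00 R=0.10 average=0.55 F1=0.18
# P=0.55 R=0.55 average=0.55 F1=0.55
```

Both models share a plain average of 0.55, but only the balanced one keeps that value under F1.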
Beyond F1
F-beta = (1 + β²)·P·R / (β²·P + R): β > 1 favours recall, β < 1 favours precision, and β = 1 gives back F1. Tune β to your cost ratio.
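The effect of β can be seen on a single precision/recall pair. The sketch below uses the standard F-beta formula, (1 + β²)·P·R / (β²·P + R); the function name `f_beta` and the example values are assumptions for illustration.

```python
def f_beta(p, r, beta):
    """F-beta score: beta > 1 weights recall more, beta < 1 weights precision more."""
    b2 = beta * beta
    denom = b2 * p + r
    return (1 + b2) * p * r / denom if denom else 0.0


# Invented model: precise but with mediocre recall.
p, r = 0.9, 0.5
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: {f_beta(p, r, beta):.3f}")
# prints:
# beta=0.5: 0.776
# beta=1.0: 0.643
# beta=2.0: 0.549
```

Because this model's weak spot is recall, its score drops as β grows and recall counts for more.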
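Macro vs. micro averaging: when there are many classes, either average the per-class F1 scores (macro, so every class counts equally) or pool all TP, FP and FN counts first and compute a single F1 (micro, so every item counts equally).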
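The two averaging schemes can disagree sharply when classes are imbalanced. A minimal sketch, assuming invented per-class counts and hypothetical class names ("common", "rare"); it uses the equivalent count form F1 = 2·TP / (2·TP + FP + FN).

```python
def f1_from_counts(tp, fp, fn):
    """F1 written directly in terms of counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented per-class counts as (tp, fp, fn); class names are hypothetical.
counts = {
    "common": (90, 10, 10),  # large class, handled well
    "rare": (1, 4, 4),       # small class, handled badly
}

# Macro: compute F1 per class, then take the unweighted mean.
macro_f1 = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)

# Micro: pool the counts across classes, then compute one F1.
total_tp = sum(tp for tp, _, _ in counts.values())
total_fp = sum(fp for _, fp, _ in counts.values())
total_fn = sum(fn for _, _, fn in counts.values())
micro_f1 = f1_from_counts(total_tp, total_fp, total_fn)

print(f"macro F1={macro_f1:.3f} micro F1={micro_f1:.3f}")
# prints: macro F1=0.550 micro F1=0.867
```

Here the poorly handled rare class drags macro F1 down to 0.55, while micro F1 stays near the large class's score because that class contributes most of the pooled counts.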