Model Calibration
"70% chance of rain" should rain 70% of the time
A classifier hands you a score between 0 and 1, and it's tempting to read it as a probability. Calibration asks whether that reading is honest: among all the predictions where the model said 0.7, about 70% should actually turn out positive. A weather forecaster who says "70%" on ten different days should see rain on about seven of them, and the same standard applies to your model.
Crucially, this is a different skill from ranking. A model can put every positive above every negative — a perfect AUC — while its actual numbers are nonsense (say, everything squashed between 0.45 and 0.55). AUC only cares about order; calibration cares about the values.
Ranking: does the model score positives above negatives? This is what ROC / AUC measures.
Calibration: when the model says 0.7, is it right about 70% of the time? The number itself has to mean something.
Great AUC with terrible calibration is common. And recalibrating with a strictly increasing correction never changes the ranking; a correction that is flat in places can only create ties, never reorder scores.
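To make the ranking-versus-calibration split concrete, here is a minimal sketch on made-up data (the class sizes, score ranges, and the stretch factor of 200 are all assumptions chosen for illustration). Every positive outranks every negative, yet the scores sit in a narrow band around 0.5; a strictly increasing stretch changes the values without touching the order.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)

# Hypothetical toy data: 500 negatives followed by 500 positives.
y = np.repeat([0, 1], 500)

# Scores squashed into 0.45-0.55, but every positive sits above every negative.
scores = np.where(
    y == 1,
    rng.uniform(0.51, 0.55, size=y.size),
    rng.uniform(0.45, 0.49, size=y.size),
)

# A strictly increasing stretch around 0.5 (illustrative only, not a fitted calibrator).
stretched = 1.0 / (1.0 + np.exp(-200.0 * (scores - 0.5)))

auc_raw = roc_auc_score(y, scores)
auc_stretched = roc_auc_score(y, stretched)
brier_raw = brier_score_loss(y, scores)
brier_stretched = brier_score_loss(y, stretched)

print(f"AUC   raw={auc_raw:.3f}  stretched={auc_stretched:.3f}")
print(f"Brier raw={brier_raw:.3f}  stretched={brier_stretched:.3f}")
```

The AUC is identical before and after because only the order feeds into it; the Brier score drops because the stretched values sit much closer to the outcomes.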
The reliability diagram
The standard diagnostic is simple: take held-out predictions, bucket them by confidence (0–0.1, 0.1–0.2, …, 0.9–1.0), and for each bucket plot the average predicted probability against the fraction that were actually positive. A perfectly calibrated model traces the diagonal.
On the diagonal: said 0.8, got 80% positives. The scores are trustworthy probabilities.
Above the diagonal (under-confident): said 0.6 but 75% were positive. The model is better than it claims.
Below the diagonal (over-confident): said 0.9 but only 70% were positive. The model bluffs. This is the most common failure.
ECE (expected calibration error) is the average gap between each bucket's mean predicted probability and its observed positive rate, with each bucket weighted by the share of predictions it holds. The Brier score is even simpler: the mean squared difference between the predicted probabilities and the 0/1 outcomes. Lower is better, and it rewards both good ranking and honest numbers.
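The bucketing and both metrics fit in a few lines. This is a minimal sketch assuming ten equal-width buckets; the function names `reliability_table`, `expected_calibration_error`, and `brier_score` are made up for this example and are not library APIs.

```python
import numpy as np

def reliability_table(y_true, y_prob, n_bins=10):
    """Per non-empty bucket: (count, mean predicted probability, observed positive rate)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Equal-width buckets; a prediction of exactly 1.0 goes into the last bucket.
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            rows.append((int(mask.sum()), float(y_prob[mask].mean()), float(y_true[mask].mean())))
    return rows

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Count-weighted average of |mean predicted - observed rate| over the buckets."""
    rows = reliability_table(y_true, y_prob, n_bins)
    total = sum(count for count, _, _ in rows)
    return sum(count / total * abs(pred - obs) for count, pred, obs in rows)

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

# Hypothetical bluffing model: says 0.9 ten times, but only seven are positive.
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
y_prob = [0.9] * 10
ece = expected_calibration_error(y_true, y_prob)
brier = brier_score(y_true, y_prob)
print(ece, brier)
```

For this toy input the single occupied bucket has a gap of 0.9 − 0.7, so the ECE is about 0.2, and the Brier score is (7 × 0.1² + 3 × 0.9²) / 10 = 0.25. Plotting the second column of `reliability_table` against the third gives the reliability diagram.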
Watch a bluffing model get fixed
Below, 200 held-out predictions fall into confidence buckets, the reliability diagram exposes an over-confident model, then Platt scaling and isotonic regression bend the curve back onto the diagonal — with the Brier score improving along the way.
Who's miscalibrated, and why
Naive Bayes assumes features are independent, which double-counts correlated evidence and shoves probabilities toward the extremes (0.01 and 0.99). Ranking can still be fine.
Modern deep nets are famously over-confident — high accuracy, but a softmax that says 0.99 far more often than it deserves to.
SVMs output signed distances to the decision boundary, not probabilities at all. They need calibration just to speak the language.
Logistic regression directly optimizes log-loss, which punishes dishonest probabilities — so it's typically well-calibrated out of the box.
Averaging many trees rarely produces a 0 or a 1, so forests pull scores toward the middle — under-confident at the extremes.
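Two of these claims are easy to check by looking at raw outputs. The sketch below is a rough probe on a synthetic dataset; the dataset settings (including the deliberately redundant features), the 0.01/0.99 cutoffs, and the variable names are assumptions for illustration, and results will vary with the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

# Synthetic data with many redundant (correlated) features, which is the
# situation where Naive Bayes tends to double-count evidence.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, n_redundant=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# SVM: decision_function returns signed distances, not values in [0, 1].
svm_scores = LinearSVC().fit(X_train, y_train).decision_function(X_test)
svm_range = (float(svm_scores.min()), float(svm_scores.max()))

# Naive Bayes: share of test predictions pushed to the extremes.
nb_probs = GaussianNB().fit(X_train, y_train).predict_proba(X_test)[:, 1]
nb_extreme_share = float(np.mean((nb_probs < 0.01) | (nb_probs > 0.99)))

print("SVM score range:", svm_range)
print("Naive Bayes share below 0.01 or above 0.99:", nb_extreme_share)
```

The SVM range straddles zero because the sign encodes the predicted class, so those scores cannot be read as probabilities without a calibration step.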
The fixes: Platt scaling and isotonic regression
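Both work the same way: keep the model, learn a small correction function that maps its raw scores to honest probabilities. Because the correction is monotone, the ranking is preserved: Platt scaling's sigmoid keeps the order (and with it AUC) exactly, while isotonic regression's step function can merge nearby scores into ties, which may shift AUC slightly.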
Platt scaling
- Fits a tiny logistic regression on the model's scores
- Parametric: one smooth S-shaped correction
- Works with little data, hard to overfit
- But it can only fix S-shaped miscalibration
Isotonic regression
- Fits a flexible monotone step function
- Non-parametric: matches almost any miscalibration shape
- Needs more data (think 1000+ samples)
- Can overfit small calibration sets
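Both corrections can be written by hand with scikit-learn building blocks. The sketch below runs on a simulated over-confident model whose reported log-odds are three times the true log-odds; the simulation, the sample sizes, and the choice to feed Platt scaling the log-odds of the score (rather than the raw score) are assumptions made for this example.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)

def simulate(n):
    """Hypothetical over-confident model: reported log-odds are 3x the true log-odds."""
    true_logit = rng.normal(0.0, 1.0, size=n)
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))
    score = 1.0 / (1.0 + np.exp(-3.0 * true_logit))
    return score, y

def log_odds(p, eps=1e-6):
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

cal_score, cal_y = simulate(5000)    # calibration set: used only to fit the corrections
test_score, test_y = simulate(5000)  # separate set: used only to evaluate them

# Platt-style: a one-feature logistic regression (large C means almost no regularization).
platt = LogisticRegression(C=1e6).fit(log_odds(cal_score).reshape(-1, 1), cal_y)
platt_prob = platt.predict_proba(log_odds(test_score).reshape(-1, 1))[:, 1]

# Isotonic: a non-decreasing step function from score to probability.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip").fit(cal_score, cal_y)
iso_prob = iso.predict(test_score)

brier = {
    "raw": brier_score_loss(test_y, test_score),
    "platt": brier_score_loss(test_y, platt_prob),
    "isotonic": brier_score_loss(test_y, iso_prob),
}
print(brier)
```

In this setup the miscalibration is exactly a rescaling of log-odds, which is the shape Platt scaling is built for; with a less tidy distortion, isotonic regression would be expected to have the edge given enough data.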
The correction must be learned on a separate calibration set (or via cross-validation), never on the training data. On its own training set the model already looks artificially confident-and-correct, so you'd "calibrate" to a lie — classic leakage. In sklearn, CalibratedClassifierCV(model, method="sigmoid") (Platt) or method="isotonic" handles the splitting for you.
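A minimal usage sketch of `CalibratedClassifierCV`, assuming a synthetic dataset and a Gaussian Naive Bayes base model (both chosen only for illustration). With `cv=5`, each fold's model is fitted on four fifths of the training data and its calibrator on the remaining fifth, so the correction never sees the rows its model was trained on.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data with redundant features (an assumption for this example).
X, y = make_classification(
    n_samples=4000, n_features=20, n_informative=5, n_redundant=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Uncalibrated baseline.
raw = GaussianNB().fit(X_train, y_train)

# Calibrated wrapper; method="sigmoid" would use Platt scaling instead.
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

raw_prob = raw.predict_proba(X_test)[:, 1]
cal_prob = calibrated.predict_proba(X_test)[:, 1]

brier_raw = brier_score_loss(y_test, raw_prob)
brier_cal = brier_score_loss(y_test, cal_prob)
print(f"Brier raw={brier_raw:.4f}  calibrated={brier_cal:.4f}")
```

The test split plays no part in fitting either the model or the calibrator, so the two Brier scores are a fair comparison.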
When calibration actually matters
The test is simple: do you act on the number, or only on the order?
Expected-cost decisions ("act if p × cost-of-miss > cost-of-action"), medical risk scores shown to a doctor, combining or averaging scores from several models, and setting meaningful decision thresholds all consume the probability itself — a bluffing model quietly corrupts every one of them. If all you need is a top-k ranking (say, "show the 10 most likely churners"), calibration matters much less: only the order counts there.
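The expected-cost rule shows how a bluffed probability flips a decision. The costs and probabilities below are hypothetical numbers, and `should_act` is a name made up for this sketch.

```python
def should_act(p, cost_of_miss, cost_of_action):
    """Rule from the text: act when p * cost-of-miss exceeds cost-of-action."""
    return p * cost_of_miss > cost_of_action

# Hypothetical costs: a miss costs 1000, acting costs 200,
# so the break-even probability is 200 / 1000 = 0.2.
cost_of_miss = 1000.0
cost_of_action = 200.0

honest_p = 0.15   # calibrated estimate for some case
bluffed_p = 0.30  # the same case as scored by an over-confident model

act_on_honest = should_act(honest_p, cost_of_miss, cost_of_action)
act_on_bluffed = should_act(bluffed_p, cost_of_miss, cost_of_action)
print(act_on_honest, act_on_bluffed)
```

The honest estimate falls below the break-even point and the inflated one lands above it, so the same case triggers an unnecessary action. A top-k ranking would be unaffected, since both scores could still sort the case into the same position.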