Model Calibration

"70% chance of rain" should mean rain 70% of the time

A classifier hands you a score between 0 and 1, and it's tempting to read it as a probability. Calibration asks whether that reading is honest: among all the predictions where the model said 0.7, about 70% should actually turn out positive. A weather forecaster who says "70%" on ten different days should be right on about seven of them — same deal for your model.

Crucially, this is a different skill from ranking. A model can put every positive above every negative — a perfect AUC — while its actual numbers are nonsense (say, everything squashed between 0.45 and 0.55). AUC only cares about order; calibration cares about the values.
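
To make the gap between ranking and calibration concrete, here is a minimal sketch in plain Python. The ten labels and scores are made up for illustration, and `auc` and `brier` are small hand-written helpers, not library functions.

```python
def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(y_true, probs):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


# Made-up data: every positive scores 0.55, every negative 0.45.
y_true = [1] * 5 + [0] * 5
squashed = [0.55] * 5 + [0.45] * 5

# A monotone stretch of the same scores: the order is identical, the values are not.
stretched = [0.95 if s > 0.5 else 0.05 for s in squashed]

print("AUC   squashed:", auc(y_true, squashed), " stretched:", auc(y_true, stretched))
print("Brier squashed:", brier(y_true, squashed), " stretched:", brier(y_true, stretched))
```

Both score sets rank perfectly, so AUC cannot tell them apart. The Brier score can, because it looks at the values: the squashed scores said roughly "coin flip" about cases that were never in doubt.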

Discrimination: ranking

Does the model score positives above negatives? This is what ROC / AUC measures.

Calibration: honesty

When the model says 0.7, is it right about 70% of the time? The number itself has to mean something.

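Independent skills: one ≠ other

Great AUC with terrible calibration is common. And recalibrating never reverses the ranking, because the fix is monotone: a strictly increasing map such as Platt scaling leaves the order exactly as it was, while isotonic regression can at most merge neighbouring scores into ties.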

The reliability diagram

The standard diagnostic is simple: take held-out predictions, bucket them by confidence (0–0.1, 0.1–0.2, …, 0.9–1.0), and for each bucket plot the average predicted probability against the fraction that were actually positive. A perfectly calibrated model traces the diagonal.
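
The bucketing step can be sketched in a few lines of plain Python. `reliability_bins` is a hypothetical helper written for this example, and the ten predictions are made up; a real diagram would use far more held-out data.

```python
def reliability_bins(y_true, y_prob, n_bins=10):
    """Return (mean predicted prob, fraction positive, count) per non-empty bucket."""
    stats = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of probs, positives, count]
    for y, p in zip(y_true, y_prob):
        i = min(int(p * n_bins), n_bins - 1)  # p == 1.0 belongs to the last bucket
        stats[i][0] += p
        stats[i][1] += y
        stats[i][2] += 1
    return [
        (p_sum / count, positives / count, count)
        for p_sum, positives, count in stats
        if count  # skip empty buckets: they have no point on the diagram
    ]


# Made-up held-out predictions and outcomes.
y_prob = [0.91, 0.93, 0.95, 0.97, 0.12, 0.15, 0.18, 0.55, 0.52, 0.58]
y_true = [1, 1, 0, 1, 0, 0, 1, 1, 0, 0]

rows = reliability_bins(y_true, y_prob)
for mean_pred, frac_pos, count in rows:
    print(f"said {mean_pred:.2f}, got {frac_pos:.2f} positive (n={count})")
```

Plotting the first column against the second gives the reliability diagram. Buckets with very few predictions produce noisy points, so it helps to keep the counts visible.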

On the diagonal: calibrated

Said 0.8, got 80% positives. The scores are trustworthy probabilities.

Above the diagonal: under-confident

Said 0.6 but 75% were positive — the model is better than it claims.

Below the diagonal: over-confident

Said 0.9 but only 70% were positive — the model bluffs. The most common failure.

Putting a number on it

ECE (expected calibration error) is the average absolute gap between the curve and the diagonal, with each bucket weighted by the share of predictions it holds. The Brier score is even simpler: the mean squared difference between each predicted probability and the 0/1 outcome. Lower is better, and it rewards both good ranking and honest numbers.
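
Both numbers are easy to compute by hand. The sketch below assumes ten equal-width buckets and uses a made-up example: ten predictions of 0.9, of which seven turn out positive. The function names are chosen for this example only.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Size-weighted average of |fraction positive - mean predicted prob| per bucket."""
    buckets = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        buckets[min(int(p * n_bins), n_bins - 1)].append((y, p))
    ece = 0.0
    for bucket in buckets:
        if bucket:
            frac_pos = sum(y for y, _ in bucket) / len(bucket)
            mean_prob = sum(p for _, p in bucket) / len(bucket)
            ece += len(bucket) / len(y_true) * abs(frac_pos - mean_prob)
    return ece


# Made-up case: the model says 0.9 ten times, but only seven are positive.
y_true = [1] * 7 + [0] * 3
bluffing = [0.9] * 10
honest = [0.7] * 10  # the same predictions after an ideal correction

print("ECE  :", expected_calibration_error(y_true, bluffing),
      "->", expected_calibration_error(y_true, honest))
print("Brier:", brier_score(y_true, bluffing), "->", brier_score(y_true, honest))
```

Note that the honest version still has a non-zero Brier score. Being calibrated does not make individual outcomes predictable; it only makes the stated odds match them.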

Watch a bluffing model get fixed

Below, 200 held-out predictions fall into confidence buckets, the reliability diagram exposes an over-confident model, then Platt scaling and isotonic regression bend the curve back onto the diagonal — with the Brier score improving along the way.

Who's miscalibrated, and why

Naive Bayes: extremist

Its independence assumption double-counts correlated evidence, so Naive Bayes shoves probabilities toward 0.01 and 0.99. Ranking can still be fine.

Neural networks: over-confident

Modern deep nets are famously over-confident — high accuracy, but a softmax that says 0.99 far more often than it deserves to.

SVMs: not even probabilities

SVMs output signed distances to the decision boundary, not probabilities at all. They need calibration just to speak the language.

Logistic regression: usually honest

Logistic regression directly optimizes log-loss, which punishes dishonest probabilities — so it's typically well-calibrated out of the box.

Random forests: middle-squashers

Averaging many trees rarely produces a 0 or a 1, so forests pull scores toward the middle — under-confident at the extremes.
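
The Naive Bayes double-counting mentioned above can be seen with a little arithmetic. This is a toy sketch, not a real classifier: the prior of 0.5 and the likelihood ratio of 3 are made-up numbers, and `posterior` is a hypothetical helper.

```python
def posterior(prior, likelihood_ratios):
    """Posterior probability after multiplying the prior odds by each likelihood ratio."""
    odds = prior / (1 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio  # Naive Bayes multiplies in every feature as if it were independent
    return odds / (1 + odds)


# One piece of evidence that makes the positive class 3x more likely.
one_copy = posterior(0.5, [3.0])

# The same evidence showing up as three perfectly correlated features.
three_copies = posterior(0.5, [3.0, 3.0, 3.0])

print(f"counted once:   {one_copy:.3f}")
print(f"counted thrice: {three_copies:.3f}")
```

Nothing new was learned from the second and third copies, yet the stated probability moves from 0.75 to about 0.96. The ranking of cases may survive that, but the numbers no longer mean what they say.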

The fixes: Platt scaling and isotonic regression

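Both work the same way: keep the model, learn a small correction function that maps its raw scores to honest probabilities. Because the correction is monotone, the order of the scores is never reversed. Platt scaling is strictly increasing for any model that ranks better than chance, so it leaves the ranking (and AUC) untouched; isotonic regression has flat steps, so it can tie neighbouring scores and nudge AUC slightly.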

Platt scaling
  • Fits a tiny logistic regression on the model's scores
  • Parametric: one smooth S-shaped correction
  • Works with little data, hard to overfit
  • But it can only fix S-shaped miscalibration

Isotonic regression
  • Fits a flexible monotone step function
  • Non-parametric: matches almost any miscalibration shape
  • Needs more data (think 1000+ samples)
  • Can overfit small calibration sets
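
To show what the two corrections actually fit, here is a simplified from-scratch sketch in plain Python. It is not how a library implements them: the Platt fit uses plain gradient descent without Platt's target smoothing, the isotonic fit is a bare pool-adjacent-violators step function without interpolation and assumes all scores are distinct, and the data is synthetic (an over-confident model whose true positive rate is assumed to be 0.25 + 0.5 × score).

```python
import math
import random


def fit_platt(scores, labels, lr=1.0, steps=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log-loss."""
    a, b, n = 1.0, 0.0, len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = 1.0 / (1.0 + math.exp(-(a * s + b))) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return lambda s: 1.0 / (1.0 + math.exp(-(a * s + b)))


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators: a non-decreasing step function of the score."""
    blocks = []  # each block: [sum of labels, count, highest score in block]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while an earlier block has a higher mean than the next one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, high = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = high

    def predict(s):
        for total, count, high in blocks:
            if s <= high:
                return total / count
        return blocks[-1][0] / blocks[-1][1]  # above every score seen in fitting

    return predict


def brier(labels, probs):
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


# Synthetic calibration set: 200 scores, true positive rate assumed 0.25 + 0.5 * score.
rng = random.Random(0)
scores = [(i + 0.5) / 200 for i in range(200)]
labels = [1 if rng.random() < 0.25 + 0.5 * s else 0 for s in scores]

platt = fit_platt(scores, labels)
isotonic = fit_isotonic(scores, labels)

print("Brier raw     :", brier(labels, scores))
print("Brier Platt   :", brier(labels, [platt(s) for s in scores]))
print("Brier isotonic:", brier(labels, [isotonic(s) for s in scores]))
print("a raw score of 0.9 becomes", platt(0.9), "(Platt) or", isotonic(0.9), "(isotonic)")
```

The Brier scores here are measured on the same points the corrections were fitted to, which flatters both methods and isotonic regression most of all. A fair comparison would score them on a second held-out set.
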

Fit it on held-out data, always

The correction must be learned on a separate calibration set (or via cross-validation), never on the training data. On its own training set the model already looks artificially confident-and-correct, so you'd "calibrate" to a lie — classic leakage. In sklearn, CalibratedClassifierCV(model, method="sigmoid") (Platt) or method="isotonic" handles the splitting for you.
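
A minimal sketch of the sklearn route, assuming scikit-learn is installed. The synthetic dataset, the choice of GaussianNB as the base model, and cv=5 are all assumptions made for illustration. With cross-validation, CalibratedClassifierCV fits the model and the calibrator on different folds, so the calibrator never sees the data its model was trained on.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data with redundant features, which tends to make Naive Bayes over-confident.
X, y = make_classification(
    n_samples=4000, n_features=20, n_informative=2, n_redundant=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

models = {
    "raw": GaussianNB(),
    "sigmoid": CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5),
    "isotonic": CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    prob_pos = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = brier_score_loss(y_test, prob_pos)
    print(f"{name:>8}: Brier = {results[name]:.4f}")
```

The test split is kept away from both the model and the calibrator, so the printed Brier scores are an honest comparison. Which method wins depends on the data and the sample size.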

When calibration actually matters

The test is simple: do you act on the number, or only on the order?

Calibrate when the number drives the decision

Expected-cost decisions ("act if p × cost-of-miss > cost-of-action"), medical risk scores shown to a doctor, combining or averaging scores from several models, and setting meaningful decision thresholds all consume the probability itself — a bluffing model quietly corrupts every one of them. If all you need is a top-k ranking (say, "show the 10 most likely churners"), calibration matters much less: only the order counts there.
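
Here is the expected-cost rule as a tiny sketch. The costs are made up (a miss costs 100, acting costs 80), and the 0.9 versus 0.7 pair stands in for an over-confident model that says 0.9 when the true rate is 0.7.

```python
def should_act(p, cost_of_miss, cost_of_action):
    """Act when the expected cost of a miss exceeds the cost of acting."""
    return p * cost_of_miss > cost_of_action


cost_of_miss, cost_of_action = 100.0, 80.0  # made-up costs; break-even at p = 0.8

claimed_p = 0.9  # what the over-confident model reports
actual_p = 0.7   # how often such cases are really positive

act_on_claimed = should_act(claimed_p, cost_of_miss, cost_of_action)
act_on_actual = should_act(actual_p, cost_of_miss, cost_of_action)

print("decision using the model's number:", act_on_claimed)
print("decision using the honest number :", act_on_actual)

# Expected cost per case, measured against what actually happens.
cost_if_act = cost_of_action
cost_if_wait = actual_p * cost_of_miss
print("expected cost if we act :", cost_if_act)
print("expected cost if we wait:", cost_if_wait)
```

The bluffed 0.9 pushes the decision over the break-even point, so the model's user pays 80 per case when waiting would have cost about 70 on average. A top-k ranking built from the same scores would not have been affected at all.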