Logistic Regression
What it is
Logistic regression is a simple way to turn a measurement into a probability for a yes/no question.
Suppose you have a list of students with the hours each one studied and whether they ended up passing the exam. Plot those pairs and you'll notice something: more study tends to mean a higher chance of passing — not a guaranteed score, just better odds. Logistic regression captures that pattern by drawing a smooth S-shaped curve through the data, where the height of the curve at any x is the model's estimated probability of a "yes" for that input.
Given pairs of (input, yes/no) outcomes, logistic regression finds the S-shaped rule that turns any input into a probability between 0 and 1.
With one input, the model is described by two numbers plus a squashing function:
- Weight w: how sharply the probability rises (or falls) as the input grows. A bigger w makes the curve steeper.
- Bias b: shifts the curve left or right. The point where the predicted probability crosses 0.5 sits at x = −b / w.
- Prediction: compute a score z = w·x + b, then squash it with the sigmoid σ to get a probability.
"Regression" is in the name because the model regresses the log-odds on the inputs — but you use it to classify by thresholding the output probability (typically at 0.5).
Where it's used
Anywhere the question is "yes or no?" and you want a calibrated probability rather than a hard guess.
- Spam filtering: score each message on word features and predict the probability it is spam.
- Medical screening: from lab values, estimate the probability that a patient has a condition.
- Credit risk: from applicant features, predict the chance a loan will go bad.
- Ad click prediction: for each ad shown, predict the probability the user will click.
This page uses an 8-student dataset of hours studied vs. pass/fail — small enough to follow point by point, real enough to make the curve meaningful.
Why not just use linear regression?
A straight line can output any number — billions, negatives, anything. But a probability has to live between 0 and 1. We need a way to take an unbounded score and squash it into that range.
σ(z) = 1 / (1 + e^(−z)). Feed in any real number; out comes a value strictly between 0 and 1. Big positive z → near 1. Big negative z → near 0. Zero z → exactly 0.5.
Logistic regression keeps the linear part — a weighted sum of the inputs — and just runs the result through the sigmoid at the end.
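A quick numerical check of that squashing behaviour, just evaluating σ at a few scores:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Any unbounded score gets mapped into (0, 1).
for z in (-10, -2, 0, 2, 10):
    print(f"sigma({z:+d}) = {sigmoid(z):.4f}")
# sigma(-10) ~ 0.0000, sigma(0) = 0.5000, sigma(+10) ~ 1.0000
```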
The model in two steps
1. Score: the same weighted sum that linear regression uses. With several inputs it becomes z = w₁·x₁ + w₂·x₂ + … + b.
2. Squash: pass the score through the sigmoid to get a probability between 0 and 1.
Each weight has a tidy meaning: a one-unit increase in x multiplies the odds of a yes by e^w. That's why logistic regression is popular when stakeholders want an interpretable model.
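A sketch of the two steps with several inputs, plus the odds-ratio reading of a weight; the feature names and coefficient values here are hypothetical:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical two-feature model: hours studied and a previous exam score.
weights = {"hours": 0.9, "prev_score": 0.05}
bias = -4.0

student = {"hours": 4.0, "prev_score": 60.0}

# Step 1: weighted sum  z = w1*x1 + w2*x2 + b
z = sum(weights[f] * student[f] for f in weights) + bias
# Step 2: squash into a probability
p = sigmoid(z)
print(f"probability of passing: {p:.2f}")

# Odds-ratio reading: one extra hour multiplies the odds of passing by e^w_hours.
print(f"odds multiplier per extra hour: {math.exp(weights['hours']):.2f}")
```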
What "best" means — log loss
For any candidate curve, each training point has a true label (0 or 1) and a predicted probability. We want a single number that says how wrong the curve is across the whole dataset.
The log loss (or cross-entropy) of a single point is −log(p) when the true label is 1 and −log(1 − p) when it's 0. Average across all points and you get the loss we want to minimise.
Log loss punishes confident wrong answers harshly: predicting 0.99 when the truth is 0 costs −log(0.01) ≈ 4.6, and pushing the prediction all the way to 1 sends the loss to infinity. It also makes the optimisation well-behaved: the loss surface has one minimum and gradient descent slides straight to it.
Unlike linear regression, there is no closed-form formula. We start with any curve and use gradient descent — repeatedly nudge w and b in the direction that lowers the log loss until it stops shrinking.
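A compact version of that fitting loop, written out by hand. The hours/pass values below are made-up stand-ins for the page's 8-student dataset, and the gradient of the average log loss works out to mean((p − y)·x) for w and mean(p − y) for b:

```python
import math

# Made-up (hours, passed) pairs standing in for the page's 8-student dataset.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (2.5, 1), (3.0, 1), (4.0, 1), (5.0, 1)]
xs = [x for x, _ in data]
ys = [y for _, y in data]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, b):
    """Average cross-entropy: -log(p) when the true label is 1, -log(1 - p) when it is 0."""
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(data)

# Gradient descent, starting from the flat "no opinion" curve (w = b = 0, so p = 0.5 everywhere).
w, b, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    ps = [sigmoid(w * x + b) for x in xs]
    grad_w = sum((p - y) * x for p, y, x in zip(ps, ys, xs)) / len(data)
    grad_b = sum(p - y for p, y in zip(ps, ys)) / len(data)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"fitted w = {w:.2f}, b = {b:.2f}, log loss = {log_loss(w, b):.3f}")
print(f"decision boundary x* = -b/w = {-b / w:.2f} hours")
```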
Watch the construction
The animation builds the logistic curve on a tiny dataset of study hours vs. pass/fail. The curve starts flat at 0.5 (no opinion), gradient descent steepens and shifts it, and the threshold line at p = 0.5 finally splits the inputs into a "fail" zone and a "pass" zone.
The decision threshold
The model produces a probability; turning that into a yes/no requires a cut-off. The default is 0.5, but it is a knob you can turn.
- Lower the threshold: flag more cases as "yes". Useful when missing a positive is expensive, for example screening for a serious disease.
- Raise the threshold: flag fewer cases, but be more confident in each. Useful when a false alarm is costly, for example blocking a transaction.
With one input, the threshold becomes a single point on the x-axis: x* = −b / w. Anything above it is predicted "yes" and anything below "no" (assuming w is positive; a negative w flips the direction). With two inputs the boundary becomes a straight line; with more, a flat hyperplane.
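A small sketch of moving the cut-off, with illustrative w and b. Solving σ(w·x + b) = t for a general threshold t gives x* = (ln(t / (1 − t)) − b) / w:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def boundary(w, b, threshold=0.5):
    """Input value where the predicted probability equals the chosen threshold."""
    return (math.log(threshold / (1.0 - threshold)) - b) / w

w, b = 1.2, -3.0   # illustrative values, not the fitted ones

for t in (0.3, 0.5, 0.7):
    print(f"threshold {t}: predict 'yes' above x = {boundary(w, b, t):.2f}")
# Lowering the threshold moves the cut-off left (flags more cases as "yes");
# raising it moves the cut-off right (fewer, more confident "yes" calls).
```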
Reading the result
For our study-hours dataset, the fitted curve takes the form p = σ(w·x + b); the visualization above shows the exact w and b the gradient-descent fit converges to, along with the decision boundary x* = −b / w.
More hours studied means higher pass probability; the larger w, the sharper the transition from fail to pass.
The negative intercept shifts the 50-50 point to the right — at zero hours, the model predicts a low probability of passing.
Compute z = w·x + b, then p = σ(z). If p > 0.5, predict pass; otherwise, predict fail.
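The same read-out procedure in code, with illustrative stand-ins for the fitted w and b shown in the visualization:

```python
import math

w, b = 1.2, -3.0   # stand-ins for the fitted values, chosen only for illustration

for hours in (0, 1, 2, 3, 4, 5):
    p = 1.0 / (1.0 + math.exp(-(w * hours + b)))
    print(f"{hours} h -> p(pass) = {p:.2f} -> {'pass' if p > 0.5 else 'fail'}")
# At 0 hours the probability is low (the negative intercept at work);
# it crosses 0.5 at x = -b/w = 2.5 hours and climbs toward 1 beyond that.
```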
Evaluation — how good is the classifier?
Once we have a fitted curve, we want to know how well it actually separates yes from no. A few standard numbers cover most situations.
- Accuracy: fraction of points the threshold gets right. Easy to read, but misleading when the classes are imbalanced.
- Precision and recall: of the "yes" predictions, how many were really yes (precision)? Of the real yes cases, how many did we catch (recall)?
- Log loss: the same number we minimised. Lower is better. Rewards probabilities that are both right and confident.
- ROC AUC: how well the score ranks positives above negatives across every threshold. 1 = perfect ordering, 0.5 = random guessing.
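All four numbers computed by hand on a tiny made-up set of labels and predicted probabilities (in practice a library such as scikit-learn provides these metrics):

```python
import math

# Made-up true labels and predicted probabilities for six examples.
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.2, 0.6, 0.8, 0.4, 0.9, 0.1]
y_pred = [1 if p > 0.5 else 0 for p in y_prob]   # default 0.5 threshold

# Accuracy: fraction of correct hard predictions.
accuracy = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Precision and recall from the confusion counts (assumes at least one predicted/actual positive).
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Log loss: the number the fit minimised.
log_loss = -sum(math.log(p) if t == 1 else math.log(1 - p)
                for t, p in zip(y_true, y_prob)) / len(y_true)

# ROC AUC: chance a random positive gets a higher score than a random negative (ties count half).
pos = [p for t, p in zip(y_true, y_prob) if t == 1]
neg = [p for t, p in zip(y_true, y_prob) if t == 0]
auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} "
      f"log loss={log_loss:.3f} AUC={auc:.2f}")
```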
Assumptions
Logistic regression works best when these conditions are roughly true.
- Linear log-odds: the log-odds of the outcome change roughly linearly with each input; the boundary is flat in feature space, even though the probability curve itself is S-shaped.
- Independent observations: each example stands on its own; one row's outcome doesn't tell you about the next.
- No strong multicollinearity: inputs shouldn't be near-duplicates of each other; if they are, the weights become unstable and hard to interpret (a quick correlation check is sketched below).
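One quick way to spot near-duplicate inputs before fitting is a pairwise correlation matrix; a sketch with made-up feature columns, assuming numpy is available:

```python
import numpy as np

# Made-up feature columns: "hours" and "minutes" are near-duplicates of each other.
hours   = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0])
minutes = hours * 60 + np.random.default_rng(0).normal(0, 1, size=8)
sleep   = np.array([7.0, 6.5, 8.0, 5.5, 7.5, 6.0, 8.5, 7.0])

# Rows are variables, columns are observations.
corr = np.corrcoef(np.vstack([hours, minutes, sleep]))
print(np.round(corr, 2))
# A near-1 off-diagonal entry (hours vs minutes here) warns that the fitted
# weights for those two columns will be unstable and hard to interpret.
```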
When it works — and when it doesn't
It tends to work well when:
- The classes are roughly linearly separable in feature space
- You need calibrated probabilities, not just a yes/no
- You want interpretable coefficients for each input

It tends to struggle when:
- The boundary curves: reach for trees, kernels, or a neural net
- Classes are heavily imbalanced: needs reweighting or resampling
- Features are highly correlated: coefficients become unstable