Logistic Regression


What it is

Logistic regression is a simple way to turn a measurement into a probability for a yes/no question.

Suppose you have a list of students with the hours each one studied and whether they ended up passing the exam. Plot those pairs and you'll notice something: more study tends to mean a higher chance of passing — not a guaranteed score, just better odds. Logistic regression captures that pattern by drawing a smooth S-shaped curve through the data, where the height of the curve at any x is the model's estimated probability of a "yes" for that input.

In one sentence

Given pairs of (input, yes/no) outcomes, logistic regression finds the S-shaped rule that turns any input into a probability between 0 and 1.

With one input, the model is described by two numbers plus a squashing function:

Weight (w) how steep

How sharply the probability rises (or falls) as the input grows. A bigger w makes the curve steeper.

Intercept (b) where it tips

Shifts the curve left or right. The point where the predicted probability crosses 0.5 sits at x = −b / w.

The equation p = σ(w·x + b)

Compute a score z = w·x + b, then squash it with the sigmoid σ to get a probability.
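
In code, the whole model fits in a couple of lines. A minimal sketch (w and b are stand-ins for fitted values):

    import math

    def predict_proba(x, w, b):
        z = w * x + b                        # the score
        return 1.0 / (1.0 + math.exp(-z))   # the sigmoid squash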

Despite the name, it's classification

"Regression" is in the name because the model regresses the log-odds on the inputs — but you use it to classify by thresholding the output probability (typically at 0.5).

Where it's used

Anywhere the question is "yes or no?" and you want a calibrated probability rather than a hard guess.

Email spam vs not

Score each message on word features and predict the probability it is spam.

Medicine disease yes/no

From lab values, estimate the probability that a patient has a condition.

Finance default risk

From applicant features, predict the chance a loan will go bad.

Web click-through

For each ad shown, predict the probability the user will click.

Worked example

This page uses an 8-student dataset of hours studied vs. pass/fail — small enough to follow point by point, real enough to make the curve meaningful.

Why not just use linear regression?

A straight line can output any number — billions, negatives, anything. But a probability has to live between 0 and 1. We need a way to take an unbounded score and squash it into that range.

The sigmoid function

σ(z) = 1 / (1 + e^(−z)). Feed in any real number; out comes a value strictly between 0 and 1. Big positive z → near 1. Big negative z → near 0. Zero z → exactly 0.5.
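
A minimal sketch of the function in Python, checking the three behaviours just described:

    import math

    def sigmoid(z):
        """Squash any real number into (0, 1)."""
        return 1.0 / (1.0 + math.exp(-z))

    print(sigmoid(6))     # ~0.998 (big positive z -> near 1)
    print(sigmoid(-6))    # ~0.002 (big negative z -> near 0)
    print(sigmoid(0))     # 0.5    (zero z -> exactly 0.5)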

Logistic regression keeps the linear part — a weighted sum of the inputs — and just runs the result through the sigmoid at the end.

The model in two steps

Step 1 — Score z = w·x + b

The same weighted sum that linear regression uses. With several inputs it becomes z = w₁·x₁ + w₂·x₂ + … + b.

Step 2 — Squash p = σ(z)

Pass the score through the sigmoid to get a probability between 0 and 1.
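
With several inputs, Step 1 becomes a dot product and Step 2 is unchanged. A sketch using plain lists (the weights are placeholders, not fitted values):

    import math

    def predict_proba(xs, ws, b):
        z = sum(w * x for w, x in zip(ws, xs)) + b   # Step 1: z = w1*x1 + w2*x2 + ... + b
        return 1.0 / (1.0 + math.exp(-z))            # Step 2: squash with the sigmoid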

Reading the weights

Each weight has a tidy meaning: a one-unit increase in x multiplies the odds of a yes by e^w. That's why logistic regression is popular when stakeholders want an interpretable model.
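
A quick numeric check of that rule, with a hypothetical weight that is not from this page's fit:

    import math

    w = 0.9                # hypothetical fitted weight
    print(math.exp(w))     # ~2.46: each one-unit increase in x multiplies the odds by this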

What "best" means — log loss

For any candidate curve, each training point has a true label (0 or 1) and a predicted probability. We want a single number that says how wrong the curve is across the whole dataset.

Definition

The log loss (or cross-entropy) of a single point is −log(p) when the true label is 1 and −log(1 − p) when it's 0. Average across all points and you get the loss we want to minimise.
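
The definition translates directly to code. A minimal sketch with made-up labels and predictions:

    import math

    def log_loss(y_true, p_pred):
        """Average cross-entropy over 0/1 labels and predicted probabilities."""
        total = 0.0
        for y, p in zip(y_true, p_pred):
            total += -math.log(p) if y == 1 else -math.log(1 - p)
        return total / len(y_true)

    print(log_loss([1, 0], [0.9, 0.1]))   # ~0.105: confident and right is cheap
    print(log_loss([1, 0], [0.1, 0.9]))   # ~2.303: confident and wrong is expensive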

Why log loss, not squared error?

Log loss punishes confident wrong answers harshly: predicting 0.99 when the truth is 0 costs −log(0.01) ≈ 4.6, and the penalty grows without bound as the prediction approaches 1. It also makes the optimisation well-behaved: the loss surface is convex, so gradient descent slides straight to its single minimum.

How the curve is found

Unlike linear regression, there is no closed-form formula. We start with any curve and use gradient descent — repeatedly nudge w and b in the direction that lowers the log loss until it stops shrinking.
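
A bare-bones version of that loop. The eight (hours, passed) pairs are made up for illustration, since the page's exact dataset isn't listed here; the gradients are the standard ones for log loss, (p − y)·x for w and (p − y) for b:

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    # Hypothetical stand-in for the 8-student dataset: (hours studied, passed?)
    data = [(0.5, 0), (1.0, 0), (1.75, 0), (2.5, 0),
            (2.0, 1), (3.0, 1), (3.5, 1), (4.0, 1)]

    w, b = 0.0, 0.0    # start flat: p = 0.5 everywhere, no opinion
    lr = 0.1           # learning rate: the size of each nudge

    for _ in range(5000):
        grad_w = grad_b = 0.0
        for x, y in data:
            p = sigmoid(w * x + b)
            grad_w += (p - y) * x    # d(log loss)/dw for this point
            grad_b += (p - y)        # d(log loss)/db for this point
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)

    print(w, b, -b / w)    # fitted weight, intercept, and the 50-50 point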

Watch the construction

The animation builds the logistic curve on a tiny dataset of study hours vs. pass/fail. The curve starts flat at 0.5 (no opinion), gradient descent steepens and shifts it, and the threshold line at p = 0.5 finally splits the inputs into a "fail" zone and a "pass" zone.

The decision threshold

The model produces a probability; turning that into a yes/no requires a cut-off. The default is 0.5, but it is a knob you can turn.

Lower threshold catch more yes

Flag more cases as "yes". Useful when missing a positive is expensive — for example, screening for a serious disease.

Higher threshold be more sure

Flag fewer cases, but be more confident in each. Useful when a false alarm is costly — for example, blocking a transaction.
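
In code, the knob is just the number the probability is compared against. A sketch with made-up model outputs:

    probs = [0.15, 0.35, 0.55, 0.80]       # made-up predicted probabilities

    default = [p > 0.5 for p in probs]     # the usual cut-off
    screen  = [p > 0.3 for p in probs]     # lower: catch more positives
    strict  = [p > 0.7 for p in probs]     # higher: fewer false alarms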

The decision boundary

With one input, the threshold becomes a single point on the x-axis: x* = −b / w. With a positive w, anything above that point is predicted "yes" and anything below is "no". With two inputs the boundary becomes a straight line; with more, a flat hyperplane.
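
For instance, with hypothetical fitted values of w = 2.0 and b = −4.5 (placeholders, not this page's fit), the tipping point lands at 2.25 hours:

    w, b = 2.0, -4.5    # hypothetical fitted values
    x_star = -b / w     # 2.25 hours: the model's 50-50 point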

Reading the result

For our study-hours dataset, the fitted curve lands close to:

The fitted curve

p = σ(w·x + b) — the visualization above shows the exact w and b the gradient-descent fit converges to, along with the decision boundary x* = −b / w.

Weight (w) positive & large

More hours studied means higher pass probability; the larger w, the sharper the transition from fail to pass.

Intercept (b) negative

The negative intercept shifts the 50-50 point to the right — at zero hours, the model predicts a low probability of passing.

Predict for a new x

Compute z = w·x + b, then p = σ(z). If p > 0.5, predict pass; otherwise, predict fail.
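
Continuing with the same hypothetical w = 2.0 and b = −4.5, a student who studied 3 hours gets z = 1.5 and p = σ(1.5) ≈ 0.82, so the model predicts pass:

    import math

    w, b = 2.0, -4.5                        # hypothetical fitted values
    x = 3.0                                 # hours studied by a new student
    z = w * x + b                           # 1.5
    p = 1.0 / (1.0 + math.exp(-z))          # ~0.82
    print("pass" if p > 0.5 else "fail")    # pass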

Evaluation — how good is the classifier?

Once we have a fitted curve, we want to know how well it actually separates yes from no. A few standard numbers cover most situations.

Accuracy % correct

Fraction of points the threshold gets right. Easy to read, but misleading when the classes are imbalanced.

Precision & recall trade-off

Precision: of the "yes" predictions, how many were really yes? Recall: of the real yes cases, how many did we catch?

Log loss probability quality

The same number we minimised. Lower is better. Rewards probabilities that are both right and confident.

ROC-AUC 0.5 → 1

How well the score ranks positives above negatives across every threshold. 1 = perfect ordering, 0.5 = random guessing.
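
In practice these numbers usually come from a library. A sketch using scikit-learn's metrics module, with made-up labels and scores:

    from sklearn.metrics import (accuracy_score, precision_score,
                                 recall_score, log_loss, roc_auc_score)

    y_true = [0, 0, 0, 1, 1, 1]                # made-up true labels
    p_pred = [0.1, 0.3, 0.6, 0.4, 0.8, 0.9]    # made-up predicted probabilities
    y_pred = [int(p > 0.5) for p in p_pred]    # threshold at the default 0.5

    print(accuracy_score(y_true, y_pred))      # fraction correct
    print(precision_score(y_true, y_pred))     # of predicted yes, how many were yes
    print(recall_score(y_true, y_pred))        # of real yes, how many were caught
    print(log_loss(y_true, p_pred))            # quality of the probabilities
    print(roc_auc_score(y_true, p_pred))       # ranking quality across all thresholds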

Assumptions

Logistic regression works best when these conditions are roughly true.

Linear log-odds no curves in z

The log-odds of the outcome change roughly linearly with each input — the boundary is flat in feature space, even though the probability curve itself is S-shaped.

Independent observations no shared signal

Each example stands on its own — one row's outcome doesn't tell you about the next.

No extreme multicollinearity distinct inputs

Inputs shouldn't be near-duplicates of each other; if they are, the weights become unstable and hard to interpret.

When it works — and when it doesn't

Works well when
  • The classes are roughly linearly separable in feature space
  • You need calibrated probabilities, not just a yes/no
  • You want interpretable coefficients for each input
Struggles when
  • The boundary curves — reach for trees, kernels, or a neural net
  • Classes are heavily imbalanced — needs reweighting or resampling
  • Features are highly correlated — coefficients become unstable