Linear Regression

Tags: ML · regression · supervised · least squares

What it is

Linear regression is a simple way to spot a trend in your data and use it to make predictions.

Suppose you have a list of students with the hours each one studied and the exam score they got. Plot those pairs on a chart and you'll usually notice a trend: more study tends to mean a higher score. Linear regression captures that trend by drawing a single straight line through the cloud of points — and once you have the line, you can read off a predicted score for any new student just from how long they studied.

In one sentence

Given pairs of numbers that move together, linear regression finds the straight-line rule that best turns one number into the other.

With one input, the model is just a line — described by two numbers:

Slope (m) how steep

How much the prediction changes when the input goes up by one.

Intercept (b) where it starts

The prediction when the input is zero — the line's height at the y-axis.

The equation y = m·x + b

Plug a new x in, get the predicted y out.

Simple vs. multiple

One input is simple linear regression. Several inputs — for example, predicting rent from square footage and bedroom count — is multiple linear regression. Same idea, just more knobs to tune.

Where it's used

Anywhere a "more of X tends to mean more (or less) of Y" relationship shows up — across science, business, and engineering.

Real estate rent ~ sqft

Predict monthly rent from apartment size.

Marketing sales ~ ad spend

Forecast revenue from advertising budget.

Education score ~ hours

Estimate exam score from hours studied.

Health bp ~ age

Estimate the trend of blood pressure with age.

Worked example

This page uses a 5-student dataset of hours studied vs. exam score — small enough to follow point by point, real enough to make the line meaningful.

What "best" means

For any candidate line, the vertical gap from each point to the line is called a residual — it's how wrong the line is for that observation.

Definition

The least-squares line is the one with the smallest total of squared residuals. Out of every possible line, exactly one wins.

How the line is found

Two common approaches that arrive at the same line (gradient descent approximately, given enough steps):

  • A direct formula — solves for the best line in one shot. Great for small datasets.
  • Gradient descent — start with any line and keep nudging it in the direction that lowers the error. Scales to huge datasets.
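The direct formula is short enough to sketch in full. The five (hours, score) pairs below are not listed on this page; they are a hypothetical dataset chosen to be consistent with the fitted line quoted later (slope 6.4, intercept 45.4):

```python
# Hypothetical 5-student dataset: (hours studied, exam score).
xs = [1, 2, 3, 4, 5]
ys = [52, 58, 64, 72, 77]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Least-squares slope: covariance of x and y divided by variance of x.
m = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
    / sum((x - x_mean) ** 2 for x in xs)
# Intercept: the least-squares line always passes through the point of means.
b = y_mean - m * x_mean

print(round(m, 1), round(b, 1))  # 6.4 45.4
```

The two formulas fall straight out of calculus: set the derivatives of the total squared residual with respect to m and b to zero and solve.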

Watch the construction

The animation builds the least-squares line on a tiny dataset of study hours vs. exam scores. The line starts flat, residuals turn into literal red squares, then the line tilts to shrink them.
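That tilt-to-shrink loop is essentially gradient descent. A minimal sketch, again on a hypothetical dataset chosen to match the fitted line quoted below (the learning rate and step count are illustrative choices, not values from this page):

```python
xs = [1, 2, 3, 4, 5]        # hours studied (hypothetical data)
ys = [52, 58, 64, 72, 77]   # exam scores (hypothetical data)

m, b = 0.0, 0.0             # start with a flat line
lr = 0.02                   # learning rate: size of each nudge

for _ in range(50_000):
    # Gradient of the mean squared error with respect to m and b.
    grad_m = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    m -= lr * grad_m        # nudge both knobs downhill
    b -= lr * grad_b

print(round(m, 1), round(b, 1))  # converges to the same 6.4 and 45.4
```

Each pass moves the line a little in whichever direction lowers the total squared error, which is exactly what the animation shows.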

Why we square the gaps

Reason 1 no cancellation

Without squaring, points above and below the line cancel out. A bad line could net zero by accident.

Reason 2 big misses hurt more

A residual of 4 contributes 16. A residual of 1 contributes 1. So the fit works hardest to avoid large misses.
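A small check of reason 1, on a hypothetical dataset consistent with this page's fitted line: the raw residuals sum to (almost) zero for both a terrible flat line and the best line, so only the squared totals can tell them apart.

```python
xs = [1, 2, 3, 4, 5]
ys = [52, 58, 64, 72, 77]   # hypothetical exam scores

def residuals(m, b):
    """Vertical gap from each point to the line y = m*x + b."""
    return [y - (m * x + b) for x, y in zip(xs, ys)]

flat = residuals(0, sum(ys) / len(ys))   # flat line at the mean score
fit = residuals(6.4, 45.4)               # the least-squares line

print(sum(flat), sum(fit))                            # both ≈ 0: raw gaps cancel
print(sum(r * r for r in flat), sum(r * r for r in fit))  # ≈ 411.2 vs ≈ 1.6
```

By the raw sum, the flat line looks just as good as the best one; by the squared sum, it is over 250 times worse.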

Reading the result

For our study-hours dataset, the best-fit line lands at:

The fitted line

y = 6.4 · x + 45.4

Slope m = 6.4

Each extra hour of study is worth about 6.4 more exam points.

Intercept b = 45.4

The score the model predicts for zero hours of study.

Predict for a new x

Plug it into the equation. For 6 hours: 6.4 × 6 + 45.4 = 83.8.
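The same prediction in code, using the fitted slope and intercept above:

```python
m, b = 6.4, 45.4  # the fitted line from this page

def predict(hours):
    """Predicted exam score for a given number of study hours."""
    return m * hours + b

print(round(predict(6), 1))  # 83.8
```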

Evaluation — how good is the fit?

Once we have a line, we want a number that tells us how well it summarises the data. Two simple ones cover most situations.

RMSE typical error

Roughly how far off the line's predictions are, in the same units as y. Smaller is better.

R² (R-squared) 0 → 1

How much of the pattern the line captures. 1 = perfect, 0 = no better than guessing the average.

Our study-hours fit

R² = 0.996 — the line captures nearly all the pattern. RMSE ≈ 0.57 exam points — predictions are off by roughly half a point on average.
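A sketch of both metrics, on a hypothetical dataset chosen to reproduce the figures above:

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [52, 58, 64, 72, 77]            # hypothetical exam scores
preds = [6.4 * x + 45.4 for x in xs]  # the fitted line's predictions

# RMSE: root of the mean squared residual, in exam-point units.
sse = sum((y - p) ** 2 for y, p in zip(ys, preds))
rmse = math.sqrt(sse / len(ys))

# R²: one minus the ratio of the line's error to the
# "just guess the average score" error.
y_mean = sum(ys) / len(ys)
sst = sum((y - y_mean) ** 2 for y in ys)
r2 = 1 - sse / sst

print(round(rmse, 2), round(r2, 3))  # 0.57 0.996
```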

Assumptions

Linear regression works best when these three conditions are roughly true.

Straight-line relationship no curves

The trend between x and y looks roughly like a straight line, not a curve.

Independent observations no shared signal

Each data point stands on its own — one point's value doesn't tell you about the next.

Even spread of errors consistent scatter

Points scatter around the line by about the same amount everywhere — no fan or funnel shape.

When it works — and when it doesn't

Works well when
  • The relationship is roughly straight
  • The spread around the line is fairly even
  • No single point is wildly out of line
Struggles when
  • The data curves — no straight line fits
  • Outliers exist — one extreme point yanks the whole line
  • Many drivers — extend to multiple linear regression
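The outlier problem is easy to see numerically. Reusing the least-squares formula on a hypothetical dataset, one extreme point drags the slope far from 6.4 and even flips its sign:

```python
def fit(xs, ys):
    """Least-squares slope and intercept for paired data."""
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    m = sum((x - xm) * (y - ym) for x, y in zip(xs, ys)) \
        / sum((x - xm) ** 2 for x in xs)
    return m, ym - m * xm

xs = [1, 2, 3, 4, 5]
ys = [52, 58, 64, 72, 77]          # hypothetical exam scores

print(fit(xs, ys))                 # ≈ (6.4, 45.4)
print(fit(xs + [6], ys + [20]))    # one outlier (6 h, score 20): slope goes negative
```

Because every residual is squared, a single huge miss dominates the total, and the line bends toward it.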