K-Nearest Neighbors

ML · classification · regression · supervised · distance · lazy learning

What it is

K-Nearest Neighbors (KNN) is one of the simplest ideas in machine learning: to label something new, ask its closest neighbors what they are, and go with the majority.

Imagine you move to a new neighborhood and want to guess whether a stranger is a coffee drinker or a tea drinker. The fastest way is to look at the five people who live closest to them and just count. If four of them drink coffee, the stranger probably does too. KNN does exactly that, but on numbers — it stores all the labelled examples it has ever seen, and when a new example shows up, it finds the K closest ones and copies their answer.

In one sentence

To predict the label of a new point, find the K training points closest to it and let them vote.

Three knobs control the whole algorithm:

K how many neighbors

The number of nearest training points that get a say. Small K listens closely; large K averages widely.

Distance how we measure closeness

The rule that decides which points are "near." Usually plain straight-line (Euclidean) distance.

Vote / average how a label is chosen

Classification takes the majority class. Regression averages the neighbors' values.

Lazy learner

KNN does no real training — it just memorizes the dataset. All the work happens at prediction time, when it has to scan the stored points and find the closest ones.

Where it's used

Anywhere a "things that look similar tend to share a label" rule of thumb works — and the dataset is small enough that scanning it is cheap.

Recommendations "users like you"

Find the users most similar to you and recommend what they liked.

Handwriting digit recognition

Classify a scanned digit by comparing it to the closest labelled examples pixel-by-pixel.

Anomaly detection "too far from anything"

Flag a point if its nearest neighbors are unusually distant — it doesn't fit any known cluster.

Imputation filling missing values

Estimate a missing field by averaging the same field from the K most similar rows.
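The imputation idea fits in a few lines. A minimal sketch in plain Python (the helper name and the toy rows are illustrative, not from any library): similarity is measured on every field except the missing one, and the missing field is filled with the neighbors' average.

```python
import math

def knn_impute(rows, row_idx, col_idx, k=3):
    """Estimate rows[row_idx][col_idx] from the k most similar rows.

    Similarity is Euclidean distance over every column except col_idx,
    so the missing field itself never enters the comparison.
    """
    rest = lambda r: [v for j, v in enumerate(r) if j != col_idx]
    target = rest(rows[row_idx])
    # Only rows that actually have the field can donate a value
    donors = [r for i, r in enumerate(rows)
              if i != row_idx and r[col_idx] is not None]
    donors.sort(key=lambda r: math.dist(rest(r), target))
    return sum(r[col_idx] for r in donors[:k]) / min(k, len(donors))

# Row 3 is missing its last field; its other fields sit next to rows 0 and 1
rows = [
    (1.00, 2.00, 10.0),
    (1.10, 2.10, 12.0),
    (5.00, 5.00, 50.0),
    (1.05, 2.05, None),
]
```

With `k=2`, the two closest complete rows are the first two, so the missing field becomes the average of 10.0 and 12.0.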

How it works (the recipe)

Predicting a label with KNN is a four-step routine. There is no model to fit beforehand — every step happens after the new point arrives.

Step 1 store

Keep the entire labelled training set as-is. That's the whole "training" phase.

Step 2 measure

For the new point, compute the distance to every training point.

Step 3 pick K

Sort by distance and keep the K smallest — those are the "nearest neighbors."

Step 4 decide

Classification: take the majority class among the K. Regression: take their average value.
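The four steps compress into a few lines of plain Python. A minimal sketch (the function name and toy data are illustrative; real libraries use smarter data structures than a full scan):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 2: measure — distance from the query to every stored point
    distances = [
        (math.dist(query, x), label)      # math.dist = straight-line distance
        for x, label in zip(train_X, train_y)
    ]
    # Step 3: pick K — sort by distance, keep the k smallest
    distances.sort(key=lambda pair: pair[0])
    nearest = [label for _, label in distances[:k]]
    # Step 4: decide — majority class among the k neighbors
    return Counter(nearest).most_common(1)[0][0]

# Step 1: store — "training" is just keeping the labelled points around
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_X, train_y, query=(2, 2), k=3))  # near the A cluster
```

For regression, the last line of the function would become an average of the neighbors' values instead of a vote.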

No training cost, real prediction cost

Storing the data is free, but every prediction has to look at every training point. With n training rows that's O(n) work per query — fine for thousands of points, painful for millions.

Watch the vote

The animation drops a new gray query point onto a labelled cloud (blue Class A and red Class B), measures the distance to every training point, picks the K closest, and reads off the majority. At the end it re-runs with K = 1 to show how the answer can flip when fewer neighbors get a say.

Picking K

K is the only real choice you make. It controls how wide a net the vote casts, and it trades off two opposite kinds of mistake.

Small K (e.g. K = 1)
  • Sensitive — picks up fine detail in the boundary
  • Easily fooled by noise — one mislabelled neighbor decides the answer
  • This is overfitting — high variance, low bias
Large K (e.g. K = n)
  • Smooth — averages over many neighbors, ignores noise
  • Can blur out real structure — predictions drift toward the overall majority
  • This is underfitting — high bias, low variance
Two practical tips

  • For binary classification, prefer an odd K so a vote can never tie.
  • A common starting heuristic is K ≈ √n, where n is the size of the training set — then tune up or down using a validation split.
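Both tips fit in a short search loop. A sketch in Python (the `knn_predict` helper here is a bare-bones majority-vote classifier defined only for this example): scan odd K values up to roughly √n and keep whichever scores best on a held-out validation split.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    by_dist = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], query))
    return Counter(label for _, label in by_dist[:k]).most_common(1)[0][0]

def pick_k(train_X, train_y, val_X, val_y):
    """Scan odd K up to ~sqrt(n) and keep the best validation accuracy."""
    n = len(train_X)
    best_k, best_acc = 1, -1.0
    for k in range(1, round(math.sqrt(n)) + 2, 2):  # odd K only: votes can't tie
        acc = sum(
            knn_predict(train_X, train_y, q, k) == y
            for q, y in zip(val_X, val_y)
        ) / len(val_y)
        if acc > best_acc:  # strictly better wins, so ties keep the smaller K
            best_k, best_acc = k, acc
    return best_k
```

On a real dataset you would use cross-validation rather than a single split, but the shape of the search is the same.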

Distance metrics

"Closest" only means something once you pick a way to measure it. The metric is part of the model, not a detail.

Euclidean straight line

The everyday ruler distance: √Σ(aᵢ − bᵢ)². The default for continuous numeric features.

Manhattan city blocks

Sum of absolute differences: Σ|aᵢ − bᵢ|. Less dominated by a single large coordinate gap; a common choice when the axes measure unrelated quantities.

Cosine angle, not size

Compares the direction of two vectors. Standard for text and embeddings, where length doesn't matter.

Hamming how many differ

Count of positions where two equal-length strings or binary vectors differ.
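Each of the four metrics is a few lines in Python. A sketch from the formulas above (the function names are illustrative; libraries such as SciPy's `scipy.spatial.distance` module ship tested versions of all four):

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of summed squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute differences
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 − cosine similarity: compares direction, ignores vector length
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

def hamming(a, b):
    # Count of positions where two equal-length sequences differ
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))
```

Note that cosine distance of (1, 0) and (2, 0) is zero — the vectors point the same way, so their different lengths don't matter.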

Scale your features first

Distance is dominated by whichever feature has the largest raw range. If salary goes 0–100,000 and age goes 0–80, salary will swamp age in any Euclidean distance. Standardize (or normalize) every feature before running KNN, or the metric is meaningless.
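A sketch of the fix: rescale every column to mean 0 and standard deviation 1 (z-scores) before any distance is computed. The `people` rows below are made up for illustration.

```python
import math

def standardize(rows):
    """Rescale each column to mean 0, standard deviation 1 (z-scores)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stds))
        for row in rows
    ]

# (salary, age): raw Euclidean distance here is almost entirely salary
people = [(95_000, 25), (40_000, 62), (42_000, 24)]
scaled = standardize(people)
# After scaling, a 1-unit gap in age counts as much as a 1-unit gap in salary
```

In practice a library scaler (e.g. scikit-learn's `StandardScaler`) does the same thing, and must be fit on the training data only, then applied to queries with the training means and deviations.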

When it works — and when it doesn't

Works well when
  • The dataset is small to medium and fits in memory
  • The decision boundary is irregular or non-linear
  • You want a model with zero training cost and almost no parameters
  • Features are scaled and roughly equally informative
Struggles when
  • Many dimensions — distances flatten out (the curse of dimensionality)
  • Lots of training points — every prediction scans them all
  • Irrelevant features drown out the useful ones in the distance
  • Classes are imbalanced — the majority class wins votes by sheer numbers