K-Nearest Neighbors
What it is
K-Nearest Neighbors (KNN) is one of the simplest ideas in machine learning: to label something new, ask its closest neighbors what they are, and go with the majority.
Imagine you move to a new neighborhood and want to guess whether a stranger is a coffee drinker or a tea drinker. The fastest way is to look at the five people who live closest to them and just count. If four of them drink coffee, the stranger probably does too. KNN does exactly that, but on numbers — it stores all the labelled examples it has ever seen, and when a new example shows up, it finds the K closest ones and copies their answer.
To predict the label of a new point, find the K training points closest to it and let them vote.
Three knobs control the whole algorithm:
- K: the number of nearest training points that get a say. Small K listens closely; large K averages widely.
- Distance metric: the rule that decides which points are "near." Usually plain straight-line (Euclidean) distance.
- Aggregation rule: classification takes the majority class; regression averages the neighbors' values.
KNN does no real training — it just memorizes the dataset. All the work happens at prediction time, when it has to scan the stored points and find the closest ones.
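In practice a single library call captures the whole routine. A minimal sketch, assuming scikit-learn is installed and using an invented coffee/tea toy dataset:

```python
# Minimal sketch with scikit-learn's KNeighborsClassifier; the data is made up.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]  # stored examples
y_train = ["coffee", "coffee", "tea", "tea"]                  # their labels

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3
knn.fit(X_train, y_train)                   # "training" = keeping the examples around
print(knn.predict([[1.2, 1.9]]))            # vote among the 3 closest -> ['coffee']
```

Here fit amounts to storing (and lightly indexing) the examples; the neighbor search happens when predict is called.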
Where it's used
Anywhere a "things that look similar tend to share a label" rule of thumb works — and the dataset is small enough that scanning it is cheap.
- Recommendations: find the users most similar to you and recommend what they liked.
- Digit recognition: classify a scanned digit by comparing it to the closest labelled examples pixel by pixel.
- Anomaly detection: flag a point if its nearest neighbors are unusually distant — it doesn't fit any known cluster.
- Missing-value imputation: estimate a missing field by averaging the same field from the K most similar rows (a minimal sketch follows this list).
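The last use case is easy to show concretely. A hedged sketch, assuming scikit-learn's KNNImputer and invented rows:

```python
# Each missing value is replaced by the average of that column over the
# K most similar complete rows; the data here is invented.
import numpy as np
from sklearn.impute import KNNImputer

rows = np.array([
    [25.0,  50_000.0],
    [27.0,  52_000.0],
    [26.0,  np.nan],      # salary missing for this row
    [60.0, 120_000.0],
])
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(rows))   # missing salary filled from the 2 nearest rows
```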
How it works (the recipe)
Predicting a label with KNN is a four-step routine. There is no model to fit beforehand — every step happens after the new point arrives.
1. Keep the entire labelled training set as-is. That's the whole "training" phase.
2. For the new point, compute the distance to every training point.
3. Sort by distance and keep the K smallest — those are the "nearest neighbors."
4. Classification: take the majority class among the K. Regression: take their average value.
Storing the data is free, but every prediction has to look at every training point. With n training rows that's O(n) work per query — fine for thousands of points, painful for millions.
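The four steps translate almost line-for-line into code. A from-scratch sketch (NumPy assumed; all names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    X_train = np.asarray(X_train, dtype=float)                # step 1: the stored examples
    dists = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)  # step 2: distance to every row
    nearest = np.argsort(dists)[:k]                           # step 3: indices of the K smallest
    votes = [y_train[i] for i in nearest]
    return Counter(votes).most_common(1)[0][0]                # step 4: majority class

def knn_regress(X_train, y_train, x_new, k=5):
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.mean([y_train[i] for i in nearest]))      # step 4: neighbors' average
```

The single distance computation over the whole training set is exactly the O(n) scan described above.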
Watch the vote
The animation drops a new gray query point onto a labelled cloud (blue Class A and red Class B), measures the distance to every training point, picks the K closest, and reads off the majority. At the end it re-runs with K = 1 to show how the answer can flip when fewer neighbors get a say.
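The same flip is easy to reproduce outside the animation. A toy sketch with invented points placed so that the five-neighbor vote and the single nearest neighbor disagree:

```python
# Invented data: the gray query sits just beside one Class B point,
# but most of its 5 nearest neighbors belong to Class A.
import numpy as np
from collections import Counter

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],   # Class A cluster
              [1.05, 1.0], [2.0, 2.2], [2.2, 2.0]])              # Class B points
y = np.array(["A", "A", "A", "A", "B", "B", "B"])
query = np.array([1.0, 1.0])                                     # the gray query point

dists = np.linalg.norm(X - query, axis=1)
for k in (5, 1):
    nearest = np.argsort(dists)[:k]
    winner = Counter(y[nearest]).most_common(1)[0][0]
    print(f"K={k}: the vote says class {winner}")   # K=5 -> A, K=1 -> B
```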
Picking K
K is the only real choice you make. It controls how wide a net the vote casts, and it trades off two opposite kinds of mistake.
Small K:
- Sensitive — picks up fine detail in the boundary
- Easily fooled by noise — one mislabelled neighbor decides the answer
- This is overfitting — high variance, low bias

Large K:
- Smooth — averages over many neighbors, ignores noise
- Can blur out real structure — predictions drift toward the overall majority
- This is underfitting — high bias, low variance
For binary classification, prefer an odd K so a vote can never tie. A common starting heuristic is K ≈ √n where n is the size of the training set — then tune up or down using a validation split.
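A hedged sketch of that workflow, assuming scikit-learn; the helper name and candidate range are illustrative:

```python
# Start near sqrt(n), force K odd, then let a validation split pick the final value.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def pick_k(X, y, seed=0):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=seed)
    start = int(np.sqrt(len(X_tr))) | 1                       # K ≈ sqrt(n), bumped to odd
    candidates = [k for k in range(max(1, start - 6), start + 7, 2)
                  if k <= len(X_tr)]                          # odd Ks around the heuristic
    scores = {k: KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_val, y_val)
              for k in candidates}
    return max(scores, key=scores.get)                        # best validation accuracy wins
```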
Distance metrics
"Closest" only means something once you pick a way to measure it. The metric is part of the model, not a detail.
- Euclidean: the everyday ruler distance, √Σ(aᵢ − bᵢ)². The default for continuous numeric features.
- Manhattan: the sum of absolute differences, Σ|aᵢ − bᵢ|. Less dominated by one large gap than Euclidean, so a common choice when the axes measure unrelated quantities.
- Cosine: compares the direction of two vectors, not their length. Standard for text and embeddings, where magnitude doesn't matter.
- Hamming: the count of positions where two equal-length strings or binary vectors differ.
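For reference, minimal NumPy sketches of all four metrics (function names are illustrative; a and b are 1-D arrays):

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))        # straight-line distance

def manhattan(a, b):
    return np.sum(np.abs(a - b))                # sum of absolute differences

def cosine_distance(a, b):
    # 0 when the vectors point the same way, regardless of their lengths
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def hamming(a, b):
    return int(np.sum(a != b))                  # positions where the entries differ
```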
Distance is dominated by whichever feature has the largest raw range. If salary goes 0–100,000 and age goes 0–80, salary will swamp age in any Euclidean distance. Standardize (or normalize) every feature before running KNN, or the metric is meaningless.
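A minimal standardization sketch in plain NumPy; the column meanings are invented, and the key point is that new queries must be scaled with the training statistics:

```python
import numpy as np

X = np.array([[25.0,  40_000.0],
              [47.0,  95_000.0],
              [33.0,  60_000.0]])               # columns: age, salary (toy values)

mean, std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / std                     # z-score: each feature gets mean 0, std 1

# Scale any new query with the *training* mean and std before measuring distances:
query_scaled = (np.array([30.0, 70_000.0]) - mean) / std
```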
When it works — and when it doesn't
It works when:
- The dataset is small to medium and fits in memory
- The decision boundary is irregular or non-linear
- You want a model with zero training cost and almost no parameters
- Features are scaled and roughly equally informative

It struggles when:
- There are many dimensions — distances flatten out (the curse of dimensionality; a small demo follows this list)
- There are lots of training points — every prediction scans them all
- Irrelevant features drown out the useful ones in the distance
- Classes are imbalanced — the majority class wins votes by sheer numbers
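The first failure mode is easy to see numerically. A small illustrative experiment with invented random data: as the number of dimensions grows, the nearest and farthest random neighbors end up almost the same distance from the query, so "closest" stops carrying much information.

```python
# As d grows, the relative gap between the nearest and farthest neighbor shrinks.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((1000, d))      # 1000 random "training" points in [0, 1]^d
    query = rng.random(d)               # one random query point
    dists = np.linalg.norm(points - query, axis=1)
    spread = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative spread of distances: {spread:.2f}")
```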