K-Means Clustering · Suman Bhadra Notes

What it is

K-Means is the go-to algorithm for finding structure in data that has no labels at all: it sorts a pile of points into K groups so that every point sits as close as possible to the center of its group.

Picture an empty map of a region with towns scattered across it, and you have been asked to build K post offices. You start by dropping K flags more or less at random. Every town then walks to its nearest flag — that is its post office. Now you look at each flag and notice it is not really in the middle of the towns that chose it, so you slide the flag to the exact center of its towns. But once the flags move, some towns are now closer to a different flag, so they switch. You re-center the flags again. Repeat this — towns pick the nearest flag, flags move to the middle of their towns — and after a few rounds nothing changes anymore. K-Means does exactly this with data points and "centroids."

In one sentence

Place K centroids, assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat until the centroids stop moving.

This is not KNN — even though both start with "K"

K-Means and K-Nearest Neighbors sound alike and both have a "K", but they are opposites. KNN is supervised: it needs labelled examples and predicts the label of one new point by polling its neighbors. K-Means is unsupervised: there are no labels — it discovers groups on its own. And the "K" means different things: in KNN it is the number of neighbors that vote; in K-Means it is the number of clusters you want to find.

Two pieces define the whole algorithm:

K how many clusters

The number of groups you ask the algorithm to find. You must choose this up front — the data will not tell you.

Centroid a cluster's center

The mean position of all points currently assigned to a cluster. It need not be an actual data point.

Assignment point → nearest centroid

Each point belongs to whichever centroid is closest, by straight-line (Euclidean) distance.

No labels, no teacher

K-Means never sees a "correct answer." It only sees the raw coordinates of the points and tries to organize them. That is what makes it unsupervised learning — it finds patterns rather than reproducing known labels.

Where it's used

Anywhere you have unlabelled data and a hunch that it falls into a handful of natural groups.

Customer segmentation marketing

Group customers by spending, frequency, and recency so each segment can be targeted differently — without anyone hand-labelling them first.

Image compression color quantization

Cluster the millions of pixel colors into K representative colors, then store each pixel as its cluster's color — shrinking the palette dramatically.

Document grouping topic discovery

Cluster articles or tickets by their feature vectors to surface themes nobody defined ahead of time.

Anomaly detection "far from every center"

Points that sit far from every centroid don't belong to any tight cluster and can be flagged as unusual.

How it works (the recipe)

K-Means is a short loop that repeats two steps — assign then update — until things settle.

Step 1 initialize

Choose K, then place K centroids — at random, or smartly with k-means++ (more below).

Step 2 assign

Give every point to the centroid nearest to it. This carves the space into K groups.

Step 3 update

Move each centroid to the mean (average position) of all the points just assigned to it.

Step 4 repeat

Re-assign, re-update, again and again. Stop when no point changes group — the centroids have stopped moving. That's convergence.

What it is actually minimizing

Each assign/update pass lowers the within-cluster sum of squares — the total squared distance from every point to its centroid, also called inertia. The loop is guaranteed to converge, but only to a local optimum: a bad starting placement can lock it into a mediocre answer. That is exactly why initialization matters, and why k-means++ — which spreads the initial centroids apart on purpose — is the standard default.

Watch the centroids settle

The animation starts with 15 unlabelled gray points and three centroids (diamonds) parked deliberately off to one side. Watch the loop: in the ASSIGN step every point takes the color of its nearest centroid; in the UPDATE step each centroid slides to the mean of its newly colored points. After a couple of rounds the centroids land in the middle of the three natural blobs and stop moving — convergence.

Choosing K

K-Means cannot guess how many clusters you want — you have to tell it. Two methods help you pick a sensible number.

Elbow method inertia vs K

Run K-Means for K = 1, 2, 3, … and plot the inertia each time. Inertia always falls as K grows, but at the "right" K the curve bends sharply and then flattens — that elbow is your candidate K.

Silhouette score tight and separated?

For each point, compare how close it is to its own cluster versus the nearest other cluster. The score runs from −1 to 1; pick the K that maximizes the average — clusters that are both compact and well separated.

K is a choice, and bigger is not better

There is no K that the data simply "wants." Push K too high and inertia keeps dropping toward zero — in the limit every point becomes its own cluster, which fits the data perfectly and tells you nothing. The elbow and silhouette exist precisely to fight that temptation.

Run the loop yourself. Pick a K, then hit Step to do one ASSIGN + UPDATE pass — points recolor to their nearest centroid and each diamond slides to its cluster's mean. Watch inertia drop every step until it reads "converged." Now drag the diamonds to a bad starting spot before stepping, or use Reset (new seeds) to cycle through different initial placements, and see how a poor start can settle into a worse answer (higher final inertia) — that is the local-optimum trap.

K

cluster 1 cluster 2 cluster 3 cluster 4 unassigned

Gotchas

K-Means is fast and intuitive, but it leans on some quiet assumptions. Break them and the clusters it returns can be misleading.

Assumptions it makes

Clusters are roughly spherical (round blobs), not stretched or curved
Clusters are of similar size and density
Distance in every direction means the same thing

What trips it up

Feature scaling — a large-range feature dominates the distance; standardize first
Initialization — a poor start lands in a worse local optimum; use k-means++ and several restarts
Outliers — a single far-off point drags its centroid (the mean) off to one side

Standardize before you cluster

Because K-Means measures distance, whichever feature has the biggest raw range silently controls the result. If income spans 0–100,000 and age spans 0–80, income alone decides every assignment. Standardize (or normalize) all features first, or the clusters are meaningless.

When it works — and when it doesn't

Works well when

It is fast and simple — easy to run, easy to explain
It scales to large datasets with many points
Clusters are compact and roughly round
You roughly know how many groups you expect

Struggles when

You must pick K up front and have no idea what it should be
Clusters are non-spherical or have very different densities
The data has outliers or unscaled features
It only finds a local optimum — different starts give different answers

When K-Means is the wrong tool

For elongated, nested, or oddly shaped clusters, try density-based methods like DBSCAN (which also finds the number of clusters for you and tolerates outliers) or Gaussian Mixture Models (which allow stretched, overlapping clusters). K-Means is the right first reach for clean, round, well-separated groups.