Support Vector Machines

Tags: ML, classification, margin, kernel trick, supervised

The big idea — the widest street

Many lines can separate two classes. A support vector machine picks the one that leaves the widest empty street between them.

That street is called the margin, and its centre line is the decision boundary (a hyperplane in higher dimensions). Maximising the margin gives the boundary the most breathing room, which tends to make it generalise better to new points.

Support vectors

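Only the points sitting right on the edges of the street matter (plus, with a soft margin, any points that end up inside it): they "support" the boundary. Move any other point, as long as it stays outside the street, and nothing changes. That's why it's called a support vector machine.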
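To make that concrete, here is a minimal sketch in plain Python. The six points are made up for illustration, and the boundary x = 2 is taken as given (it is the widest street for this toy data) rather than solved for; the helper names `signed_distance` and `support_vectors` are hypothetical.

```python
import math

# Toy data (made up): label +1 on the right, -1 on the left.
points = [((3, 0), +1), ((3, 2), +1), ((5, 1), +1),
          ((1, 0), -1), ((1, 2), -1), ((-1, 1), -1)]

# Assumed boundary for this data: the vertical line x = 2,
# written in the form w . x + b = 0.
w, b = (1.0, 0.0), -2.0


def signed_distance(x, w, b):
    """Distance from x to the boundary; the sign says which side x is on."""
    return (w[0] * x[0] + w[1] * x[1] + b) / math.hypot(w[0], w[1])


def support_vectors(points, w, b, tol=1e-9):
    """Return the half-width of the street and the points resting on its edges."""
    dists = [(x, y * signed_distance(x, w, b)) for x, y in points]
    half_width = min(d for _, d in dists)
    return half_width, [x for x, d in dists if d - half_width <= tol]


half_width, sv = support_vectors(points, w, b)
print(half_width, sv)

# Move a point that is not a support vector: the street stays the same.
moved = [((6, 4), +1) if x == (5, 1) else (x, y) for x, y in points]
half_width_moved, sv_moved = support_vectors(moved, w, b)
print(half_width_moved, sv_moved)
```

The far-away point (5, 1) can wander to (6, 4) without touching the street; only the four points nearest the boundary define it.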

Watch the margin form, then the kernel lift

The animation starts with several valid separators, settles on the maximum-margin one, highlights the support vectors, then shows the trick that rescues data no straight line can split.

Hard margin vs soft margin

Real data is messy — a perfect gap rarely exists. SVMs add a knob, C, that trades off margin width against mistakes.

Hard margin: no mistakes allowed

Demands perfect separation. Brittle — one noisy point can break it or shrink the margin to nothing.

Soft margin (large C): few mistakes

Punishes misclassifications heavily → narrow margin, low bias, risk of overfitting.

Soft margin (small C): tolerant

Allows some points inside the street → wider margin, smoother boundary, more robust to noise. Push C too low, though, and the model can underfit.
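The trade-off can be sketched with a few lines of Python. This assumes the standard soft-margin objective, 0.5 * ||w||^2 + C * (sum of hinge losses), which the text above does not spell out. The 1-D data and the two candidate boundaries are made up for illustration; they are hand-picked candidates, not the output of an optimiser.

```python
def soft_margin_objective(w, b, data, C):
    """0.5 * ||w||^2 + C * sum of hinge losses (assumed standard form)."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = sum(
        max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, y in data
    )
    return margin_term + C * hinge


# Made-up 1-D data: one positive point (-1) strays toward the negative side.
data = [((-3.0,), -1), ((-2.0,), -1), ((-1.0,), +1), ((2.0,), +1), ((3.0,), +1)]

# Two hand-picked candidates, each given as (w, b):
wide = ((0.5,), 0.0)    # boundary at 0, street from -2 to 2, stray point misclassified
narrow = ((2.0,), 3.0)  # boundary at -1.5, street from -2 to -1, no violations

results = {}
for C in (0.1, 10.0):
    obj_wide = soft_margin_objective(wide[0], wide[1], data, C)
    obj_narrow = soft_margin_objective(narrow[0], narrow[1], data, C)
    results[C] = (obj_wide, obj_narrow)
    winner = "wide" if obj_wide < obj_narrow else "narrow"
    print(f"C={C}: wide={obj_wide:.3f} narrow={obj_narrow:.3f} -> prefers {winner}")
```

With a small C the wide street scores lower (better) even though it gets the stray point wrong; with a large C the penalty for that one mistake dominates and the narrow street wins.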

The animation claimed the max-margin line is special — test that claim. Steer your own candidate line with the two sliders. While it separates the classes you'll see its street; the readout tracks how close you get to the widest one. The ringed points are the would-be support vectors: the street always rests on them.

Notice how many settings separate the data perfectly — yet only one street is widest. Tilt the line slightly off the optimum: it still classifies the training data flawlessly, but its narrower street means less room for new, slightly-shifted points. That's the SVM's entire argument.
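The same experiment can be approximated offline. The sketch below uses a made-up six-point dataset and a hypothetical helper, `street_width`, that measures the empty street around a candidate line described by an angle and an offset (standing in for the two sliders). The coarse grid search is a brute-force check, not how an SVM is actually trained.

```python
import math

# Toy data (made up): label +1 on the right, -1 on the left.
points = [((3, 0), +1), ((3, 2), +1), ((5, 1), +1),
          ((1, 0), -1), ((1, 2), -1), ((-1, 1), -1)]


def street_width(angle, offset, points):
    """Width of the empty street around the line n . x = offset,
    where n = (cos angle, sin angle). Returns 0.0 if the line does not separate."""
    n = (math.cos(angle), math.sin(angle))
    nearest = min(y * (n[0] * x[0] + n[1] * x[1] - offset) for x, y in points)
    return 2.0 * max(nearest, 0.0)


best = street_width(0.0, 2.0, points)      # the vertical line x = 2
tilted = street_width(0.2, 2.159, points)  # tilted 0.2 rad, offset roughly re-centred

# Brute-force check over a coarse grid of angles and offsets.
grid_best = max(
    street_width(a / 100, d / 100, points)
    for a in range(-100, 101)
    for d in range(0, 401)
)

print(best, tilted, grid_best)
```

Both lines separate the training points, but the tilted one leaves a narrower street, and no setting on the grid beats the vertical line.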

The kernel trick

What if the classes wrap around each other and no line can split them? Instead of drawing curves directly, an SVM lifts the data into a higher-dimensional space where a flat boundary does work — then projects that boundary back down as a curve.
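Here is a small sketch of the lift for the classic case of one class ringed around another. The extra coordinate z = x^2 + y^2 and the threshold z = 5 are chosen by hand for this made-up data; a real SVM would find the separating plane itself.

```python
import math


def ring(radius, n):
    """n evenly spaced points on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]


inner = ring(1.0, 8)  # class -1, surrounded by ...
outer = ring(3.0, 8)  # ... class +1: no straight line in 2-D splits them


def lift(p):
    """Add a third coordinate: squared distance from the origin."""
    x, y = p
    return (x, y, x * x + y * y)


# In the lifted space the flat plane z = 5 separates the classes.
threshold = 5.0
inner_below = all(lift(p)[2] < threshold for p in inner)
outer_above = all(lift(p)[2] > threshold for p in outer)
print(inner_below, outer_above)


def predict(p):
    """Back in 2-D, the plane z = 5 becomes the circle x^2 + y^2 = 5."""
    return +1 if lift(p)[2] > threshold else -1


print(predict((0.0, 0.0)), predict((4.0, 0.0)))
```

Projected back down, the flat plane shows up as a circle of radius sqrt(5) sitting between the two rings.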

The clever part

You never actually compute the high-dimensional coordinates. A kernel function computes the needed dot products directly, as if the data had been lifted — cheaply.
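A quick numerical check makes the shortcut tangible. The sketch below uses the degree-2 polynomial kernel k(x, z) = (x . z)^2 on 2-D inputs, whose explicit lift is phi(x) = (x1^2, sqrt(2) * x1 * x2, x2^2). The function names and the two sample points are illustrative choices.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(x):
    """Explicit degree-2 lift of a 2-D point into 3-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly2_kernel(x, z):
    """Same dot product as dot(phi(x), phi(z)), computed without lifting."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))   # lift first, then take the dot product
via_kernel = poly2_kernel(x, z)  # stay in 2-D the whole time
print(explicit, via_kernel)
```

Both routes agree up to floating-point rounding. The saving grows with the size of the lifted space: for the RBF kernel the implied lift is infinite-dimensional, so the kernel is the only practical route.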

Linear: no lift

A straight boundary. Fast, and a good fit when the data is already (close to) linearly separable.

Polynomial: curved

Boundaries that bend — adds interaction terms between features.

RBF (Gaussian): flexible blobs

A popular default: it can wrap around clusters of almost any shape, provided its width parameter (usually called gamma) is tuned sensibly.
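Written out as plain functions, the three kernels look roughly like this. The parameter names `degree`, `coef0`, and `gamma` and their default values are illustrative assumptions, not settings taken from this page.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def linear_kernel(x, z):
    """No lift: the ordinary dot product."""
    return dot(x, z)


def polynomial_kernel(x, z, degree=3, coef0=1.0):
    """Implicitly adds products of features up to the given degree."""
    return (dot(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    """Similarity that decays with squared distance; gamma sets how fast."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


a, near, far = (1.0, 2.0), (1.5, 2.0), (6.0, -3.0)
print(linear_kernel(a, near), polynomial_kernel(a, near, degree=2))
print(rbf_kernel(a, a), rbf_kernel(a, near), rbf_kernel(a, far))
```

The RBF kernel returns exactly 1 for identical points and falls toward 0 as points move apart, which is why its boundaries hug local clusters.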

When to use it

Shines when
  • Data is high-dimensional (text, genomics)
  • The dataset is small to medium
  • A clear margin of separation exists
Struggles when
  • There are millions of rows: kernel SVM training time grows faster than linearly with dataset size, so it gets slow
  • Classes overlap heavily and noise is high
  • You need probabilities: an SVM outputs signed scores (distance from the boundary), not calibrated probabilities