Support Vector Machines
The big idea — the widest street
Many lines can separate two classes. A support vector machine picks the one that leaves the widest empty street between them.
That street is called the margin, and its centre line is the decision boundary (a hyperplane in higher dimensions). Maximising the margin gives the boundary the most breathing room, which tends to make it generalise better to new points.
Only the points sitting right on the edges of the street matter: they "support" the boundary. Move any other point, as long as it stays outside the street, and nothing changes. That's why it's called a support vector machine.
Watch the margin form, then the kernel lift
The animation starts with several valid separators, settles on the maximum-margin one, highlights the support vectors, then shows the trick that rescues data no straight line can split.
Hard margin vs soft margin
Real data is messy — a perfect gap rarely exists. SVMs add a knob, C, that trades off margin width against mistakes.
Hard margin: Demands perfect separation. Brittle: one noisy point can break it or shrink the margin to nothing.
High C: Punishes misclassifications heavily → narrow margin, low bias, risk of overfitting.
Low C: Allows some points inside the street → wider margin, smoother boundary, more robust.
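The trade-off that C controls can be sketched with the usual soft-margin objective, which adds a margin term to C times the total hinge loss. The toy data and the two candidate lines below are hypothetical, chosen only for illustration, and the code compares two hand-picked lines rather than solving for the optimum.

```python
# Soft-margin objective for a linear SVM (a sketch, not a solver):
#   0.5 * ||w||^2  +  C * sum(max(0, 1 - y * (w.x + b)))

def hinge_loss(w, b, x, y):
    """Zero outside the street on the correct side; grows linearly otherwise."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

def soft_margin_objective(w, b, points, labels, C):
    margin_term = 0.5 * sum(wi * wi for wi in w)  # small ||w|| means a wide street
    slack_term = sum(hinge_loss(w, b, x, y) for x, y in zip(points, labels))
    return margin_term + C * slack_term

# Hypothetical toy data: the last +1 point is noisy and sits near the -1 cluster.
points = [(1.0, 1.0), (1.0, 3.0), (3.0, 1.0), (3.0, 3.0), (1.8, 2.0)]
labels = [-1, -1, 1, 1, 1]

# Two hand-picked candidate lines, each given as (w, b).
candidates = {
    "wide": ((1.0, 0.0), -2.0),    # width-2 street around x = 2; the noisy point violates it
    "narrow": ((5.0, 0.0), -7.0),  # width-0.4 street around x = 1.4; no violations
}

def preferred(C):
    """Name of the candidate with the lower objective for this C."""
    costs = {
        name: soft_margin_objective(w, b, points, labels, C)
        for name, (w, b) in candidates.items()
    }
    return min(costs, key=costs.get)

for C in (1.0, 100.0):
    print(f"C={C}: prefers the {preferred(C)} street")
```

With a low C the wide street wins despite its one violation. With a high C the violation becomes too expensive and the narrow, violation-free street wins.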
The animation claimed the max-margin line is special — test that claim. Steer your own candidate line with the two sliders. While it separates the classes you'll see its street; the readout tracks how close you get to the widest one. The ringed points are the would-be support vectors: the street always rests on them.
Notice how many settings separate the data perfectly — yet only one street is widest. Tilt the line slightly off the optimum: it still classifies the training data flawlessly, but its narrower street means less room for new, slightly-shifted points. That's the SVM's entire argument.
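The same comparison can be made numerically. The sketch below assumes the street is centred on the candidate line, so its width is twice the distance to the nearest point. The toy points and both candidate lines are made up for illustration.

```python
import math

def street_width(w, b, points, labels):
    """Width of the empty street around the line w.x + b = 0.

    Returns None when the line does not separate the classes.
    """
    norm = math.hypot(*w)
    signed = [
        y * (w[0] * x[0] + w[1] * x[1] + b) / norm
        for x, y in zip(points, labels)
    ]
    if min(signed) <= 0:
        return None
    return 2 * min(signed)

def nearest_points(w, b, points, tol=1e-9):
    """Points the street rests on: those at the minimum distance from the line."""
    norm = math.hypot(*w)
    dists = [abs(w[0] * x[0] + w[1] * x[1] + b) / norm for x in points]
    closest = min(dists)
    return [p for p, d in zip(points, dists) if d - closest <= tol]

# Hypothetical toy data: three points per class.
points = [(1.0, 1.0), (1.0, 3.0), (0.0, 2.0), (3.0, 1.0), (3.0, 3.0), (4.0, 2.0)]
labels = [-1, -1, -1, 1, 1, 1]

optimum = ((1.0, 0.0), -2.0)  # the vertical line x = 2
tilted = ((1.0, 0.2), -2.4)   # same line rotated slightly about (2, 2)

print("optimum width:", street_width(*optimum, points, labels))
print("tilted width: ", street_width(*tilted, points, labels))
print("street rests on:", nearest_points(*optimum, points))
```

Both lines classify every training point correctly, but the tilted one leaves a narrower street. The two outer points, (0, 2) and (4, 2), never touch the street, so moving them farther out would not change either width.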
The kernel trick
What if the classes wrap around each other and no line can split them? Instead of drawing curves directly, an SVM lifts the data into a higher-dimensional space where a flat boundary does work — then projects that boundary back down as a curve.
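As a minimal sketch of the lift, take two concentric rings that no straight line can split and add a third coordinate z = x² + y². This particular lifting map, the ring radii, and the plane height are assumptions picked for illustration; other maps would also work.

```python
import math

def ring(radius, n):
    """n points evenly spaced on a circle of the given radius around the origin."""
    return [
        (radius * math.cos(2 * math.pi * k / n), radius * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]

inner = ring(1.0, 8)  # one class, wrapped by the other
outer = ring(3.0, 8)

def lift(p):
    """Map a 2-D point to 3-D by appending z = x^2 + y^2."""
    x, y = p
    return (x, y, x * x + y * y)

# In the lifted space the flat plane z = 5 sits between the two classes.
plane_z = 5.0
inner_below = all(lift(p)[2] < plane_z for p in inner)
outer_above = all(lift(p)[2] > plane_z for p in outer)
print("flat plane separates the lifted data:", inner_below and outer_above)
```

Projected back down to two dimensions, the plane z = 5 becomes the circle x² + y² = 5, which is the curved boundary between the rings.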
You never actually compute the high-dimensional coordinates. A kernel function computes the needed dot products directly, as if the data had been lifted — cheaply.
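A small numerical check makes this concrete. For the degree-2 polynomial kernel k(p, q) = (p · q)², one matching lift for 2-D points is φ(x₁, x₂) = (x₁², √2·x₁x₂, x₂²). The function names and sample points below are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(p):
    """Explicit lift of a 2-D point into the 3-D space of degree-2 monomials."""
    x1, x2 = p
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(p, q):
    """Same dot product as dot(phi(p), phi(q)), computed without lifting."""
    return dot(p, q) ** 2

a, b = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(a), phi(b))  # lift first, then take the dot product
shortcut = poly2_kernel(a, b)   # stay in 2-D the whole time
print(math.isclose(explicit, shortcut))
```

Both routes give the same number, but the shortcut never builds the lifted vectors. The saving grows quickly as the lifted space gets larger.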
Linear: A straight boundary. Fast, and a good fit when the data is already linearly separable.
Polynomial: Boundaries that bend. Adds interaction terms between features.
RBF (Gaussian): The popular default. Can wrap around clusters of almost any shape.
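Written out as plain functions, these three kernels look roughly as follows. They are the common textbook forms. The parameter names (degree, coef0, gamma) and their default values are assumptions for this sketch, and specific libraries may scale or parameterise the kernels differently.

```python
import math

def linear_kernel(p, q):
    """Plain dot product: the boundary stays straight."""
    return sum(a * b for a, b in zip(p, q))

def polynomial_kernel(p, q, degree=3, coef0=1.0):
    """(p.q + coef0)^degree: expanding it yields products between features."""
    return (linear_kernel(p, q) + coef0) ** degree

def rbf_kernel(p, q, gamma=0.5):
    """exp(-gamma * ||p - q||^2): 1 for identical points, near 0 for distant ones."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-gamma * squared_distance)

p, q = (1.0, 2.0), (3.0, 4.0)
print("linear:    ", linear_kernel(p, q))
print("polynomial:", polynomial_kernel(p, q, degree=2))
print("rbf:       ", rbf_kernel(p, q))
```

Each kernel acts as a similarity score between two points. The RBF kernel depends only on the distance between them, which is what lets its boundary follow clusters of irregular shape.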
When to use it
Good fit when:
- Data is high-dimensional (text, genomics)
- The dataset is small to medium
- A clear margin of separation exists
Poor fit when:
- There are millions of rows (training gets slow)
- Classes overlap heavily and noise is high
- You need probabilities (SVMs give scores, not calibrated odds)