Intro to Unsupervised Learning

Learning without an answer key

In supervised learning every example comes with the right answer. In unsupervised learning there are no labels at all — just raw data, and the goal is to discover the structure hiding inside it.

That makes it both powerful and tricky: powerful because unlabelled data is cheap and abundant, tricky because there's no answer key to score against. Instead, a result is judged by whether the structure the method finds turns out to be useful for the task at hand.

Supervised vs unsupervised

Supervised: "here are emails labelled spam or not spam, so learn to predict the label." Unsupervised: "here are a million emails, so group the similar ones," with nobody saying what the groups should be.

The three core tasks

Given the same unlabelled cloud of points, clustering, dimensionality reduction, and anomaly detection each extract something different.

What each task does

Clustering: group similar items

Partition data into groups of similar points. See K-Means and Hierarchical Clustering.
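
As a rough illustration of partitioning, here is a minimal K-Means sketch (Lloyd's algorithm) in plain Python. The function name `kmeans`, the toy points, and the choice to seed centroids with the first k points are assumptions made for this example; production libraries use smarter initialisation.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm on a list of equal-length tuples."""
    # Assumption: seed centroids with the first k points so the run is deterministic.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Toy data (made up): two loose blobs, one near (1, 1) and one near (9, 9).
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (0.8, 1.1), (9.1, 8.7), (8.8, 9.2)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Note that nobody told the algorithm which points belong together; the grouping comes only from distances between points.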

Dimensionality reduction: compress features

Squeeze many correlated features into a few, keeping the signal. See PCA.
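
To make "squeezing correlated features" concrete, the sketch below finds the first principal component of a tiny two-feature dataset and projects each row onto it, turning two numbers per row into one. It uses power iteration on the covariance matrix, which is one simple way to get the leading direction and is not necessarily how a PCA library computes it. The helper names and the data are hypothetical.

```python
import math

def first_principal_component(rows, iters=100):
    """Return (mean, unit direction of maximum variance) via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d  # arbitrary starting vector (an assumption of this sketch)
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def project(rows, mean, v):
    """Coordinate of each centred row along direction v."""
    return [sum((r[j] - mean[j]) * v[j] for j in range(len(v))) for r in rows]

# Made-up data where the second feature is roughly twice the first.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
mean, direction = first_principal_component(rows)
scores = project(rows, mean, direction)
print(direction)
print(scores)
```

Because the two features move together, a single score per row preserves almost all of the variation in the original pair.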

Anomaly detection: spot the odd one

Flag points that don't fit the patterns the rest of the data follows, such as fraud, faults, or intrusions.
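
One simple way to flag such points, sketched here under stated assumptions, is a robust z-score built from the median and the median absolute deviation (MAD). The function name `flag_anomalies`, the 3.5 cut-off (a commonly used convention, not a rule), and the transaction amounts are all illustrative.

```python
import statistics

def flag_anomalies(values, threshold=3.5):
    """Mark values whose robust z-score exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # Every value sits on the median (or nearly so); nothing to scale by.
        return [False] * len(values)
    # 0.6745 rescales MAD so scores are comparable to ordinary z-scores
    # when the data is roughly normal.
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]

# Made-up transaction amounts with one obvious outlier.
amounts = [12.0, 15.5, 11.2, 14.8, 13.1, 250.0, 12.9]
flags = flag_anomalies(amounts)
print(flags)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value can inflate the latter enough to hide itself.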

Where it shows up

Customer segmentation: marketing

Group customers by behaviour to target each segment differently.

Recommendation: find similar items

Cluster products or users to suggest "more like this".
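
The sketch below shows the "more like this" idea with a nearest-neighbour lookup by cosine similarity. This is a simpler stand-in for full clustering, not the same technique. The catalogue, its feature vectors, and the function names are invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def more_like_this(item, catalog, top_n=2):
    """Return the top_n other items most similar to `item`."""
    target = catalog[item]
    scored = [
        (name, cosine(target, vec))
        for name, vec in catalog.items()
        if name != item
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:top_n]]

# Hypothetical item vectors, e.g. how often each of four customer
# groups bought the item. No labels say which items are "related".
catalog = {
    "trail shoes": [5, 4, 0, 0],
    "running socks": [4, 5, 0, 1],
    "espresso maker": [0, 0, 5, 4],
    "coffee grinder": [0, 1, 4, 5],
}
suggestions = more_like_this("trail shoes", catalog, top_n=1)
print(suggestions)
```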

Compression & viz: 2D maps

Reduce high-dimensional data to 2D so you can see and explore it.

Pretraining: self-supervised

Modern LLMs learn from unlabelled text by constructing their own prediction task from it, typically predicting the next token from the tokens that came before.
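
A toy sketch of how unlabelled text can supply its own targets: slide a window over the tokens and treat whatever comes next as the "label". Whitespace splitting and the `context_size` parameter are simplifying assumptions; real models use subword tokenizers and far longer contexts.

```python
def next_token_pairs(text, context_size=3):
    """Turn raw text into (context, next_token) training pairs."""
    tokens = text.split()  # assumption: naive whitespace tokenisation
    return [
        (tokens[i - context_size:i], tokens[i])
        for i in range(context_size, len(tokens))
    ]

pairs = next_token_pairs("the cat sat on the mat")
for context, target in pairs:
    print(context, "->", target)
```

No human wrote any of these targets; each one is simply the next word already present in the text.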

The honest challenge

Strengths
  • Works on cheap, unlabelled data
  • Reveals patterns nobody thought to look for
  • Great for exploration before modelling
Caveats
  • No ground truth to validate against
  • Results can be hard to interpret
  • Often needs you to choose the number of clusters K or an anomaly threshold
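
One common heuristic for picking the number of clusters K when there is no ground truth is to run clustering for several values of K and look for the point where the within-cluster squared error (inertia) stops dropping sharply, often called the elbow. The sketch below does this for one-dimensional data. The function name `kmeans_1d_inertia`, the evenly spaced seeding, and the data are assumptions for illustration, and the elbow is a judgment call, not a guarantee.

```python
def kmeans_1d_inertia(values, k, iters=50):
    """Run 1-D K-Means and return the final within-cluster squared error."""
    values = sorted(values)
    n = len(values)
    # Assumption: seed centroids at evenly spaced positions in sorted order.
    centroids = [values[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            groups[nearest].append(v)
        # Empty groups keep their previous centroid.
        centroids = [
            sum(g) / len(g) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return sum(min((v - c) ** 2 for c in centroids) for v in values)

# Made-up data with three visible clumps near 1, 5, and 9.
data = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8, 9.1, 9.0, 8.7]
inertias = {k: kmeans_1d_inertia(data, k) for k in range(1, 6)}
for k, inertia in inertias.items():
    print(k, round(inertia, 3))
```

On this data the error falls steeply up to K = 3 and barely changes afterwards, which suggests three clusters. On messier data the elbow is often ambiguous, which is exactly the caveat above.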