t-SNE & UMAP

ML, dimensionality reduction, visualization, embeddings

Seeing data with hundreds of dimensions

A single row of data can have hundreds of features: pixels, word-vector components, gene expression levels. You can't plot 300 axes. t-SNE and UMAP work around this by squeezing the data down to 2 dimensions you can actually look at, arranged so that points that were close in the high-dimensional space stay close in 2D. Hidden clusters jump out of the page.

vs PCA: curved, not flat

PCA is linear: it projects the data onto the straight directions of greatest variance. t-SNE and UMAP are nonlinear and can follow curved structure (manifolds), so they often separate clusters much more clearly for viewing.

Local focus: keep neighbours

Both prioritize getting each point's nearest neighbours right, even at the cost of global layout.
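One way to make "getting the nearest neighbours right" concrete is to count how many of each point's k nearest neighbours in the original space are still among its k nearest neighbours in the 2D layout. The sketch below is only an illustration of that idea, using the standard library; the function names and the toy data are made up for this example and are not part of t-SNE or UMAP themselves.

```python
import math

def nearest_neighbours(points, k):
    """For each point, return the set of indices of its k nearest neighbours."""
    result = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        result.append(set(others[:k]))
    return result

def neighbour_preservation(high, low, k):
    """Fraction of high-D k-nearest neighbours that survive in the low-D layout."""
    high_nn = nearest_neighbours(high, k)
    low_nn = nearest_neighbours(low, k)
    kept = sum(len(a & b) for a, b in zip(high_nn, low_nn))
    return kept / (k * len(high))

# Hypothetical toy data: six points along a line in 3-D.
high = [(float(i), 2.0 * i, 0.0) for i in range(6)]
# A layout that keeps the original ordering along the line...
good_layout = [(float(i), 0.0) for i in range(6)]
# ...and one that scrambles it.
scrambled = [(float(x), 0.0) for x in (0, 3, 1, 4, 2, 5)]

good_score = neighbour_preservation(high, good_layout, k=2)
bad_score = neighbour_preservation(high, scrambled, k=2)
print(good_score, bad_score)
```

A score of 1.0 means every local neighbourhood survived the squeeze to 2D; lower scores mean neighbours were torn apart. Note that the score says nothing about whether the global layout is faithful.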

UMAP: faster, more global

UMAP is usually faster on large datasets and tends to preserve more of the global shape than t-SNE, though neither method preserves global distances exactly.

From tangled high-D to a readable map

Conceptually: measure who's near whom in the original space, then drop everything onto a 2D canvas and nudge points until their neighbourhoods match — neighbours attract, strangers repel.
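The attract-and-repel picture can be sketched in a few dozen lines. The code below is a stripped-down, t-SNE-style loop and not what any real library does: it assumes a fixed Gaussian width (`sigma`) instead of a perplexity search, plain gradient descent with no momentum or early exaggeration, and a tiny made-up dataset of two blobs. All function names and parameter values are illustrative.

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def high_d_affinities(X, sigma=1.0):
    """Symmetric 'who is near whom' probabilities P (Gaussian kernel, fixed sigma)."""
    n = len(X)
    cond = []
    for i in range(n):
        row = [0.0 if i == j else math.exp(-sq_dist(X[i], X[j]) / (2 * sigma ** 2))
               for j in range(n)]
        total = sum(row)
        cond.append([v / total for v in row])
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)] for i in range(n)]

def low_d_affinities(Y):
    """Heavy-tailed (Student-t) similarities W in 2-D and their normalised form Q."""
    n = len(Y)
    W = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(Y[i], Y[j])) for j in range(n)]
         for i in range(n)]
    Z = sum(map(sum, W))
    return W, [[w / Z for w in row] for row in W]

def kl_divergence(P, Q):
    """Mismatch between high-D and 2-D neighbourhoods (lower is better)."""
    return sum(p * math.log(p / q)
               for rp, rq in zip(P, Q) for p, q in zip(rp, rq) if p > 0 and q > 0)

def layout(X, steps=500, lr=1.0, seed=0):
    rng = random.Random(seed)
    P = high_d_affinities(X)
    # Start from a tiny random blob on the 2-D canvas.
    Y = [[rng.gauss(0, 0.01), rng.gauss(0, 0.01)] for _ in X]
    history = []
    for _ in range(steps):
        W, Q = low_d_affinities(Y)
        history.append(kl_divergence(P, Q))
        new_Y = []
        for i, yi in enumerate(Y):
            gx = gy = 0.0
            for j, yj in enumerate(Y):
                if i == j:
                    continue
                # P > Q: neighbours too far apart, pull together.
                # P < Q: strangers too close, push apart.
                force = 4.0 * (P[i][j] - Q[i][j]) * W[i][j]
                gx += force * (yi[0] - yj[0])
                gy += force * (yi[1] - yj[1])
            new_Y.append([yi[0] - lr * gx, yi[1] - lr * gy])
        Y = new_Y
    return Y, history

def mean_distance(Y, pairs):
    return sum(math.dist(Y[i], Y[j]) for i, j in pairs) / len(pairs)

# Hypothetical data: two well-separated blobs of 6 points each in 5-D.
data_rng = random.Random(42)
def blob(center, n=6, dim=5, spread=0.3):
    return [[center + data_rng.gauss(0, spread) for _ in range(dim)] for _ in range(n)]

X = blob(0.0) + blob(6.0)
Y, history = layout(X)
within = mean_distance(Y, [(i, j) for i in range(6) for j in range(i + 1, 6)])
between = mean_distance(Y, [(i, j) for i in range(6) for j in range(6, 12)])
print(history[0], history[-1], within, between)
```

Real implementations add a lot on top of this (per-point bandwidths, momentum, approximate neighbour search), but the core nudge is the same: compare neighbourhoods in the two spaces and move each point to shrink the mismatch.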

Read the map carefully

What the picture means — and doesn't

Tight clumps usually reflect real similarity: points that land together tend to be close in the original space. But distances between clusters, cluster sizes, and empty gaps are not reliable. t-SNE in particular distorts them, and the result changes with settings like perplexity and with the random seed. Use these plots to spot structure and sanity-check embeddings, never to measure how far apart groups "really" are.
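To see why perplexity matters so much, it helps to know roughly what it controls in t-SNE: for each point, the width of the Gaussian used to weigh its neighbours is tuned until the resulting neighbour distribution has the requested perplexity, which behaves like an "effective number of neighbours". The sketch below shows that tuning for a single point. It is a simplified illustration with made-up squared distances and hypothetical function names, not a copy of any library's routine.

```python
import math

def perplexity(sq_dists, sigma):
    """Perplexity (2 ** entropy) of one point's neighbour distribution."""
    d_min = min(sq_dists)  # shift for numerical stability; cancels on normalising
    weights = [math.exp(-(d - d_min) / (2 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy

def sigma_for_perplexity(sq_dists, target, lo=1e-3, hi=1e3, iters=60):
    """Bisect (on a log scale) for the Gaussian width that hits the target."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if perplexity(sq_dists, mid) < target:
            lo = mid  # distribution too peaked: widen the kernel
        else:
            hi = mid  # distribution too flat: narrow the kernel
    return math.sqrt(lo * hi)

# Hypothetical squared distances from one point to its 8 companions.
sq_dists = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0]

sigma_small = sigma_for_perplexity(sq_dists, target=2.0)
sigma_mid = sigma_for_perplexity(sq_dists, target=3.0)
sigma_large = sigma_for_perplexity(sq_dists, target=5.0)
print(sigma_small, sigma_mid, sigma_large)
```

A higher perplexity forces a wider kernel, so each point "listens" to more of its surroundings. That changes which points count as neighbours, which is why the same data can look quite different at different perplexity values.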

Great for
  • Visualizing clusters in embeddings
  • Sanity-checking a model's learned features
  • Exploring unlabelled data
Not for
  • Measuring true distances between groups
  • Features fed into another model (the layout changes from run to run unless the seed is fixed, and standard t-SNE is non-parametric, so it can't map new points)
  • Reading meaning into gap sizes