t-SNE & UMAP
Seeing data with hundreds of dimensions
A row of data can have hundreds of features — pixels, word vectors, gene levels. You can't plot 300 axes. t-SNE and UMAP solve this: they squeeze the data down to 2 dimensions you can actually look at, arranged so that points which were close in high-D stay close in 2D. Hidden clusters jump out of the page.
PCA is linear: it projects onto the straight directions of greatest variance. t-SNE and UMAP are nonlinear and can follow curved structure (manifolds), so they often separate clusters more clearly for viewing.
Both prioritize getting each point's nearest neighbours right, even at the cost of global layout.
UMAP is usually faster on large datasets and tends to preserve more of the global shape than t-SNE, though neither method guarantees it.
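To make the comparison concrete, here is a minimal sketch using scikit-learn. The digits dataset, the 500-row subsample, and the parameter values are illustrative assumptions, not recommendations. It scores how well each 2D layout keeps every point's nearest neighbours with scikit-learn's `trustworthiness` function (closer to 1 is better).

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

# 64-dimensional rows (8x8 pixel images); subsample to keep the run short.
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]

# Linear projection onto the two directions of greatest variance.
pca_2d = PCA(n_components=2, random_state=0).fit_transform(X)

# Nonlinear neighbour embedding; perplexity=30 is an arbitrary starting point.
tsne_2d = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(X)

# Fraction-like score of how well local neighbourhoods survive in 2D.
trust_pca = trustworthiness(X, pca_2d, n_neighbors=10)
trust_tsne = trustworthiness(X, tsne_2d, n_neighbors=10)
print("PCA   trustworthiness:", round(trust_pca, 3))
print("t-SNE trustworthiness:", round(trust_tsne, 3))
```

If the separate umap-learn package is installed, `umap.UMAP(n_components=2).fit_transform(X)` follows the same pattern and can be scored the same way.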
From tangled high-D to a readable map
Conceptually: measure who's near whom in the original space, then drop everything onto a 2D canvas and nudge points until their neighbourhoods match — neighbours attract, strangers repel.
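The loop described above can be sketched in a few lines of NumPy. This is a stripped-down, t-SNE-style toy, not the real algorithm: it assumes a single fixed Gaussian bandwidth (`sigma`) instead of a perplexity search, and it omits momentum and early exaggeration. The function names, the two-blob data, and the step size are all hypothetical choices for illustration.

```python
import numpy as np

def sq_dists(A):
    """Pairwise squared Euclidean distances between the rows of A."""
    sq = np.sum(A * A, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * A @ A.T, 0.0)

def toy_embed(X, n_iter=500, lr=5.0, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: measure who is near whom in the original space.
    P = np.exp(-sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum()
    # Step 2: drop everything onto a 2D canvas at random.
    Y = rng.normal(scale=1e-2, size=(X.shape[0], 2))
    # Step 3: nudge points until the 2D similarities Q resemble P.
    for _ in range(n_iter):
        W = 1.0 / (1.0 + sq_dists(Y))  # heavy-tailed similarity on the canvas
        np.fill_diagonal(W, 0.0)
        Q = W / W.sum()
        F = (P - Q) * W  # F > 0: the pair attracts; F < 0: the pair repels
        grad = 4.0 * (np.diag(F.sum(axis=1)) - F) @ Y
        Y -= lr * grad
    return Y

def knn_overlap(X, Y, k=5):
    """Mean fraction of each point's k nearest neighbours shared by X and Y."""
    def knn(A):
        D = sq_dists(A)
        np.fill_diagonal(D, np.inf)
        return np.argsort(D, axis=1)[:, :k]
    pairs = zip(knn(X), knn(Y))
    return float(np.mean([len(set(a) & set(b)) / k for a, b in pairs]))

rng = np.random.default_rng(42)
# Two hypothetical blobs in 5-D, 30 points each, centres 6 units apart.
shift = np.array([6.0, 0.0, 0.0, 0.0, 0.0])
X = np.vstack([rng.normal(0.0, 0.5, size=(30, 5)),
               rng.normal(0.0, 0.5, size=(30, 5)) + shift])

Y = toy_embed(X)
random_layout = rng.normal(size=(60, 2))
print("kNN overlap, optimised layout:", knn_overlap(X, Y))
print("kNN overlap, random layout:   ", knn_overlap(X, random_layout))
```

The sign of `P - Q` is the whole idea: pairs that are more similar in the original space than on the canvas pull together, and all other pairs push apart.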
Read the map carefully
Tight clumps usually reflect real similarity, although at low perplexity t-SNE can produce apparent clusters even in random noise. Distances between clusters, cluster sizes, and empty gaps are not reliable: t-SNE in particular distorts them, and the result changes with settings such as perplexity. Use these plots to spot structure and sanity-check embeddings, never to measure how far apart groups "really" are.
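One way to check this on your own data is to compare a known gap in the original space with the same gap on the map, at more than one perplexity. The sketch below assumes three synthetic blobs placed so that blob C is about ten times farther from A than B is. The blob positions, sizes, and perplexity values are made up for illustration, and the printed ratios will vary with the scikit-learn version and seed.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Three hypothetical blobs in 10-D: centres at 0, 5, and 50 along the first axis.
centres = np.zeros((3, 10))
centres[1, 0] = 5.0
centres[2, 0] = 50.0
X = np.vstack([c + rng.normal(size=(50, 10)) for c in centres])
labels = np.repeat([0, 1, 2], 50)

def gap_ratio(points):
    """Centroid distance A-to-C divided by centroid distance A-to-B."""
    a, b, c = (points[labels == i].mean(axis=0) for i in range(3))
    return float(np.linalg.norm(a - c) / np.linalg.norm(a - b))

print("original space:", round(gap_ratio(X), 2))

ratios = {}
for perplexity in (5, 30):
    layout = TSNE(n_components=2, perplexity=perplexity,
                  random_state=0).fit_transform(X)
    ratios[perplexity] = gap_ratio(layout)
    print("perplexity", perplexity, "->", round(ratios[perplexity], 2))
```

If the ratio on the map differs from the original, or shifts between perplexity values, that is the distortion described above showing up in numbers.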
Good for
- Visualizing clusters in embeddings
- Sanity-checking a model's learned features
- Exploring unlabelled data
Not for
- Measuring true distances between groups
- Features fed into another model (standard t-SNE is non-parametric: it relearns the layout on every run and cannot map new points; some UMAP implementations can, but the layout is still built for viewing)
- Reading meaning into gap sizes
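The point about feeding features into another model can be seen directly in scikit-learn, used here as one example implementation. `PCA` exposes a `transform` method for new rows, while `TSNE` offers only `fit_transform`. The random data and the two seeds below are arbitrary assumptions. Fixing `random_state` makes a single run repeatable, but it still gives no way to place unseen points.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # hypothetical 20-dimensional data

# PCA learns a reusable mapping; scikit-learn's TSNE does not.
print("PCA has transform: ", hasattr(PCA(), "transform"))
print("TSNE has transform:", hasattr(TSNE(), "transform"))

# Two runs that differ only in their random starting layout.
run_a = TSNE(init="random", random_state=1).fit_transform(X)
run_b = TSNE(init="random", random_state=2).fit_transform(X)
print("identical layouts:", np.allclose(run_a, run_b))
```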