t-SNE & UMAP · Suman Bhadra Notes

Seeing data with hundreds of dimensions

A row of data can have hundreds of features: pixels, word vectors, gene expression levels. You can't plot 300 axes. t-SNE and UMAP work around this by squeezing the data down to 2 dimensions you can actually look at, arranged so that points which were close in the original high-dimensional space stay close in 2D. Clusters that were hidden in the raw features often become visible on the page.
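As a minimal sketch of the idea, assuming scikit-learn is installed: the snippet below takes 64-pixel digit images and asks t-SNE for a 2D layout. The dataset, the 300-row subset, and the parameter values are illustrative choices, not settings from these notes.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
X = digits.data[:300]    # 300 images, 64 pixel features each
y = digits.target[:300]  # digit labels, used only for colouring a plot

# Ask for a 2D layout; perplexity roughly sets how many neighbours matter.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X)

print(X.shape, "->", X_2d.shape)
```

To view the map, scatter `X_2d[:, 0]` against `X_2d[:, 1]` and colour the points by `y`.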

vs PCA: curved, not flat

PCA finds straight directions of maximum variance; t-SNE and UMAP follow curved structure (manifolds), so in a 2D plot they often separate clusters more clearly than PCA does.

Local focus: keep neighbours

Both prioritize getting each point's nearest neighbours right, even at the cost of global layout.
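One way to check the "neighbours first" claim is scikit-learn's `trustworthiness` score, which lies between 0 and 1 and penalises points that look like neighbours in 2D but were not neighbours in the original space. The sketch below compares a PCA projection with a t-SNE layout on the same data; the dataset and `n_neighbors=5` are assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

X = load_digits().data[:300]  # 300 images, 64 features each

# Two different 2D views of the same data.
X_pca = PCA(n_components=2, random_state=0).fit_transform(X)
X_tsne = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(X)

# Higher means the 2D neighbourhoods are more faithful to the originals.
t_pca = trustworthiness(X, X_pca, n_neighbors=5)
t_tsne = trustworthiness(X, X_tsne, n_neighbors=5)

print(f"PCA   trustworthiness: {t_pca:.3f}")
print(f"t-SNE trustworthiness: {t_tsne:.3f}")
```

The score says nothing about global layout, which is exactly the part these methods are willing to sacrifice.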

UMAP: faster, more global

UMAP is typically quicker than t-SNE on big data and tends to preserve more of the global shape.

From tangled high-D to a readable map

Conceptually: measure who's near whom in the original space, then drop everything onto a 2D canvas and nudge points until their neighbourhoods match — neighbours attract, strangers repel.
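The attract/repel picture can be sketched as a toy force layout in NumPy. This is not the actual t-SNE or UMAP objective: the synthetic blobs, the k-nearest-neighbour graph, the force formulas, and the step size are all simplifications assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated blobs in 10 dimensions (synthetic data).
n_per, dim, k = 20, 10, 5
X = np.vstack([rng.normal(0.0, 1.0, (n_per, dim)),
               rng.normal(8.0, 1.0, (n_per, dim))])
labels = np.repeat([0, 1], n_per)
n = len(X)

# Step 1: measure who is near whom in the original space (k nearest neighbours).
d_high = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
np.fill_diagonal(d_high, np.inf)
nbrs = np.argsort(d_high, axis=1)[:, :k]
A = np.zeros((n, n))
A[np.arange(n)[:, None], nbrs] = 1.0
A = np.maximum(A, A.T)  # symmetric neighbour graph

# Step 2: drop everything onto a 2D canvas at random, then nudge.
Y = rng.normal(0.0, 1.0, (n, 2))
for _ in range(300):
    diff = Y[:, None, :] - Y[None, :, :]   # diff[i, j] = Y[i] - Y[j]
    dist2 = (diff ** 2).sum(-1) + 1e-3     # squared 2D distances, kept above zero
    attract = -(A[:, :, None] * diff).sum(axis=1)                         # neighbours pull
    repel = ((1.0 - A)[:, :, None] * diff / dist2[:, :, None]).sum(axis=1)  # strangers push
    Y += 0.05 * (attract + repel)

# Check: does each point's nearest 2D neighbour come from its own blob?
d_low = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
np.fill_diagonal(d_low, np.inf)
same_blob = (labels[d_low.argmin(axis=1)] == labels).mean()
print("fraction with same-blob nearest neighbour:", same_blob)
```

The real algorithms replace the hard neighbour graph with soft similarity weights and minimise a proper loss, but the push and pull has the same flavour.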

Read the map carefully

What the picture means — and doesn't

Tight clumps are usually meaningful: those points are similar in the original space. But distances between clusters, cluster sizes, and empty gaps are not reliable. t-SNE in particular distorts them, and the result changes with settings like perplexity and with the random seed. Use these plots to spot structure and sanity-check embeddings, never to measure how far apart groups "really" are.
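A small experiment makes the warning concrete. In the sketch below, three synthetic blobs are placed so that blob 2 is ten times farther from blob 0 than blob 1 is; the same ratio is then measured between cluster centroids in the 2D layout at several perplexities. The data, the helper `centroid_gap_ratio`, and the perplexity values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Three blobs in 20-D: blob 1 is 10 units from blob 0, blob 2 is 100 units away.
centers = np.zeros((3, 20))
centers[1, 0] = 10.0
centers[2, 0] = 100.0
X, y = make_blobs(n_samples=150, centers=centers, cluster_std=1.0, random_state=0)

def centroid_gap_ratio(Y, y):
    """Distance(blob 0, blob 2) divided by distance(blob 0, blob 1) in the layout."""
    c = np.array([Y[y == label].mean(axis=0) for label in range(3)])
    return np.linalg.norm(c[2] - c[0]) / np.linalg.norm(c[1] - c[0])

ratios = {}
for perplexity in (5, 30, 50):
    Y = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X)
    ratios[perplexity] = centroid_gap_ratio(Y, y)
    print(f"perplexity={perplexity:>2}: 2D gap ratio = {ratios[perplexity]:.2f} (true ratio is 10)")
```

If the printed ratios drift away from 10 or disagree with each other, that is the distortion described above showing up in numbers.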

Great for

Visualizing clusters in embeddings
Sanity-checking a model's learned features
Exploring unlabelled data

Not for

Measuring true distances between groups
Features fed into another model (t-SNE is non-parametric — it regenerates the layout each run and can't map new points; UMAP can project new points onto an existing embedding via transform(), but the layout still serves visualization better than model features)
Reading meaning into gap sizes
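The "can't map new points" limitation is visible in the API itself. A minimal check, assuming scikit-learn: its `TSNE` class offers `fit_transform()` but no `transform()`, whereas a parametric method such as PCA can place unseen rows into an existing projection. PCA stands in here only as a contrast; the train/new split is an illustrative assumption.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = load_digits().data
X_train, X_new = X[:300], X[300:350]

# PCA learns a mapping, so unseen rows can be projected later.
pca = PCA(n_components=2).fit(X_train)
new_2d = pca.transform(X_new)
print("PCA projected new points:", new_2d.shape)

# scikit-learn's t-SNE only lays out the rows it was fitted on.
has_transform = hasattr(TSNE(), "transform")
print("t-SNE has transform():", has_transform)
```

To place new rows with t-SNE you would have to refit on old and new data together, which produces a fresh layout.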