CLIP (2021) — One Space for Words and Pictures
The world before this paper
In 2020, nearly every mainstream vision model was a prisoner of its label list. An ImageNet classifier could spot a golden retriever with superhuman confidence, but show it anything outside its 1,000 memorized classes and it had no way to answer. Vision and language lived on separate planets: image models fed on expensive hand-labeled datasets, while right next door the web held hundreds of millions of images already described, for free, by the people who posted them.
A supervised classifier outputs one of a fixed set of labels. Anything outside that set simply does not exist for it.
Adding a single category meant collecting thousands of hand-labels and training all over again. Labels are slow and expensive.
The web already paired hundreds of millions of images with natural-language captions, and almost nobody was learning from them at that scale.
The key idea
In early 2021, Alec Radford and colleagues at OpenAI — the same Radford behind the GPT papers — published "Learning Transferable Visual Models From Natural Language Supervision" (Radford et al., ICML 2021). Their bet was almost contrarian in its simplicity: stop curating labels and let the messy web be the teacher. They scraped 400 million image–caption pairs, built a ViT (or ResNet) image encoder and a transformer text encoder, and trained both on one contrastive game — in every batch, each image must hug its own caption and shun everyone else's.
The payoff is the trick that gives this article its title. Once words and pictures share one space, a classifier is just a sentence. Want to recognize zebras? Embed "a photo of a zebra" and check which images land nearby. No labels, no retraining. And yes, the phrasing matters: the paper reports that wrapping the label as "a photo of a {label}" improves ImageNet accuracy by about 1.3 points over the bare label.
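The "classifier is just a sentence" recipe can be sketched in a few lines of plain Python. This is a minimal illustration, not CLIP itself: the lookup table `TOY_TEXT_EMBEDDINGS`, the helper `toy_embed_text`, and the three-dimensional vectors are all hypothetical stand-ins for a trained text encoder and a real image embedding.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_vec, labels, embed_text,
                       template="a photo of a {label}"):
    """Pick the label whose prompt embedding lies closest to the image."""
    prompts = [template.format(label=label) for label in labels]
    scores = [cosine(image_vec, embed_text(p)) for p in prompts]
    best = max(range(len(labels)), key=scores.__getitem__)
    return labels[best], scores

# Hypothetical stand-in for a trained text encoder: a hand-written table.
TOY_TEXT_EMBEDDINGS = {
    "a photo of a zebra": [0.9, 0.1, 0.0],
    "a photo of a horse": [0.6, 0.7, 0.1],
    "a photo of a tiger": [0.1, 0.2, 0.9],
}

def toy_embed_text(prompt):
    return TOY_TEXT_EMBEDDINGS[prompt]

# Hypothetical image embedding, chosen by hand to sit near the zebra prompt.
image_vec = [0.8, 0.2, 0.1]

label, scores = zero_shot_classify(
    image_vec, ["zebra", "horse", "tiger"], toy_embed_text
)
print(label, [round(s, 3) for s in scores])
```

Adding a new class here means adding one more sentence to the label list. Nothing is retrained, which is the whole point of the shared space.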
Train an image encoder and a text encoder together — contrastively, on 400M web pairs — so that true image–caption pairs land close in one shared embedding space and mismatched pairs land far apart, turning classification into retrieval: just describe the class in words.
Want the full mechanics of "close" and "far"? See Cosine similarity.
Watch the diagonal light up
The whole paper fits in one picture. A batch of N images and their N captions forms an N × N similarity matrix, where entry (i, j) scores image i against caption j; training pulls the diagonal bright and pushes everything else dark. Once that geometry exists, zero-shot classification falls out for free.
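To make the matrix concrete, here is a small sketch of the contrastive game in plain Python. It follows the symmetric cross-entropy form of the paper's pseudocode, with two simplifications that are assumptions of this sketch: the temperature is fixed at 0.07 (the paper learns it during training), and the tiny hand-written vectors are hypothetical stand-ins for encoder outputs.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosines."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_vecs, text_vecs, temperature=0.07):
    """N x N scaled cosine similarities; entry [i][j] is image i vs caption j."""
    imgs = [normalize(v) for v in image_vecs]
    txts = [normalize(v) for v in text_vecs]
    return [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

def cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Average of the image-to-text and text-to-image losses."""
    logits = similarity_matrix(image_vecs, text_vecs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    loss_images = cross_entropy_on_diagonal(logits)      # image picks caption
    loss_texts = cross_entropy_on_diagonal(transposed)   # caption picks image
    return (loss_images + loss_texts) / 2

# Hypothetical toy batch: three images and their three captions.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_captions = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled_captions = aligned_captions[1:] + aligned_captions[:1]

print(contrastive_loss(images, aligned_captions))   # small: diagonal is bright
print(contrastive_loss(images, shuffled_captions))  # large: diagonal is dark
```

When each caption sits next to its own image, the loss is close to zero. Shuffle the captions and the loss jumps, which is the signal that pushes the two encoders toward a bright diagonal.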
The results that mattered
The numbers landed like a thunderclap. A model that had never seen a single ImageNet label matched the original supervised ResNet-50 on ImageNet, at 76.2% top-1 accuracy. And when the test images turned into sketches, artistic renditions, or naturally adversarial examples, supervised models crumbled while CLIP held up far better.
400 million image–text pairs scraped from the web. Captions as free supervision: no annotators, no taxonomy.
Matches the supervised baseline with zero ImageNet labels — and stays robust where supervised models break.
Every batch of N true pairs supplies N² − N mismatches to push apart. The contrast is the curriculum.
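For a rough sense of scale, take the batch size of 32,768 pairs reported in the paper (a figure this article does not otherwise state, so treat it as an outside assumption). The number of mismatched pairings in a single batch then works out to:

```latex
N^2 - N = 32{,}768^2 - 32{,}768 = 1{,}073{,}741{,}824 - 32{,}768 = 1{,}073{,}709{,}056
```

That is roughly a billion off-diagonal entries to push apart, against only 32,768 diagonal entries to pull together.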
Legacy — and the catch
CLIP's shared space quietly became infrastructure. Its text encoder is the steering wheel inside DALL·E 2 and Stable Diffusion — when you type a prompt and a picture appears, CLIP's geometry is the bridge the words crossed.
- Open-vocabulary vision: describe a class, get a classifier
- One shared space powers search, retrieval, and generation guidance
- Robust where supervised models are brittle
- Inherits web-scale biases wholesale
- Weak at counting, spatial relations, fine-grained reading
- Contrastive embeddings ≠ understanding — it matches, it doesn't reason
Read the original: arXiv:2103.00020 (Radford et al., ICML 2021). For the geometry, see Cosine similarity; for what CLIP steers, Diffusion models; for where its embeddings live in production, Vector databases. Next paper: DDPM (2020).