ViT (2020) — An Image Is Worth 16×16 Words
The world before this paper
In 2020, computer vision had a ruling dynasty: the convolutional neural network. Every breakthrough of the deep learning decade — AlexNet, ResNet, EfficientNet — was a CNN. Convolutions came with beliefs about images built into the wiring: nearby pixels matter most, and a cat in the corner is the same cat in the center. Those priors seemed not just useful but necessary. Meanwhile, over in language, transformers had taken everything. Vision researchers noticed — but the obvious move looked impossible.
Locality and translation equivariance were baked into the architecture itself, and the field assumed images could not be learned without them.
Self-attention appeared in vision only as an add-on — a block bolted onto a convolutional backbone, never the main act.
Transformers had eaten NLP, but attention cost grows with the square of the sequence length, and even a modest 224×224 image has 50,176 pixels.
The key idea
A team at Google Brain — Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR 2021 — made a deliberately stubborn bet: change nothing about the transformer, and change the image instead. If pixels are too many to attend over, stop attending over pixels. Cut the image into 16×16 squares and treat each square the way NLP treats a word. A 224×224 photo collapses into just 196 patch tokens — a sentence-sized sequence. Add a learnable [CLS] token to soak up the summary, put a small classification head on top of it, and let a plain encoder do everything else. No convolutions. Anywhere.
The deeper wager was philosophical: maybe convolution's beloved priors aren't laws of vision — just training wheels. Given enough data, a transformer should learn locality on its own, without ever being limited by it.
Don't attend over pixels — slice the image into 16×16 patches, linearly embed each patch as a "word", add position embeddings, and feed the sequence to a completely standard transformer encoder.
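As a rough illustration of that recipe, here is a minimal NumPy sketch of the input pipeline: slice, flatten, project, prepend [CLS], add position embeddings. The function names (`patchify`, `embed`), the embedding width `D = 64`, and the random weights are assumptions made for this example; in a real model the projection, the [CLS] token, and the position embeddings are learned parameters.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "sketch assumes divisible sides"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (gh*gw, patch*patch*c)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def embed(image, w_proj, cls_token, pos_embed, patch=16):
    """Linearly embed each patch, prepend [CLS], add position embeddings."""
    tokens = patchify(image, patch) @ w_proj                # (N, D)
    tokens = np.concatenate([cls_token, tokens], axis=0)    # (N + 1, D)
    return tokens + pos_embed

rng = np.random.default_rng(0)
D = 64                                          # illustrative width, not the paper's
image = rng.random((224, 224, 3))               # stand-in for a real photo
w_proj = rng.normal(0, 0.02, (16 * 16 * 3, D))  # would be learned
cls_token = np.zeros((1, D))                    # would be learned
pos_embed = rng.normal(0, 0.02, (197, D))       # would be learned

seq = embed(image, w_proj, cls_token, pos_embed)
print(seq.shape)  # (197, 64): 196 patch tokens plus [CLS]
```

From here on, nothing in the model needs to know the input was ever an image: `seq` is just a sequence of vectors.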
Want the full mechanics? See Transformer architecture.
Watch an image become a sentence
Follow one photo through ViT: it gets diced into patches, the patches unroll into position-stamped tokens, a [CLS] token joins the front, attention wires every patch to every other from the very first layer — and the head reads out the answer.
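The attention and read-out steps of that walkthrough can be sketched in a few lines of NumPy. This is a deliberately stripped-down assumption: a single attention head, one layer, random weights, and no layer norm or MLP, whereas a standard encoder block stacks multi-head attention with those pieces. The names `self_attention`, `w_head`, and the sizes used are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over the whole sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N, N): every token vs. every token
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, d, n_classes = 197, 64, 10           # 196 patches + [CLS]; sizes are illustrative
tokens = rng.normal(size=(n_tokens, d))        # stand-in for embedded patches
w_q, w_k, w_v = (rng.normal(0, d ** -0.5, (d, d)) for _ in range(3))
w_head = rng.normal(0, d ** -0.5, (d, n_classes))

attended, weights = self_attention(tokens, w_q, w_k, w_v)
out = tokens + attended                        # residual connection
logits = out[0] @ w_head                       # the head reads only the [CLS] position

print(weights.shape)  # (197, 197)
print(logits.shape)   # (10,)
```

Row 0 of `weights` is the [CLS] token's attention over every patch, which is how a single position can summarize the whole image.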
The results that mattered
ViT's results split cleanly along one axis: data. With modest data, the convolution-free model loses — it has to relearn from scratch what CNNs get for free. With enormous data, it pulls ahead, and does so using less pretraining compute than its rivals.
A whole 224×224 image becomes just 196 patch tokens — a sequence short enough that quadratic attention is entirely affordable.
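The arithmetic behind that claim is short enough to check directly. The sketch below counts query-key pairs per attention map as a proxy for cost; `attention_pairs` is a name invented for this example, and real cost also depends on embedding width, head count, and depth.

```python
def attention_pairs(seq_len):
    """Number of query-key pairs in one full self-attention map."""
    return seq_len * seq_len

side, patch = 224, 16
pixel_tokens = side * side            # one token per pixel
patch_tokens = (side // patch) ** 2   # one token per 16x16 patch

print(pixel_tokens, attention_pairs(pixel_tokens))  # 50176 2517630976
print(patch_tokens, attention_pairs(patch_tokens))  # 196 38416
print(attention_pairs(pixel_tokens) // attention_pairs(patch_tokens))  # 65536
```

Each patch replaces 256 pixels, so the attention map shrinks by a factor of 256² = 65,536.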
Trained only on ImageNet, ViT lands a few percentage points below ResNets of comparable size. Pretrained on JFT-300M, it matches or beats state-of-the-art CNNs, with less pretraining compute.
ViT's view is already global at the first layer: any patch can attend to any other immediately. A CNN waits many layers for that reach.
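To put a rough number on "many layers", consider an idealized CNN built only from stride-1 3×3 convolutions, with no pooling or dilation. That is an assumption made for this sketch: real CNNs use strides and pooling, so their receptive fields grow considerably faster. The function names are hypothetical.

```python
import math

def receptive_field(num_layers, kernel=3):
    """Receptive-field width after stacking stride-1, undilated convolutions."""
    return 1 + num_layers * (kernel - 1)

def layers_to_cover(side, kernel=3):
    """Fewest such layers before one output unit spans the full image width."""
    return math.ceil((side - 1) / (kernel - 1))

print(receptive_field(1))    # 3
print(receptive_field(10))   # 21
print(layers_to_cover(224))  # 112
```

Under these simplified assumptions, each layer widens the view by only two pixels, while a single self-attention layer connects all 196 patches at once.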
Legacy — and the catch
Legacy:
- Unified vision and language under one architecture
- Scales beautifully: more data and compute keep paying off
- Backbone of the multimodal era (CLIP and beyond)
The catch:
- Data-hungry: below huge pretraining scale, CNNs win
- Quadratic attention caps resolution; high-res needs tricks
- Patches are a crude tokenization: fine detail can fall between them
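The resolution limit in the list above follows from the same patch arithmetic run in reverse. A small sketch, assuming a fixed 16×16 patch and square inputs (the helper `num_tokens` and the chosen resolutions are illustrative, not taken from the paper):

```python
def num_tokens(side, patch=16):
    """Patch tokens for a square image, ignoring the extra [CLS] token."""
    return (side // patch) ** 2

for side in (224, 384, 512, 1024):
    n = num_tokens(side)
    print(side, n, n * n)
# 224 196 38416
# 384 576 331776
# 512 1024 1048576
# 1024 4096 16777216
```

Doubling the image side quadruples the token count and multiplies the attention map by sixteen, which is why high-resolution inputs call for workarounds.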
Read the original: arXiv:2010.11929. For the machinery under the hood, see Transformer architecture — and for the worldview ViT displaced, Why CNNs and CNN architecture. Next paper: CLIP (2021).