ViT (2020) — An Image Is Worth 16×16 Words

The world before this paper

In 2020, computer vision had a ruling dynasty: the convolutional neural network. The landmark vision models of the deep learning decade (AlexNet, ResNet, EfficientNet) were all CNNs. Convolutions came with beliefs about images built into the wiring: nearby pixels matter most, and a cat in the corner is the same cat in the center. Those priors seemed not just useful but necessary. Meanwhile, over in language, transformers had taken everything. Vision researchers noticed, but the obvious move looked impossible.

The CNN monopoly: priors built in

Locality and translation equivariance (shift the input and the feature map shifts with it) were baked into the architecture itself, and the field assumed images could not be learned without them.

Attention as garnish: a sprinkle on convs

Self-attention appeared in vision only as an add-on — a block bolted onto a convolutional backbone, never the main act.

The quadratic wall: pixels² explodes

Transformers had eaten NLP, but attention cost grows with the square of the sequence length. Even a modest 224×224 image carries about 50,000 pixels, and larger photos run into the millions.
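To put a number on the wall, here is a back-of-the-envelope sketch. It assumes every pixel is one token and counts one attention score per ordered pair of tokens, for a single head in a single layer. The function name `pixel_attention_scores` is made up for illustration.

```python
def pixel_attention_scores(height: int, width: int) -> int:
    """Query-key scores needed if every pixel were its own token."""
    seq_len = height * width   # one token per pixel
    return seq_len * seq_len   # every token attends to every token


for side in (32, 224, 512):
    n = side * side
    scores = pixel_attention_scores(side, side)
    print(f"{side}x{side}: {n:,} pixel tokens -> {scores:,} scores")
```

Doubling the side length quadruples the token count and multiplies the score count by sixteen, which is why pixel-level attention was a non-starter.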

The key idea

A team at Google Brain — Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR 2021 — made a deliberately stubborn bet: change nothing about the transformer, and change the image instead. If pixels are too many to attend over, stop attending over pixels. Cut the image into 16×16 squares and treat each square the way NLP treats a word. A 224×224 photo collapses into just 196 patch tokens — a sentence-sized sequence. Add a learnable [CLS] token to soak up the summary, put a small classification head on top of it, and let a plain encoder do everything else. No convolutions. Anywhere.
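The cutting step is little more than a reshape. A minimal NumPy sketch, assuming an image stored as a (height, width, channels) array whose sides are multiples of the patch size. The helper name `patchify` is ours, not the paper's.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C), ordered
    row by row starting from the top-left corner.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("height and width must be multiples of the patch size")
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * cols, patch * patch * c)


image = np.zeros((224, 224, 3), dtype=np.float32)  # stand-in for a real photo
tokens = patchify(image)
print(tokens.shape)  # (196, 768)
```

Each of the 196 rows holds the 16 × 16 × 3 = 768 raw values of one patch, the unembedded "word" that the rest of the model consumes.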

The deeper wager was philosophical: maybe convolution's beloved priors aren't laws of vision — just training wheels. Given enough data, a transformer should learn locality on its own, without ever being limited by it.

The paper in one sentence

Don't attend over pixels — slice the image into 16×16 patches, linearly embed each patch as a "word", add position embeddings, and feed the sequence to a completely standard transformer encoder.
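The "linearly embed, then add positions" step can be sketched in a few lines of NumPy. This is an illustration under assumptions: the weights are random stand-ins for parameters a real model would learn, the width `d_model = 768` is an assumed choice, and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# 196 patches of 16 x 16 RGB values; d_model is an assumed embedding width.
num_patches, patch_dim, d_model = 196, 16 * 16 * 3, 768

# Stand-in for the flattened patches of one image.
patches = rng.standard_normal((num_patches, patch_dim)).astype(np.float32)

# Parameters that training would learn; random here.
w_embed = rng.standard_normal((patch_dim, d_model)).astype(np.float32) * 0.02
b_embed = np.zeros(d_model, dtype=np.float32)
pos_embed = rng.standard_normal((num_patches, d_model)).astype(np.float32) * 0.02

# One shared linear projection turns every patch into a "word" vector;
# a per-position vector is then added so the model can tell patches apart.
tokens = patches @ w_embed + b_embed + pos_embed
print(tokens.shape)  # (196, 768)
```

In the full model the position table has one extra row for the [CLS] token, which this sketch leaves out.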

Want the full mechanics? See Transformer architecture.

Watch an image become a sentence

Follow one photo through ViT: it gets diced into patches, the patches unroll into position-stamped tokens, a [CLS] token joins the front, attention wires every patch to every other from the very first layer — and the head reads out the answer.
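The same journey, compressed into a toy forward pass. This is a heavily simplified sketch, not the paper's model: one single-head attention layer, no MLP block or layer norm, random weights throughout, and small assumed sizes (`d_model = 64`, 10 classes).

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d_model, num_classes = 196, 64, 10  # small, assumed sizes


def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


# Stand-in for patch embeddings with position information already added.
patch_tokens = rng.standard_normal((num_patches, d_model))

# Prepend the [CLS] token (a learned vector in a real model; random here).
cls_token = rng.standard_normal((1, d_model))
x = np.concatenate([cls_token, patch_tokens], axis=0)  # (197, d_model)

# One single-head self-attention layer with random weights.
w_q, w_k, w_v = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)
q, k, v = x @ w_q, x @ w_k, x @ w_v
attn = softmax(q @ k.T / np.sqrt(d_model))  # (197, 197): every token to every token
x = x + attn @ v  # residual connection

# The classification head reads only the [CLS] position.
w_head = rng.standard_normal((d_model, num_classes)) / np.sqrt(d_model)
logits = x[0] @ w_head
print(attn.shape, logits.shape)  # (197, 197) (10,)
```

The 197 × 197 attention matrix is the "wiring": row i holds how much token i draws from every other token, and that is already true in this first and only layer.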

The results that mattered

ViT's results split cleanly along one axis: data. With modest data, the convolution-free model loses — it has to relearn from scratch what CNNs get for free. With enormous data, it pulls ahead, and does so using less pretraining compute than its rivals.

Sequence length: 196 tokens

A whole 224×224 image becomes just 196 patch tokens — a sequence short enough that quadratic attention is entirely affordable.
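The arithmetic behind that number is short enough to check by hand, or in a few lines of Python. The helper name `num_patch_tokens` is ours, and the count ignores the extra [CLS] token.

```python
def num_patch_tokens(height: int, width: int, patch: int) -> int:
    """Tokens produced by non-overlapping patch x patch tiles (no [CLS])."""
    return (height // patch) * (width // patch)


pixels = 224 * 224                        # tokens if every pixel were a token
patches = num_patch_tokens(224, 224, 16)  # 14 x 14 tiles

print(patches)                       # 196
print(patches ** 2)                  # 38416
print(pixels ** 2 // patches ** 2)   # 65536
```

Shrinking the sequence by a factor of 256 (one token per 16 × 16 tile) shrinks the attention matrix by 256², which is where the 65,536-fold saving comes from.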

The crossover: 300M images

Trained only on ImageNet, ViT lands a few percentage points below ResNets of comparable size. Pretrained on JFT-300M, it matches or beats state-of-the-art CNNs while using less pretraining compute.

Receptive field: layer 1

ViT's view is already global at the first layer: any patch can attend to any other immediately. A CNN waits many layers for that reach.
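How long is "many layers"? A rough sketch using the standard receptive-field recurrence for stacked convolutions. The stack of plain 3×3, stride-1 layers is a hypothetical worst case; real CNNs add strides and pooling, which widen the view much faster.

```python
def receptive_field(layers: list[tuple[int, int]]) -> int:
    """Receptive field (input pixels along one axis) of stacked convolutions.

    Each layer is a (kernel_size, stride) pair, applied in order.
    """
    rf, jump = 1, 1  # jump = input-pixel distance between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Hypothetical stack of plain 3x3, stride-1 convolutions.
for depth in (1, 10, 50, 112):
    print(depth, receptive_field([(3, 1)] * depth))
```

Under that assumption the receptive field grows by only two pixels per layer, so it takes 112 layers before a single output position spans the width of a 224-pixel image. A ViT patch token has that reach after one attention layer.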

Legacy — and the catch

What it unlocked

Unified vision and language under one architecture
Scales beautifully — more data and compute keep paying off
Backbone of the multimodal era (CLIP and beyond)

The limits

Data-hungry: without large-scale pretraining, comparable CNNs win in the paper's experiments
Quadratic attention caps resolution; high-res needs tricks
Patches are a crude tokenization — fine detail can fall between them
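The last two limits pull against each other: finer patches or higher resolution both mean more tokens, and attention pays for tokens quadratically. A small sketch of that trade-off, counting scores per head per layer for square images and ignoring [CLS]. The function name `attention_cost` is ours.

```python
def attention_cost(side: int, patch: int) -> tuple[int, int]:
    """Return (tokens, pairwise scores) for a square image, ignoring [CLS]."""
    tokens = (side // patch) ** 2
    return tokens, tokens * tokens


for side, patch in [(224, 16), (448, 16), (896, 16), (224, 8)]:
    tokens, scores = attention_cost(side, patch)
    print(f"{side}px image, {patch}px patches: {tokens:,} tokens, {scores:,} scores")
```

Halving the patch size costs exactly as much as doubling the resolution: four times the tokens and sixteen times the attention scores.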

Go deeper

Read the original: arXiv:2010.11929. For the machinery under the hood, see Transformer architecture — and for the worldview ViT displaced, Why CNNs and CNN architecture. Next paper: CLIP (2021).