Attention Is All You Need (2017) — The Transformer
The world before this paper
In 2017, the best translation systems read one word at a time. State-of-the-art models were recurrent networks with attention bolted on the side — and recurrence has a built-in speed limit. Token 50 cannot be processed until token 49 is done, so even a rack of GPUs ends up waiting in line behind a single sequence. The field was winning benchmarks and losing weeks.
RNNs with attention held the state of the art in translation — but they process tokens one at a time, so training can't parallelize across the sequence.
Information between distant tokens had to survive many recurrent steps. The further apart two words sat, the weaker their connection arrived.
Training the best models took weeks of GPU time, and sequence length itself was the bottleneck nobody could engineer around.
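The speed limit described above is visible in a few lines of code. This is a minimal sketch of a plain tanh RNN, used here as a stand-in for the LSTM- and GRU-style networks that real translation systems ran; the function name `rnn_forward`, the weight names and the sizes are all illustrative assumptions, not anything taken from the paper.

```python
import numpy as np

def rnn_forward(x, W_x, W_h):
    """Run a plain tanh RNN over one sequence.

    x: (seq_len, d_in) inputs. Returns (seq_len, d_hidden) hidden states.
    """
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x:
        # Step t cannot start until step t-1 has produced h,
        # so this loop cannot be spread across the sequence.
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, d_in, d_hidden = 50, 16, 32          # illustrative sizes
x = rng.normal(size=(seq_len, d_in))
W_x = rng.normal(size=(d_hidden, d_in)) * 0.1
W_h = rng.normal(size=(d_hidden, d_hidden)) * 0.1

states = rnn_forward(x, W_x, W_h)
print(states.shape)  # (50, 32)
```

The `for` loop is the whole problem: the 50th hidden state needs the 49th, and anything token 1 contributes to token 50 has to pass through every `tanh` in between.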
The key idea
Eight researchers at Google — Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser & Polosukhin, in "Attention Is All You Need", NeurIPS 2017 — made a bet that sounded reckless. Attention had always been the side dish: a helper module that let an RNN peek back at the source sentence. They threw out the RNN and kept the side dish. In their architecture, every token builds a query, compares it against every other token's key, and pulls in the matching values — all in one matrix operation a GPU can chew through in parallel. Stack that into 6 encoder and 6 decoder layers, run 8 attention heads side by side, and add feed-forward blocks between them.
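The query, key and value step can be sketched as a single head of self-attention in NumPy. The scaling by the square root of the key dimension and the softmax follow the paper's formula rather than the description above; the random weights, the helper names and the 5-token input are assumptions made purely for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (n_tokens, d_model). Returns (output, attention weights).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n): every query against every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # each output is a weighted mix of values

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 512, 64      # 512 / 8 heads = 64 per head
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) / np.sqrt(d_model) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (5, 64) (5, 5)
```

There is no loop over tokens anywhere: all pairwise comparisons happen inside `Q @ K.T`, which is exactly the kind of dense matrix product a GPU parallelizes well. A full layer would run 8 such heads with separate weights and concatenate their outputs.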
One snag: attention is order-blind. Shuffle the words and each token gets exactly the same output as before, only in the shuffled order. So the authors injected sinusoidal positional encodings, a gentle numeric watermark added to each token's embedding that puts word order back. That was the whole trick. No recurrence, no convolution. Hence the title.
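The watermark itself is short enough to write out. The sketch below follows the paper's sine and cosine formula with its base of 10000; the function name `sinusoidal_positions` and the assumption that `d_model` is even are choices made for this example.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """Return an (n_positions, d_model) table of encodings; d_model assumed even."""
    pos = np.arange(n_positions)[:, None]               # (n, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)   # one wavelength per dimension pair
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positions(n_positions=50, d_model=512)

# The same word embedding at positions 3 and 7 becomes distinguishable
# once the position row is added to it.
word = np.ones(512)
print(np.allclose(word + pe[3], word + pe[7]))  # False
```

In use, the table is simply added to the token embeddings before the first layer, so every later attention step sees "which word" and "where" in one vector.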
Delete the recurrence and keep only attention: every token attends to every other token in one parallel operation, stacked into layers with feed-forward blocks and multiple heads, with positional encodings to restore word order.
Want the full mechanics? See Transformer architecture.
Watch the bottleneck disappear
Animation: the same sentence handled the old way and the new way. The RNN ticks through it token by token; then attention wires every pair at once, heads specialize, and the training clock collapses.
The results that mattered
A radical architecture paper lives or dies on its results table. This one delivered three results the field couldn't argue with.
New state of the art on WMT 2014 English→German (28.4 BLEU) and English→French (41.8 BLEU, single model), the headline that made everyone read past the title.
Trained on 8 P100 GPUs in 3.5 days — a fraction of the weeks the RNN systems it beat had needed.
Any token reaches any other in a single attention hop, no matter the distance. Long-range dependencies stop being a special case.
Legacy — and the catch
- Unlocked massive parallel pretraining — the precondition for LLMs
- One architecture swallowed NLP, then vision, audio and code
- Attention maps gave a window into what the model attends to
- Attention is O(n²) in sequence length — long contexts get expensive
- Needs positional encodings — order isn't native
- Data- and compute-hungry compared to its inductive-bias-rich predecessors
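The quadratic cost in the list above is easy to make concrete with a little arithmetic. This sketch counts attention-score entries for one sequence, assuming one attention sublayer per layer, the 8 heads and 6 layers mentioned earlier, and 4-byte float32 scores. It is a deliberate simplification: a real decoder has two attention sublayers per layer, and implementations need not hold every score in memory at once.

```python
def attention_score_entries(seq_len, n_heads=8, n_layers=6):
    """Count attention scores for one sequence: n x n per head per layer."""
    return seq_len * seq_len * n_heads * n_layers

for n in (512, 4096, 32768):
    entries = attention_score_entries(n)
    mib = entries * 4 / 2**20          # assumes 4 bytes per score (float32)
    print(f"n={n:>6}: {entries:>16,} scores, ~{mib:,.0f} MiB")
```

Each 8x increase in sequence length multiplies the count by 64, which is why long contexts get expensive so quickly.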
Read the original: arXiv:1706.03762. This is the architecture behind BERT, GPT, ViT, CLIP and most large models since. For the mechanics behind the story, see Transformer architecture, Multi-head attention, Positional encoding and Encoder–decoder transformers. Next paper: BERT (2018).