Attention Is All You Need (2017) — The Transformer

The world before this paper

In 2017, the best translation systems read one word at a time. State-of-the-art models were recurrent networks with attention bolted on the side — and recurrence has a built-in speed limit. Token 50 cannot be processed until token 49 is done, so even a rack of GPUs ends up waiting in line behind a single sequence. The field was winning benchmarks and losing weeks.

Sequential by design: no parallelism

RNNs with attention held the state of the art in translation — but they process tokens one at a time, so training can't parallelize across the sequence.

Fragile memory: long-range pain

Information between distant tokens had to survive many recurrent steps. The further apart two words sat, the weaker their connection arrived.

Expensive: weeks of GPU time

Training the best models took weeks of GPU time, and sequence length itself was the bottleneck nobody could engineer around.
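The sequential bottleneck and the fading long-range signal described above can both be seen in a toy recurrence. This is a minimal sketch, assuming a scalar hidden state and made-up weights (`w_h`, `w_x` are illustrative, not values from any real system): each step needs the previous step's state, so the loop cannot be spread across the sequence, and the trace of the first input shrinks with every step.

```python
import math

def rnn_forward(inputs, w_h=0.5, w_x=1.0):
    """Toy scalar recurrence: h_t = tanh(w_h * h_{t-1} + w_x * x_t).

    w_h and w_x are hypothetical weights chosen only for illustration.
    """
    h = 0.0
    states = []
    for x in inputs:
        # Step t depends on h from step t-1, so steps must run in order.
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

# A single "signal" at position 0, followed by silence.
states = rnn_forward([1.0, 0.0, 0.0, 0.0, 0.0])
for t, h in enumerate(states):
    print(t, round(h, 4))
```

In this toy setup the printed state gets smaller at every step: by the last position, only a faint echo of the first token remains.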

The key idea

Eight researchers at Google (Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser & Polosukhin, in "Attention Is All You Need", NeurIPS 2017) made a bet that sounded reckless. Attention had always been the side dish: a helper module that let an RNN peek back at the source sentence. They threw out the RNN and kept the side dish. In their architecture, every token builds a query, compares it against the key of every token in the sequence (itself included), and pulls in a weighted mix of the matching values, all in one matrix operation a GPU can chew through in parallel. Stack that into 6 encoder and 6 decoder layers, run 8 attention heads side by side in the base model, and follow each attention block with a feed-forward block.
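The query/key/value step above is the paper's scaled dot-product attention, softmax(QKᵀ/√d_k)V. Here is a minimal sketch in plain Python, under some loud assumptions: the toy vectors are made up, the same vectors serve as queries, keys and values (a real Transformer first passes the tokens through learned linear projections), and explicit loops stand in for the single matrix product a GPU would run.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Returns (outputs, weights); weights[i][j] is how much token i
    attends to token j.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this token's query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Pull in the values, weighted by how well each key matched.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-d tokens (hypothetical data, no learned projections).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

No row of the computation depends on another row's result, which is why the whole thing collapses into one parallel matrix operation in practice.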

One snag: attention is order-blind. Shuffle the words and every token gets exactly the same output as before, only in the shuffled order. So the authors added sinusoidal positional encodings to the token embeddings: a gentle numeric watermark on each token that puts word order back. That was the whole trick. No recurrence, no convolution. Hence the title.
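The sinusoidal watermark follows the paper's formula: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch, assuming a tiny `d_model` of 8 for readability (the function name and sizes here are illustrative choices):

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal encodings: even dimensions use sin, odd dimensions use cos.

    Dimensions 2i and 2i+1 share the same wavelength.
    """
    pe = []
    for pos in range(num_positions):
        row = []
        for dim in range(d_model):
            pair = dim // 2  # index i of the (sin, cos) pair
            angle = pos / (10000 ** (2 * pair / d_model))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(num_positions=4, d_model=8)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Each position gets a distinct vector, which is added to that token's embedding before the first layer, so two identical words at different positions no longer look the same to attention.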

The paper in one sentence

Delete the recurrence and keep only attention: every token attends to every other token in one parallel operation, stacked into layers with feed-forward blocks and multiple heads, with positional encodings to restore word order.

Want the full mechanics? See Transformer architecture.

Watch the bottleneck disappear

Animation: the same sentence handled the old way and the new way. The RNN ticks through it token by token, then attention wires every pair at once, heads specialize, and the training clock collapses.

The results that mattered

A radical architecture paper lives or dies on its results table. This one delivered three numbers the field couldn't argue with.

Translation quality: BLEU 28.4 / 41.8

28.4 BLEU on WMT 2014 English→German and 41.8 on English→French, both new state-of-the-art scores (the French result for a single model). This was the headline that made everyone read past the title.

Training cost: 3.5 days, 8 GPUs

The big model trained on 8 P100 GPUs in 3.5 days, a small fraction of the GPU time the RNN systems it beat had needed (GNMT: ~9 days on 96 GPUs).

Token distance: path length 1

Any token reaches any other in a single attention hop, no matter the distance. Long-range dependencies stop being a special case.
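The training-cost comparison above comes down to simple arithmetic. A rough sketch using only the figures quoted in this article; raw GPU-days ignore differences in hardware between the two systems, so treat the ratio as an illustration, not a like-for-like measurement (the paper itself compares estimated floating-point operations).

```python
# Figures as quoted in the text above; the GNMT number is approximate.
transformer_gpu_days = 8 * 3.5   # 8 GPUs for 3.5 days
gnmt_gpu_days = 96 * 9           # 96 GPUs for ~9 days

ratio = gnmt_gpu_days / transformer_gpu_days
print(f"Transformer: {transformer_gpu_days} GPU-days")
print(f"GNMT:        {gnmt_gpu_days} GPU-days")
print(f"Ratio:       about {ratio:.0f}x")
```

On these numbers the gap is roughly thirtyfold.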

Legacy — and the catch

What it unlocked

Massively parallel pretraining, the precondition for LLMs
One architecture swallowed NLP, then vision, audio and code
Attention maps gave a window into what the model attends to

The limits

Self-attention costs O(n²) time and memory in sequence length n, so long contexts get expensive
Needs positional encodings — order isn't native
Data- and compute-hungry compared to its inductive-bias-rich predecessors
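The quadratic limit is easy to make concrete: full self-attention scores every query against every key, so one head in one layer computes n × n scores. A small sketch (the helper name and the sequence lengths are illustrative choices):

```python
def attention_score_count(n):
    """Query-key scores computed by one full self-attention head."""
    return n * n

for n in (512, 1024, 2048):
    print(n, attention_score_count(n))
# 512 262144
# 1024 1048576
# 2048 4194304
```

Doubling the context length quadruples the number of scores, which is why long contexts get expensive so quickly.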

Go deeper

Read the original: arXiv:1706.03762. This is the architecture behind BERT, GPT, ViT, CLIP and most large models since. For the mechanics behind the story, see Transformer architecture, Multi-head attention, Positional encoding and Encoder–decoder transformers. Next paper: BERT (2018).