Seq2Seq + Attention (2014) — Machines Learn to Translate

The world before this paper

In 2013, machine translation was an assembly line, not a brain. A statistical system chopped the source sentence into phrases, looked each one up in a vast phrase table, ranked candidates with separate alignment models and language models, and stitched the winners together. Every component was engineered, trained and tuned on its own. Neural networks were storming vision and speech, but translation had an awkward shape — a sequence in, a different-length sequence out — and nobody had a clean way to learn that mapping in one piece.

Statistical MT: a pipeline of parts

Phrase tables, alignment models, language models — each engineered and tuned separately, then bolted together with hand-set weights.

RNNs in 2013: no recipe

Recurrent nets could read a sequence step by step, but mapping a variable-length input to a variable-length output end to end had no clean recipe.

End-to-end MT: never competitive

No single network had ever produced competitive translations on its own — earlier neural models only rescored the outputs of statistical systems. It wasn't obvious it was even possible.

The key idea

At Google in 2014, Ilya Sutskever, Oriol Vinyals and Quoc Le made a blunt bet: delete the pipeline. Their NeurIPS 2014 paper, "Sequence to Sequence Learning with Neural Networks", trained one model, end to end, on nothing but sentence pairs. The recipe was two 4-layer LSTMs glued back to back. An encoder reads the English sentence and folds everything it understood into one fixed-size vector. A decoder takes that vector and unrolls the French, one word at a time, until it emits an end-of-sentence token. The strangest detail in the paper: feeding the source sentence in reverse order made optimization markedly easier (the paper reports test BLEU rising from 25.9 to 30.6), because the first words of the input and the output now sat close together.
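The data flow described above can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: a plain tanh RNN stands in for the paper's 4-layer LSTMs, the weights are random rather than trained, decoding is greedy rather than the paper's beam search, and the names `ToySeq2Seq`, `EOS`, the vocabulary size and the hidden size are all made up for the example. What it does show faithfully is the shape of the idea: the source is read in reverse, and the encoder's output has the same size whatever the sentence length.

```python
import math
import random

EOS = 0  # hypothetical end-of-sentence token id, also used as the start symbol


def rand_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_step(w_in, w_rec, x, h):
    """One plain tanh RNN step (standing in for an LSTM cell)."""
    a = matvec(w_in, x)
    b = matvec(w_rec, h)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]


def one_hot(token, vocab_size):
    return [1.0 if i == token else 0.0 for i in range(vocab_size)]


class ToySeq2Seq:
    def __init__(self, vocab_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.enc_in = rand_matrix(hidden_size, vocab_size, rng)
        self.enc_rec = rand_matrix(hidden_size, hidden_size, rng)
        self.dec_in = rand_matrix(hidden_size, vocab_size, rng)
        self.dec_rec = rand_matrix(hidden_size, hidden_size, rng)
        self.out = rand_matrix(vocab_size, hidden_size, rng)

    def encode(self, source_tokens):
        """Read the source in reverse; return the final hidden state."""
        h = [0.0] * self.hidden_size
        for token in reversed(source_tokens):
            x = one_hot(token, self.vocab_size)
            h = rnn_step(self.enc_in, self.enc_rec, x, h)
        return h  # same length no matter how long the source was

    def decode(self, context, max_len=10):
        """Greedy decoding: emit the top-scoring token until EOS or max_len."""
        h, prev, output = context, EOS, []
        for _ in range(max_len):
            x = one_hot(prev, self.vocab_size)
            h = rnn_step(self.dec_in, self.dec_rec, x, h)
            scores = matvec(self.out, h)
            prev = max(range(self.vocab_size), key=scores.__getitem__)
            output.append(prev)
            if prev == EOS:
                break
        return output


model = ToySeq2Seq(vocab_size=8, hidden_size=4)
short_vec = model.encode([3, 5])
long_vec = model.encode([3, 5, 2, 7, 1, 6, 4, 2, 5, 3])
translation = model.decode(long_vec)
```

With untrained weights the "translation" is meaningless; the point is that `short_vec` and `long_vec` are both 4 numbers long, which is exactly the squeeze the next paragraph is about.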

There was a catch, and it was structural. Every sentence, 5 words or 50, had to squeeze through that same fixed-size vector. On the simpler encoder–decoders of the day, quality visibly sagged as sentences grew; Sutskever's reversal trick masked the problem, but the ceiling was real. That same year in Montreal, Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio published the fix: "Neural Machine Translation by Jointly Learning to Align and Translate" (2014). Instead of trusting one summary, their decoder computes, for every output word, a weighted average over all the encoder's states. They called it attention. The weights come from a small network that scores each encoder state against the decoder's current state, and a softmax normalizes those scores so they sum to one. The bottleneck disappears, long-sentence quality stops degrading, and as a bonus the weights, drawn as a grid, look like soft word alignments: interpretability for free.
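A single attention step can be sketched as follows. This is a minimal illustration, not the authors' code: it uses the additive scoring form from the attention paper, score = v · tanh(W_dec s + W_enc h_j), but the function name `additive_attention`, the tiny 2-dimensional states and every parameter value are made up and untrained.

```python
import math


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(decoder_state, encoder_states, w_dec, w_enc, v):
    """Return (weights, context) for one decoding step.

    score_j = v . tanh(w_dec @ decoder_state + w_enc @ encoder_states[j])
    """
    projected_dec = matvec(w_dec, decoder_state)
    scores = []
    for h_j in encoder_states:
        projected_enc = matvec(w_enc, h_j)
        hidden = [math.tanh(a + b) for a, b in zip(projected_dec, projected_enc)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    size = len(encoder_states[0])
    # The context is a weighted average of the encoder states.
    context = [
        sum(w * h_j[k] for w, h_j in zip(weights, encoder_states))
        for k in range(size)
    ]
    return weights, context


# Three made-up encoder states (one per source word) and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [0.2, -0.1]
w_dec = [[0.3, -0.2], [0.1, 0.4]]  # hypothetical, untrained parameters
w_enc = [[0.5, 0.1], [-0.3, 0.2]]
v = [1.0, -1.0]

weights, context = additive_attention(decoder_state, encoder_states, w_dec, w_enc, v)
```

Here `weights` holds one number per source word and sums to one, and `context` is rebuilt from scratch at every output step. Stacking the `weights` rows for successive output words gives the alignment grid mentioned above.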

The paper in one sentence

An encoder RNN reads the source sentence into one fixed vector and a decoder RNN unrolls the translation from it; Bahdanau's attention then frees the decoder to glance back at every source word instead of relying on that single squeezed vector.

Want the full mechanics? See Seq2Seq mechanics.

The bottleneck appears, then disappears

The papers' whole arc: an encoder squeezes a sentence into one thin vector, a decoder unrolls the translation from it, and quality sags as sentences grow; then attention fans back across the source and flattens the curve.

The results that mattered

The headline wasn't that neural MT crushed the statistical systems. It was that a single network, trained end to end, matched machinery that had taken a decade of engineering. From there, the field's direction was hard to miss.

WMT'14 En→Fr BLEU: 34.8

Sutskever's ensemble of five reversed-input LSTMs edged past a phrase-based statistical baseline (33.3 BLEU on the same test set): pure neural translation, no phrase tables anywhere.

The bottleneck: 1 vector → n vectors

Attention let the decoder read all the encoder's states instead of one fixed-size summary. Long-sentence quality stopped degrading.

The countdown: 3 years

From Bahdanau's attention mechanism to "Attention Is All You Need". The patch outlived the architecture it was built to fix.

Legacy — and the catch

What it unlocked

First proof that end-to-end neural MT works at scale
Attention became a core building block of deep learning, and later the basis of the transformer
The encoder–decoder template still underlies translation and summarization

The limits

RNNs process tokens one at a time — training can't parallelize across the sequence
Even with attention, very long-range structure remained hard for RNNs
The transformer deleted the recurrence three years later
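
The first limit comes down to a data dependency, which a toy example can make concrete. The sketch below is an illustration only, using a scalar tanh RNN and plain products as stand-ins for real recurrent states and attention scores; the function names and weights are made up. It shows that each recurrent state needs the one before it, while attention-style scores are computed independently per position.

```python
import math


def rnn_states(inputs, w_in=0.5, w_rec=0.9):
    """Scalar tanh RNN: each state needs the previous one, so the loop is serial."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def dot_scores(query, keys):
    """Attention-style scores: each one depends only on its own key."""
    return [query * k for k in keys]


inputs = [1.0, -2.0, 0.5, 3.0]
perturbed_inputs = [0.0] + inputs[1:]  # change only the first position

# Changing the first input changes every later RNN state...
changed_states = sum(
    a != b for a, b in zip(rnn_states(inputs), rnn_states(perturbed_inputs))
)

# ...but changes only one of the independent attention-style scores.
changed_scores = sum(
    a != b
    for a, b in zip(dot_scores(2.0, inputs), dot_scores(2.0, perturbed_inputs))
)
```

Because every score in `dot_scores` stands alone, all positions could be computed at once on parallel hardware; the chain inside `rnn_states` cannot be broken up that way.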

Go deeper

Read the originals: arXiv:1409.3215 and the attention paper, arXiv:1409.0473. For the concepts behind it, see Seq2Seq mechanics, Machine translation, Attention mechanism, and LSTM.