BERT (2018) — Reading in Both Directions

The world before this paper

In 2018, language models read with one eye covered. The best of them scanned text strictly left to right: perfect for predicting the next word, awkward for understanding a finished sentence. But meaning rarely flows one way. The word "bank" only settles on a meaning after you see what follows it. The field had pieces of an answer; nobody had a model that read deeply in both directions at once.

One-way reading left → right

GPT-1 read strictly forward. Fine for generating text — but understanding wants context from both sides of every word.

ELMo shallow glue

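It bolted two one-directional models together. Technically bidirectional, but only skin-deep: the forward and backward models never see each other's context inside the network, and their outputs are simply concatenated afterwards.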

Task silos scratch training

Every NLP task still had its own architecture, trained mostly from scratch on whatever scarce labeled data existed.
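
To make the one-way limitation concrete, here is a minimal sketch of which positions a token is allowed to look at under each reading style. The `visibility` helper and the example sentence are illustrative assumptions, not code from GPT-1 or BERT; real models apply the same idea as an attention mask over subword tokens.

```python
def visibility(n_tokens, bidirectional):
    """Return an n x n grid; grid[i][j] is True if position i may look at position j."""
    return [
        [bidirectional or j <= i for j in range(n_tokens)]
        for i in range(n_tokens)
    ]

# Hypothetical example sentence; the clue for "bank" comes after it.
tokens = ["she", "sat", "by", "the", "bank", "fishing"]
i = tokens.index("bank")

causal = visibility(len(tokens), bidirectional=False)  # left-to-right reading
both = visibility(len(tokens), bidirectional=True)     # reading both ways

seen_causal = [t for t, ok in zip(tokens, causal[i]) if ok]
seen_both = [t for t, ok in zip(tokens, both[i]) if ok]

print(seen_causal)  # ['she', 'sat', 'by', 'the', 'bank']
print(seen_both)    # ['she', 'sat', 'by', 'the', 'bank', 'fishing']
```

In the left-to-right case the representation of "bank" is built without ever seeing "fishing", which is exactly the context that disambiguates it.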

The key idea

In late 2018, four researchers at Google (Devlin, Chang, Lee and Toutanova) posted "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", later published at NAACL 2019. They stared down the obvious objection: you can't just run a normal language model in both directions. Stack a few bidirectional layers and every word can peek at itself through the back door. Prediction becomes copying. Everyone knew this, which is why everyone settled for one-way models or shallow glue.

Their bet was to change the game instead of the model. Make the text a puzzle: hide 15% of the tokens behind a [MASK], and force a deep transformer encoder to reconstruct them using everything else — left and right at once. No peeking possible, because the answer is gone. Add a second drill (guess whether sentence B really follows sentence A), feed it BooksCorpus plus English Wikipedia — roughly 3.3 billion words of free, unlabeled text — and let two sizes soak it up: BERT-base at 110M parameters, BERT-large at 340M. Then the second half of the bet: that one pretrained network, plus a tiny task-specific output layer, would beat every hand-crafted architecture in the field.
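
The masking puzzle can be sketched in a few lines. This is a simplified illustration, not the authors' code: `mask_tokens`, the toy vocabulary, and the whitespace tokenization are assumptions made for the example. One detail comes from the BERT paper rather than the text above: of the 15% of positions chosen for prediction, about 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_rate=0.15, rng=None):
    """Pick ~mask_rate of positions as prediction targets and corrupt the input.

    Returns (corrupted_tokens, targets), where targets maps position -> original token.
    """
    rng = rng or random.Random()
    n_pick = max(1, round(len(tokens) * mask_rate))
    picked = rng.sample(range(len(tokens)), n_pick)

    corrupted = list(tokens)
    targets = {}
    for i in picked:
        targets[i] = tokens[i]               # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK              # ~80%: hide behind [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab) # ~10%: swap in a random token
        # remaining ~10%: keep the original token in place
    return corrupted, targets

# Toy input: 20 whitespace tokens (real BERT uses WordPiece subwords).
tokens = ("the man went to the store to buy a gallon of milk "
          "after work on a very cold friday night").split()
vocab = sorted(set(tokens))

corrupted, targets = mask_tokens(tokens, vocab, rng=random.Random(0))
print(" ".join(corrupted))
print(targets)
```

The loss is computed only at the positions in `targets`, so the model gets no credit for copying the tokens it can already see.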

The paper in one sentence

Pretrain a deep transformer encoder by hiding 15% of the tokens and predicting them from both directions, then fine-tune that single network for any task by attaching a small output head.
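
To give a sense of how small that output head is, here is a rough sketch of a two-class classification head in plain Python. The hidden size of 768 for BERT-base and the use of the first ([CLS]) token's final vector as the sentence summary come from the BERT paper, not from the text above; `init_head`, `classify`, and the random stand-in vector are hypothetical names and data for illustration only.

```python
import math
import random

HIDDEN = 768      # hidden size of BERT-base, per the paper
NUM_LABELS = 2    # e.g. a binary sentiment task (assumed for this example)

def init_head(hidden, num_labels, rng):
    """Create a freshly initialized linear layer: one weight row per label."""
    weights = [[rng.gauss(0.0, 0.02) for _ in range(hidden)]
               for _ in range(num_labels)]
    bias = [0.0] * num_labels
    return weights, bias

def classify(cls_vector, weights, bias):
    """Linear layer followed by a numerically stable softmax."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
weights, bias = init_head(HIDDEN, NUM_LABELS, rng)

# Stand-in for the encoder's output vector; a real run would use BERT here.
cls_vector = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN)]
probs = classify(cls_vector, weights, bias)

head_params = HIDDEN * NUM_LABELS + NUM_LABELS
encoder_params = 110_000_000  # BERT-base, as quoted above
print(head_params)  # 1538
print(f"head is {head_params / encoder_params:.6%} of the encoder")
```

Only the head is new; in the paper's setup the pretrained encoder weights are also updated during fine-tuning, so the head adds almost nothing to the parameter count while the whole network adapts to the task.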

Want the full mechanics? See BERT mechanics.

The results that mattered

The numbers landed like a verdict. One pretrained encoder, lightly fine-tuned, swept benchmarks that had each been someone's specialty. On SQuAD v1.1 its best ensemble pushed test F1 to 93.2, past the human baseline of 91.2.

GLUE +7.7

An absolute jump over the previous state of the art, landing at 80.5%. Benchmarks usually move in fractions, not sevens.

One model 11 tasks

New state of the art on eleven NLP tasks simultaneously — one pretrained model, a thin head per task, no custom architectures.

Training signal 15%

The fraction of tokens selected for prediction. Nearly the whole training signal is filling blanks, with next-sentence prediction supplying the rest. It is a game any pile of raw text can supply.

Legacy — and the catch

What it unlocked
  • "Pretrain once, fine-tune cheaply" became NLP's standard recipe
  • Small labeled datasets suddenly sufficed
  • Encoder-only models still run search, ranking and classification at scale

The limits
  • Can't generate text — masks don't teach left-to-right writing
  • Next-sentence prediction turned out nearly useless (RoBERTa dropped it)
  • The fine-tune-per-task world later gave way to prompting one giant model

Within a year the descendants arrived: RoBERTa trained it harder, ALBERT shrank it, DistilBERT compressed it. Years on, encoder-only transformers still quietly power search boxes and classifiers everywhere. The deeper legacy is the recipe itself: transfer learning stopped being a vision trick and became how all of NLP works.

Go deeper

Read the original: arXiv:1810.04805. For the machinery behind the story, see BERT mechanics, transfer learning, and subword tokenization (BPE). Next paper: GPT-3 (2020).