RAG (2020) — Look It Up Before You Answer

The world before this paper

In 2020, everything a language model knew was frozen the moment training stopped. Ask a factual question and the model answered from memory alone: no books, no search, no sources. When the memory was right, it looked like magic. When it was wrong, it looked exactly the same. The field had built fluent talkers with sealed, unverifiable, slowly rotting memories.

Frozen memory: knowledge in weights

Everything the model "knew" was baked into its parameters at training time — frozen, impossible to inspect, and expensive to update.

Fluent but wrong: no citations

On knowledge-heavy questions, models answered confidently and incorrectly — with no way to point at a source for any claim.

Welded together: retrain to update

Changing a single fact meant retraining the whole model. World knowledge and language skill were fused into the same weights.

The key idea

In 2020, Patrick Lewis and a team at Facebook AI Research made a different bet: a model shouldn't have to memorize the library; give it a library card. They published it as Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (NeurIPS 2020), and wired together two existing pieces. DPR, a dense retriever, turns the question into a vector and finds the closest passages in an indexed Wikipedia. BART, a seq2seq generator, writes the answer while reading those passages. Crucially, the two are fine-tuned jointly: the question encoder learns to fetch whatever helps the generator answer well, while the passage encoder and the index stay fixed.
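The retrieval step can be sketched in a few lines. This is a minimal illustration, not the paper's code: the three-dimensional vectors below are hand-written stand-ins for encoder outputs, and the names `INDEX`, `inner_product`, and `retrieve` are hypothetical. What it does share with the paper is the scoring rule: passages are ranked by the inner product between question and passage vectors, and the top-k scores are turned into a probability for each passage with a softmax.

```python
import math

# Hypothetical toy index: (passage text, hand-written stand-in embedding).
# A real dense retriever would produce these vectors with a trained encoder.
INDEX = [
    ("The Eiffel Tower is in Paris.",   [0.9, 0.1, 0.0]),
    ("BART is a seq2seq transformer.",  [0.0, 0.2, 0.9]),
    ("Paris is the capital of France.", [0.8, 0.3, 0.1]),
]

def inner_product(u, v):
    """Similarity score between a question vector and a passage vector."""
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_vec, index, k=2):
    """Return the k best passages with a softmax weight over their scores."""
    scored = [(inner_product(query_vec, vec), text) for text, vec in index]
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    # Softmax over the top-k scores (shifted by the max for numerical stability).
    best = max(score for score, _ in top)
    weights = [math.exp(score - best) for score, _ in top]
    total = sum(weights)
    return [(text, w / total) for (_, text), w in zip(top, weights)]

# Stand-in for the encoded question "Where is the Eiffel Tower?"
query_vec = [1.0, 0.2, 0.0]
results = retrieve(query_vec, INDEX, k=2)
for text, weight in results:
    print(f"{weight:.3f}  {text}")
```

The generator then reads each retrieved passage alongside the question; the passage weights decide how much each one counts toward the final answer.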

The paper even shipped two flavors: RAG-Sequence, which conditions the whole answer on one retrieved passage at a time and then combines the per-passage results, and RAG-Token, which can switch sources token by token, drawing on one document mid-sentence and another for the next token. Want the full mechanics? See RAG mechanics.
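The difference between the two flavors comes down to where the sum over passages sits. The sketch below uses made-up probabilities for two passages and a two-token answer; `doc_prior`, `token_prob`, `rag_sequence`, and `rag_token` are hypothetical names, and the per-token probabilities are fixed numbers rather than outputs of a real generator. RAG-Sequence multiplies token probabilities within each passage and then sums over passages; RAG-Token sums over passages at every token and then multiplies.

```python
import math

# Hypothetical toy numbers: two retrieved passages, a two-token answer.
doc_prior = [0.6, 0.4]  # retriever weight for passages z0 and z1

# token_prob[z][i]: probability of the answer's i-th token given passage z.
token_prob = [
    [0.9, 0.2],  # z0 supports token 0 well, token 1 poorly
    [0.1, 0.8],  # z1 is the reverse
]

def rag_sequence(doc_prior, token_prob):
    """Score the whole answer under each passage, then mix the passages."""
    return sum(p_z * math.prod(probs) for p_z, probs in zip(doc_prior, token_prob))

def rag_token(doc_prior, token_prob):
    """Mix the passages at every token, then multiply across tokens."""
    n_tokens = len(token_prob[0])
    return math.prod(
        sum(p_z * probs[i] for p_z, probs in zip(doc_prior, token_prob))
        for i in range(n_tokens)
    )

seq_score = rag_sequence(doc_prior, token_prob)
tok_score = rag_token(doc_prior, token_prob)
print(f"RAG-Sequence: {seq_score:.4f}")
print(f"RAG-Token:    {tok_score:.4f}")
```

With these numbers the answer needs a different passage for each token, so the per-token mixture scores it higher than the per-sequence one. That is the situation RAG-Token is meant for; when a single passage covers the whole answer, the two flavors behave much more alike.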

The paper in one sentence

Separate knowledge from language: embed the question, let a dense retriever (DPR) pull the most relevant passages from a Wikipedia index, and have a generator (BART) write the answer conditioned on that evidence — trained end-to-end.

From closed book to open book

Watch one niche question travel through both worlds: first a bare model answering from frozen memory, then the same question embedded, matched against a document store, and answered with the receipts attached.

The results that mattered

This wasn't just a nicer architecture diagram. RAG set a new state of the art on three open-domain QA benchmarks, and human evaluators judged its generations more specific and more factual than those of a parametric-only BART baseline: same language skill, better memory.

The library: 21M passages

The dense Wikipedia index the model reads at answer time. Knowledge lives outside the weights, in a store you can inspect.

New SOTA: 3 QA benchmarks

Natural Questions, WebQuestions, CuratedTrec — RAG topped all three, beating models that relied on memorization alone.

Knowledge updates: 0 retraining

To teach the model new facts, swap the index. The weights never move — the world changes, the model keeps up.
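A toy version of the index swap makes the point concrete. Everything here is an assumption for illustration: the passages are invented, word overlap stands in for dense retrieval, and `answer` is a hypothetical placeholder for the frozen generator that simply quotes the best passage. The code that answers never changes between the two calls; only the index does.

```python
def tokenize(text):
    """Lowercase, strip basic punctuation, and split into a set of words."""
    return set(text.lower().replace("?", "").replace(".", "").split())

def top_passage(question, index):
    """Pick the passage sharing the most words with the question.
    Word overlap is a stand-in here for dense vector similarity."""
    q = tokenize(question)
    return max(index, key=lambda passage: len(q & tokenize(passage)))

def answer(question, index):
    """Placeholder for the frozen generator: it only quotes its evidence."""
    return f"According to the index: {top_passage(question, index)}"

# Two hypothetical snapshots of the same document store.
index_old = [
    "The latest release of the library is version 1.",
    "The library is written in Python.",
]
index_new = [
    "The latest release of the library is version 2.",
    "The library is written in Python.",
]

question = "What is the latest release?"
print(answer(question, index_old))
print(answer(question, index_new))
```

The same function returns a different, up-to-date answer as soon as the store is replaced, and the passage it quotes is available for inspection either way.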

Legacy — and the catch

Five years on, "RAG" no longer names a specific model — it names the whole pattern behind grounded enterprise assistants and search-augmented chat. Almost every LLM product that answers from your documents is running some descendant of this 2020 recipe.

What it unlocked

Decoupled knowledge from weights — fresh facts without retraining
Answers come with receipts: passages you can audit
Became the standard architecture for grounded LLM products

The limits

Retrieval quality caps answer quality — garbage in, garbage out
The model can still ignore or override what it retrieved
Chunking, indexing, and embedding choices are a whole engineering discipline

Go deeper

Read the original: arXiv:2005.11401. For the machinery behind the story, see RAG mechanics, Vector databases, Hallucinations, and Context windows. Next paper: DDPM (2020).