RAG (2020) — Look It Up Before You Answer
The world before this paper
In 2020, everything a language model knew was frozen the moment training stopped. Ask a factual question and the model answered from memory alone — no books, no search, no sources. When the memory was right, it looked like magic. When it was wrong, it looked exactly the same. The field had built fluent talkers with sealed, unverifiable, slowly-rotting memories.
Everything the model "knew" was baked into its parameters at training time — frozen, impossible to inspect, and expensive to update.
On knowledge-heavy questions, models answered confidently and incorrectly — with no way to point at a source for any claim.
Changing a single fact meant retraining the whole model. World knowledge and language skill were fused into the same weights.
The key idea
In 2020, Patrick Lewis and a team at Facebook AI Research made a different bet: a model shouldn't have to memorize the library; give it a library card. The paper, Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (NeurIPS 2020), wired together two existing pieces. DPR, a dense retriever, turns the question into a vector and finds the closest passages in an indexed Wikipedia. BART, a seq2seq generator, writes the answer while reading those passages. Crucially, the retriever and generator are fine-tuned jointly: the query encoder learns to fetch whatever helps the generator answer well, while the passage encoder and the index stay fixed.
The paper even shipped two flavors: RAG-Sequence, which conditions the whole answer on one retrieved passage at a time and then weighs the per-passage results by retrieval score, and RAG-Token, which can lean on a different passage for each token, drawing on one document mid-sentence and another the next. Want the full mechanics? See RAG mechanics.
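To make the difference between the two flavors concrete, here is a minimal numeric sketch. The probabilities are made up for illustration, and the function names `rag_sequence` and `rag_token` are hypothetical, not from the paper's code. RAG-Sequence scores the whole answer under each passage and then mixes the results; RAG-Token mixes the passages at every token and then multiplies.

```python
import math

# Hypothetical retriever scores p(z | x) for two retrieved passages.
retriever_probs = [0.6, 0.4]

# Hypothetical generator probabilities p(y_i | x, z, y_<i) for a two-token
# answer: token_probs[z][i] is the probability of token i given passage z.
token_probs = [
    [0.9, 0.1],  # passage 0
    [0.2, 0.8],  # passage 1
]


def rag_sequence(retriever_probs, token_probs):
    """Sum over passages of p(z|x) times the product over tokens."""
    return sum(
        p_z * math.prod(probs)
        for p_z, probs in zip(retriever_probs, token_probs)
    )


def rag_token(retriever_probs, token_probs):
    """Product over tokens of the passage-weighted mixture."""
    n_tokens = len(token_probs[0])
    return math.prod(
        sum(p_z * probs[i] for p_z, probs in zip(retriever_probs, token_probs))
        for i in range(n_tokens)
    )


seq_score = rag_sequence(retriever_probs, token_probs)
tok_score = rag_token(retriever_probs, token_probs)
print(f"RAG-Sequence: {seq_score:.4f}")
print(f"RAG-Token:    {tok_score:.4f}")
```

In this toy case neither passage supports both tokens well, so RAG-Sequence gives the answer a lower score than RAG-Token, which can take the first token from passage 0 and the second from passage 1.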
Separate knowledge from language: embed the question, let a dense retriever (DPR) pull the most relevant passages from a Wikipedia index, and have a generator (BART) write the answer conditioned on that evidence — trained end-to-end.
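The steps above can be sketched in a few lines. This is a toy under loud assumptions: `embed` is a bag-of-words stand-in rather than a learned dense encoder like DPR, similarity is cosine rather than the inner-product search used with a real index, and `build_generator_input` only assembles the text a generator such as BART would read. All function names and passages here are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an encoder: lowercase word counts (not DPR)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, k=2):
    """Return the k passages most similar to the question."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_generator_input(question, retrieved):
    """Concatenate the evidence and the question for the generator to read."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(retrieved))
    return f"{context}\n\nQuestion: {question}\nAnswer:"


passages = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "The Eiffel Tower is located in Paris.",
    "BART is a sequence-to-sequence model.",
]
question = "Where is the Eiffel Tower located?"
top = retrieve(question, passages, k=1)
prompt = build_generator_input(question, top)
print(prompt)
```

The point of the split is visible in the code: `passages` can be inspected and edited without touching anything that plays the role of the model.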
From closed book to open book
Watch one niche question travel through both worlds: first a bare model answering from frozen memory, then the same question embedded, matched against a document store, and answered with the receipts attached.
The results that mattered
This wasn't just a nicer architecture diagram. RAG took the top spot on the open-domain QA leaderboards of its day, and its generations were measurably more specific and factual than the parametric-only baseline — same language skill, better memory.
At answer time the model reads a dense Wikipedia index, so its knowledge lives outside the weights, in a store you can inspect.
Natural Questions, WebQuestions, CuratedTrec — RAG topped all three, beating models that relied on memorization alone.
To teach the model new facts, swap the index. The weights never move — the world changes, the model keeps up.
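A rough sketch of what swapping the index means in practice, assuming a deliberately simple setup: the `Index` class, the word-overlap scoring, and the company facts are all invented for illustration, and `answer` just returns the best-matching passage where a real system would run a generator over it.

```python
import re


def tokens(text):
    """Lowercase word set used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class Index:
    """Hypothetical document store; swapping it is the 'knowledge update'."""

    def __init__(self, passages):
        self.passages = list(passages)

    def top1(self, question):
        q = tokens(question)
        return max(self.passages, key=lambda p: len(q & tokens(p)))


def answer(question, index):
    """Frozen 'reader': this function never changes between the two runs."""
    return index.top1(question)


# Made-up facts about a made-up company.
old_index = Index(["The CEO of Acme is Alice.", "Acme sells anvils."])
new_index = Index(["The CEO of Acme is Bob.", "Acme sells anvils."])

question = "Who is the CEO of Acme?"
before = answer(question, old_index)
after = answer(question, new_index)
print(before)
print(after)
```

The same `answer` function gives a different result only because the store it reads from changed.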
Legacy — and the catch
Five years on, "RAG" no longer names a specific model — it names the whole pattern behind grounded enterprise assistants and search-augmented chat. Almost every LLM product that answers from your documents is running some descendant of this 2020 recipe.
Legacy
- Decoupled knowledge from weights: fresh facts without retraining
- Answers come with receipts: passages you can audit
- Became the standard architecture for grounded LLM products
The catch
- Retrieval quality caps answer quality: garbage in, garbage out
- The model can still ignore or override what it retrieved
- Chunking, indexing, and embedding choices are a whole engineering discipline
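As one small illustration of the chunking choices mentioned in the last bullet, here is a sketch of fixed-size word chunking with optional overlap. The function name `chunk_words` and its parameters are hypothetical, and real pipelines often split on sentences or document structure instead.

```python
def chunk_words(text, size=100, overlap=0):
    """Split text into chunks of `size` words, sharing `overlap` words."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
disjoint = chunk_words(sample, size=4, overlap=0)
overlapping = chunk_words(sample, size=4, overlap=1)
print(disjoint)
print(overlapping)
```

Larger chunks carry more context per passage but dilute the match with the question; overlap reduces the chance that an answer is cut in half at a boundary, at the cost of a bigger index.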
Read the original: arXiv:2005.11401. For the machinery behind the story, see RAG mechanics, Vector databases, Hallucinations, and Context windows.