How RAG Works
What it is
RAG — Retrieval-Augmented Generation — lets a language model answer questions using your documents instead of only what it memorised during training.
The idea is simple: before the model writes an answer, fetch a handful of the most relevant snippets from your knowledge base and paste them into the prompt. The model then composes a reply grounded in those snippets — not in whatever it half-remembers from the open internet.
Give the LLM the right pages from your library before it answers — so it reads first and writes second.
Why we need it
Large language models are powerful but bounded. RAG patches three of their most painful limits.
Knowledge cutoff. The model only knows what existed when it was trained. Anything newer — last week's release notes, today's prices — it has never seen.
Hallucination. When the model is unsure, it often invents plausible-sounding facts. Grounding the answer in retrieved text dramatically reduces this.
Private data. Your wiki, your code, your customer tickets — the model has zero access to any of it unless you bring it in at query time.
The two phases
A RAG system runs in two distinct passes: an offline indexing pipeline you run once (and refresh as data changes), and an online query pipeline that runs every time a user asks something.
Phase 1 — Indexing (offline). Take your documents and turn them into a searchable vector store.
Collect the source material: PDFs, wiki pages, support tickets, code, transcripts.
Split each document into bite-sized passages — paragraphs or fixed-size windows — small enough to fit in the prompt later.
Run every chunk through an embedding model. Each chunk becomes a vector — a list of numbers that captures its meaning.
Store the vectors (with the original text) in a vector database that can find nearest neighbours fast.
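Here is a minimal sketch of the indexing pass in Python, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model as one concrete embedding choice; the paragraph-based splitting and the in-memory dictionary standing in for a vector database are illustrative, not a specific product's API.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

def build_index(documents):
    # 1. Split each document into passages (here: blank-line paragraphs).
    chunks = [p.strip() for doc in documents for p in doc.split("\n\n") if p.strip()]
    # 2. Embed every chunk; normalised vectors make dot product equal cosine similarity.
    vectors = model.encode(chunks, normalize_embeddings=True)
    # 3. "Store" the vectors alongside the original text. A real system would
    #    write these into a vector database instead of keeping them in memory.
    return {"chunks": chunks, "vectors": np.asarray(vectors)}
```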
Phase 2 — Querying (online). When a user asks a question, fetch the most relevant chunks and let the LLM answer.
The user's question arrives in natural language.
Run the question through the same embedding model so it lives in the same vector space as your chunks.
Find the top-k chunks whose vectors are closest to the question vector — usually by cosine similarity.
Stitch the retrieved chunks into a prompt template alongside the question and any system instructions.
Send the prompt to the model. It writes the reply using the retrieved context as its source of truth.
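A matching sketch of the query pass, continuing the illustrative build_index example above; the final call_llm is a placeholder, since any chat-completions API slots in there.

```python
def answer(question, index, top_k=3):
    # Embed the question with the *same* model, so it lands in the same vector space.
    q = model.encode([question], normalize_embeddings=True)[0]
    # Cosine similarity against every chunk (dot product of unit vectors),
    # then take the indices of the top-k highest scores.
    scores = index["vectors"] @ q
    best = scores.argsort()[::-1][:top_k]
    context = "\n\n".join(index["chunks"][i] for i in best)
    # Stitch the retrieved chunks and the question into a prompt template.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return call_llm(prompt)  # placeholder for whichever chat/completions API you use
```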
Watch the flow
The animation walks the full pipeline end-to-end: documents are chunked and embedded into a vector store, then a question arrives, gets embedded, finds its nearest neighbours, and rides into the LLM as context for the final answer.
Embeddings, in one paragraph
An embedding is a fixed-length list of numbers (often 384, 768, or 1536 of them) produced by a neural network for a piece of text. The trick: texts with similar meaning land at nearby points in this high-dimensional space, so geometric closeness becomes a proxy for semantic similarity. That's why a question about "refund window" can pull back a chunk titled "return policy" even though they share no words — their vectors point in nearly the same direction. Closeness is usually measured with cosine similarity: the cosine of the angle between two vectors, ranging from -1 (opposite) through 0 (unrelated) to 1 (identical direction).
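Cosine similarity itself is a one-liner. A small numpy sketch with toy vectors (the numbers are made up, not real embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(angle) = (a . b) / (|a| * |b|): 1 same direction, 0 unrelated, -1 opposite
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([0.2, 0.9, 0.1])    # toy vector for "refund window"
b = np.array([0.25, 0.85, 0.0])  # toy vector for "return policy"
print(cosine_similarity(a, b))   # ~0.99: nearly the same direction
```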
Chunking choices
How you cut up the documents has a bigger impact on quality than most people expect. Three knobs do most of the work.
Chunk size. Too small and a chunk loses context. Too big and a single chunk crowds out room for others in the prompt — and dilutes the signal in its embedding.
Overlap. Let consecutive chunks share a few sentences. Stops a key fact from being chopped in half right at a chunk boundary.
Splitting strategy. Fixed-size windows are simple and fast. Semantic splitting (by section, paragraph, or topic shift) keeps related ideas together and usually retrieves better.
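A sketch of the simplest strategy, fixed-size windows with overlap; the sizes are illustrative, and real systems often count tokens rather than characters:

```python
def chunk_text(text, size=500, overlap=100):
    # Slide a fixed-size window over the text; consecutive windows share
    # `overlap` characters so a fact straddling a boundary survives intact.
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece.strip():
            chunks.append(piece)
        if start + size >= len(text):
            break
    return chunks
```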
When RAG helps — and when it doesn't
RAG is a good fit when:
- The answer lives in a specific document the model has never seen
- The corpus is large — too big to fit in any single prompt
- Source data changes often and you can't keep retraining
- You need citations back to the original passages
It falls short when:
- The question requires synthesising across many chunks at once
- Retrieval brings back off-topic or contradictory passages
- The answer needs multi-step reasoning beyond what the snippets contain
- The corpus is tiny — just stuff it all in the prompt directly
Even when RAG is the right tool, a few pitfalls recur:
- Stale index — documents change but nobody re-embeds them. Answers quietly drift out of date.
- Lost in the middle — LLMs pay most attention to the start and end of the prompt. Crucial chunks buried in the middle can be ignored.
- Recall vs. precision — a low k may miss the relevant chunk; a high k floods the prompt with noise. Tune for your data.
- No fallback — if retrieval returns nothing relevant, the model will still answer something. Detect low similarity scores and say "I don't know" instead.
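That last pitfall is cheap to guard against. A sketch that builds on the illustrative answer function above, using a similarity threshold (the 0.3 cut-off is a made-up starting point; tune it on your own data):

```python
def answer_with_fallback(question, index, top_k=3, min_score=0.3):
    # Score the question against every chunk, exactly as in answer() above.
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = index["vectors"] @ q
    if scores.max() < min_score:
        # Nothing in the corpus is close enough: refuse rather than guess.
        return "I don't know: I couldn't find anything relevant in the documents."
    return answer(question, index, top_k=top_k)
```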