How RAG Works


What it is

RAG — Retrieval Augmented Generation — lets a language model answer questions using your documents instead of only what it memorised during training.

The idea is simple: before the model writes an answer, fetch a handful of the most relevant snippets from your knowledge base and paste them into the prompt. The model then composes a reply grounded in those snippets — not in whatever it half-remembers from the open internet.

In one sentence

Give the LLM the right pages from your library before it answers — so it reads first and writes second.

Why we need it

Large language models are powerful but bounded. RAG patches three of their most painful limits.

Knowledge cutoff: frozen in time

The model only knows what existed when it was trained. Anything newer — last week's release notes, today's prices — it has never seen.

Hallucination: confident but wrong

When the model is unsure, it often invents plausible-sounding facts. Grounding the answer in retrieved text dramatically reduces this.

Private data: never trained on

Your wiki, your code, your customer tickets — the model has zero access to any of it unless you bring it in at query time.

The two phases

A RAG system runs in two distinct passes: an offline indexing pipeline you run once (and refresh as data changes), and an online query pipeline that runs every time a user asks something.

Phase 1 — Indexing (offline). Take your documents and turn them into a searchable vector store.

Step 1: Documents

Collect the source material: PDFs, wiki pages, support tickets, code, transcripts.

Step 2: Chunking

Split each document into bite-sized passages — paragraphs or fixed-size windows — small enough to fit in the prompt later.
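The fixed-window split can be sketched in a few lines. This is a minimal illustration, counting size in words as a rough stand-in for tokens; a real pipeline would count with the model's own tokenizer:

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into fixed-size word windows with overlap.

    Sizes are in words here, a rough proxy for tokens; real
    pipelines count tokens with the embedding model's tokenizer.
    """
    words = text.split()
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

A 500-word document with these defaults yields three chunks, with each chunk repeating the last 40 words of the previous one.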

Step 3: Embeddings

Run every chunk through an embedding model. Each chunk becomes a vector — a list of numbers that captures its meaning.
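A real embedding model is a trained neural network, but the interface is easy to mimic. The toy below hashes words into buckets and normalises the counts — it captures word overlap rather than meaning, so treat it purely as a stand-in that produces vectors of the right shape:

```python
import hashlib
import math

def toy_embed(text, dim=64):
    """Toy embedding: hash each word into one of `dim` buckets,
    count hits, then L2-normalise. A stand-in for a real embedding
    model; it measures word overlap, not semantics."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]
```

Swapping in a genuine model changes only the function body; everything downstream (storage, search, prompting) stays identical.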

Step 4: Vector DB

Store the vectors (with the original text) in a vector database that can find nearest neighbours fast.
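A vector database can be understood as a glorified version of this minimal in-memory sketch: it keeps (vector, text) pairs and scores them against a query by cosine similarity. Production systems add an approximate index (e.g. HNSW) so search stays fast at millions of vectors:

```python
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class TinyVectorStore:
    """Minimal in-memory vector store: exact (brute-force) top-k
    search by cosine similarity. Illustrative only; real stores
    use approximate nearest-neighbour indexes for speed."""

    def __init__(self):
        self.items = []  # list of (vector, original_text)

    def add(self, vector, text):
        self.items.append((vector, text))

    def search(self, query, k=3):
        scored = [(_cosine(query, v), t) for v, t in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]  # best (score, text) pairs first
```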

Phase 2 — Querying (online). When a user asks a question, fetch the most relevant chunks and let the LLM answer.

Step 1: Question

The user's question arrives in natural language.

Step 2: Embed query

Run the question through the same embedding model so it lives in the same vector space as your chunks.

Step 3: Vector search

Find the top-k chunks whose vectors are closest to the question vector — usually by cosine similarity.

Step 4: Build prompt

Stitch the retrieved chunks into a prompt template alongside the question and any system instructions.
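The stitching step is plain string assembly. The template below is one illustrative choice, not a standard — numbering the chunks makes it easy for the model to cite them:

```python
def build_prompt(question, chunks,
                 system="Answer using only the context below. "
                        "If the context is insufficient, say you don't know."):
    """Stitch retrieved chunks and the user's question into one prompt.
    The exact template is an illustrative choice, not a standard."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"{system}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
```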

Step 5: LLM answers

Send the prompt to the model. It writes the reply using the retrieved context as its source of truth.

The full flow

End to end: documents are chunked and embedded into a vector store; then a question arrives, gets embedded, finds its nearest neighbours, and rides into the LLM as context for the final answer.

Embeddings, in one paragraph

An embedding is a fixed-length list of numbers (often 384, 768, or 1536 of them) produced by a neural network for a piece of text. The trick: texts with similar meaning land at nearby points in this high-dimensional space, so geometric closeness becomes a proxy for semantic similarity. That's why a question about "refund window" can pull back a chunk titled "return policy" even though they share no words — their vectors point in nearly the same direction. Closeness is usually measured with cosine similarity: the cosine of the angle between two vectors, ranging from -1 (opposite) through 0 (unrelated) to 1 (identical direction).
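The similarity measure itself is a one-liner, and the three landmark values in the paragraph above are easy to verify directly:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors:
    1 = same direction, 0 = orthogonal (unrelated), -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Note that [1, 2] and [2, 4] score exactly 1.0: cosine ignores magnitude and compares direction only.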

Chunking choices

How you cut up the documents has a bigger impact on quality than most people expect. Three knobs do most of the work.

Chunk size: 200–1000 tokens

Too small and a chunk loses context. Too big and a single chunk crowds out room for others in the prompt — and dilutes the signal in its embedding.

Overlap: 10–20%

Let consecutive chunks share a few sentences. Stops a key fact from being chopped in half right at a chunk boundary.

Fixed vs. semantic: how to split

Fixed-size windows are simple and fast. Semantic splitting (by section, paragraph, or topic shift) keeps related ideas together and usually retrieves better.
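A simple middle ground between the two: split on paragraph breaks, then greedily merge consecutive paragraphs up to a size budget. This is a rough sketch of semantic splitting (word counts again standing in for tokens), not a full topic-shift detector:

```python
import re

def split_paragraphs(text, max_words=300):
    """Semantic-ish splitting: cut on blank lines (paragraph
    boundaries), then merge consecutive paragraphs until a chunk
    nears max_words. Keeps related sentences together better than
    a fixed window, at the cost of uneven chunk sizes."""
    paras = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, count = [], [], 0
    for p in paras:
        n = len(p.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(p)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```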

When RAG helps — and when it doesn't

Works well when
  • The answer lives in a specific document the model has never seen
  • The corpus is large — too big to fit in any single prompt
  • Source data changes often and you can't keep retraining
  • You need citations back to the original passages
Struggles when
  • The question requires synthesising across many chunks at once
  • Retrieval brings back off-topic or contradictory passages
  • The answer needs multi-step reasoning beyond what the snippets contain
  • The corpus is tiny — just stuff it all in the prompt directly
Common pitfalls
  • Stale index — documents change but nobody re-embeds them. Answers quietly drift out of date.
  • Lost in the middle — LLMs pay most attention to the start and end of the prompt. Crucial chunks buried in the middle can be ignored.
  • Recall vs. precision — a low k may miss the relevant chunk; a high k floods the prompt with noise. Tune for your data.
  • No fallback — if retrieval returns nothing relevant, the model will still answer something. Detect low similarity scores and say "I don't know" instead.
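The last pitfall has a cheap mitigation: gate the LLM call on retrieval quality. A sketch, assuming the search step returns (score, chunk) pairs sorted best-first; the threshold value is corpus-dependent and should be tuned on held-out queries:

```python
def answer_or_abstain(hits, threshold=0.25):
    """Gate the LLM call on retrieval quality.

    `hits` is a list of (similarity_score, chunk) pairs, best first.
    If the best score is below `threshold`, return None so the caller
    can reply "I don't know" instead of letting the model guess.
    The threshold is corpus-dependent; tune it on held-out queries."""
    if not hits or hits[0][0] < threshold:
        return None
    return [chunk for score, chunk in hits if score >= threshold]
```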