RAG from First Principles

One-line summary: RAG (retrieval-augmented generation) doesn't make a language model know more; it slips the right text into the prompt at query time so the model can answer from a source instead of from memory.

Why this exists / what it solves: A language model answers from frozen weights. If a fact was never in its training data (private, recent, proprietary) or was seen but not memorized, the model doesn't say "I don't know" — it produces something equally fluent and possibly wrong, with no tell. You can't fix this by inspecting the output, because grounded and fabricated answers look identical. RAG fixes it upstream: fetch a relevant source first, put it in the prompt, and bias the model toward answering from it.

What follows is the reasoning chain that gets you there, each link a question.

What problem does RAG actually solve?

Not "hallucination," exactly. The model can't tell you when it's guessing, and you can't tell from the text either — it's confident both ways. RAG's job is to supply a source the answer can rest on. It doesn't detect fabrication; it makes grounding available.

So does RAG guarantee a grounded answer?

No. This is the most-skipped truth in the whole topic. The model can still ignore the retrieved text and answer from its weights.

context: "Returns accepted within 30 days of delivery."
question: "What's the return window?"
model:    "Most stores allow 90 days."   # ignored the context — still possible

RAG biases toward the context. It does not bind the model to it. Everything else follows from accepting this.
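One common way to strengthen that bias is to state it in the prompt. A minimal sketch, assuming a plain string prompt: `build_grounded_prompt` and its instruction wording are hypothetical, not a standard API, and the instruction raises the odds of a grounded answer without guaranteeing one.

```python
def build_grounded_prompt(context: str, question: str) -> str:
    """Assemble a prompt that asks the model to answer only from the context.

    Hypothetical helper: the instruction text is an assumption, and a model
    can still ignore it. This biases the answer; it does not bind it.
    """
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    context="Returns accepted within 30 days of delivery.",
    question="What's the return window?",
)
print(prompt)
```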

What did we use before RAG?

Two different things people often confuse:

Keyword search (TF-IDF, BM25) — for finding text. Decades old, still works, no neural nets.
Fine-tuning — for teaching a model, by nudging its weights on domain data.

Fine-tuning is not retraining-from-scratch. Full per-domain retraining was never standard — too expensive. The real predecessors are the two above.

Why not just fine-tune the knowledge in?

Fine-tuning bakes data in statically and costs a training run every time the data changes.

fine-tune:  new data  -> training run -> new weights   (hours/$, per update)
retrieve:   new data  -> insert row   -> done          (instant, no weights touched)

Updating a row is cheap and immediate; updating weights is neither. Retrieval wins when data changes often or is private. Fine-tuning still wins for behavior and style. Different tools.
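The "insert row, done" path can be sketched in a few lines. This assumes an in-memory list as the store and a crude word-overlap retriever; `retrieve`, `tokens`, and the holiday policy text are made up for illustration.

```python
import re

# The "database": one row to start with.
docs = ["Returns accepted within 30 days of delivery."]


def tokens(text: str) -> set:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, store: list) -> str:
    """Return the stored document sharing the most words with the query."""
    return max(store, key=lambda doc: len(tokens(doc) & tokens(query)))


query = "holiday returns extended deadline"
before = retrieve(query, docs)

# The update: append one row. No training run, no weights touched.
# (Invented sample policy, not from any real store.)
docs.append("Holiday returns: extended deadline of 60 days for December orders.")

after = retrieve(query, docs)
print(before)
print(after)
```

The very next query sees the new row. The model itself never changed.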

Why can't the model just know it?

Because one of three things is true: the fact was never in training, was seen but not memorized, or was memorized but not reliably recalled. You can't distinguish these from the output. So you stop trying to diagnose memory and just supply the source.

What is retrieval, minimally?

Any step that, given a query, returns relevant text from an external source at query time. That's it. No embeddings in the definition.

-- This is still RAG:
SELECT chunk FROM docs WHERE chunk LIKE '%return window%';
-- feed the hit into the prompt -> model answers from it

If keyword search finds the text and you put it in the prompt, that is RAG. Embeddings are one implementation, not the price of entry.
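The same query can be run end to end with Python's built-in `sqlite3` module. A minimal sketch, assuming an in-memory table named `docs` with sample rows; the prompt layout is an assumption, and the model call itself is left out.

```python
import sqlite3

# In-memory database with a couple of sample rows (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (chunk TEXT)")
conn.executemany(
    "INSERT INTO docs (chunk) VALUES (?)",
    [
        ("Our return window is 30 days from delivery.",),
        ("Standard shipping takes 3 to 5 business days.",),
    ],
)


def retrieve(phrase: str) -> list:
    """Keyword retrieval: return every chunk containing the phrase."""
    rows = conn.execute(
        "SELECT chunk FROM docs WHERE chunk LIKE ?", (f"%{phrase}%",)
    ).fetchall()
    return [row[0] for row in rows]


question = "What's the return window?"
hits = retrieve("return window")

# Put the hits in front of the question; this string is what the model sees.
prompt = "Context:\n" + "\n".join(hits) + f"\n\nQuestion: {question}\nAnswer:"
print(prompt)
```

No vectors anywhere, and it still fits the definition: relevant external text, fetched at query time, placed in the prompt.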

Then why use embeddings at all?

Because keyword search only matches words, and a question rarely shares words with its answer:

Q: "When is the return window?"
A: "Items may be sent back within 30 days of delivery."
   ^ almost no overlapping words — keyword search can miss this

An embedding model is trained so that text used in similar contexts lands nearby in vector space (retrieval embedders are often further trained on question-answer pairs). So it can place the question near its answer even when the two share no words. That's what embeddings buy you: matching on usage context, not vocabulary.
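The vocabulary gap in the example above can be checked directly. A small sketch, assuming simple lowercase word tokenization; it only measures the gap that keyword search faces. Closing it takes an embedding model, which is not shown here.

```python
import re


def words(text: str) -> set:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


question = "When is the return window?"
answer = "Items may be sent back within 30 days of delivery."

# Words the question and its answer have in common.
shared = words(question) & words(answer)
print(shared)  # empty set: nothing for a keyword matcher to score on
```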

Does "close in vector space" always mean "answers the question"?

No, and knowing where it breaks matters as much as knowing why it works. A chunk can be about the topic without containing the answer:

Q: "What's the return window?"
near-but-wrong: "Our return policy reflects our commitment to customers."
                ^ topically similar, contains no answer

Similarity tracks aboutness, not answer-bearing. The practical hedge is top-k: retrieve several candidates, not one, so the answer-bearing chunk is more likely to be in the set even when the closest one isn't it.

hits = index.search(query, k=5)   # not k=1 — buy yourself margin
prompt = context(hits) + question
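To see why k matters, here is a runnable toy version of that search. Everything in it is an assumption for illustration: the 3-d vectors are hand-made stand-ins for real embeddings, chosen so the topical chunk sits closest to the query, and `search` is a hypothetical brute-force function, not a real index API.

```python
import math

# Hand-made 3-d vectors standing in for real embeddings (invented for illustration).
chunks = [
    ("Our return policy reflects our commitment to customers.", [0.95, 0.30, 0.10]),
    ("Items may be sent back within 30 days of delivery.", [0.70, 0.70, 0.10]),
    ("Standard shipping takes 3 to 5 business days.", [0.10, 0.20, 0.95]),
]
query_vec = [0.90, 0.40, 0.10]  # stand-in for embedding "What's the return window?"


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(vec, k):
    """Brute-force top-k: rank every chunk by similarity, keep the best k texts."""
    ranked = sorted(chunks, key=lambda c: cosine(vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


top1 = search(query_vec, k=1)  # only the topical, answer-free chunk
top3 = search(query_vec, k=3)  # the answer-bearing chunk is now in the set
print(top1)
print(top3)
```

With these toy vectors, k=1 returns the "commitment to customers" chunk and the answer never reaches the prompt. A larger k costs a few more tokens and gets it in.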

Why chunk documents instead of embedding the whole thing?

Not because you can't embed a whole document: you can, until you hit the embedder's token limit. The deeper reason would hold even if input length were unlimited: semantic dilution. One vector averaging 50 pages points everywhere and nowhere; it represents the mean meaning and matches no specific query sharply.

1 vector for 50 pages -> blurred average -> weak match to any specific query
many vectors, 1 per passage -> each points somewhere precise -> sharp matches

Chunk so each vector means one thing.
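A minimal chunker, as a sketch. It assumes fixed-size word windows with a small overlap so a sentence cut at one boundary still appears whole in a neighbor; `chunk_words` and its default sizes are hypothetical, and real pipelines often split on sentence or passage boundaries instead.

```python
def chunk_words(text: str, size: int = 40, overlap: int = 10) -> list:
    """Split text into windows of `size` words, each sharing `overlap` words
    with the previous window. Sizes here are illustrative defaults."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# 100 placeholder words: w0, w1, ..., w99
sample = " ".join(f"w{i}" for i in range(100))
chunks = chunk_words(sample, size=40, overlap=10)
print(len(chunks))  # 3 chunks: words 0-39, 30-69, 60-99
```

Each chunk then gets its own vector, so each vector covers one passage rather than the whole document.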

Why not skip retrieval and put everything in a huge context window?

Two limits. Cost scales with tokens, so stuffing everything is expensive on every query. And the window is finite: once the input exceeds it, something has to be truncated or left out. Selection isn't a workaround for small windows; it's cheaper and sharper even when the window is large.
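The cost half of that argument is plain arithmetic. A back-of-envelope sketch with made-up numbers: the price, the corpus size, and the chunk size are all assumptions, so only the ratio is meaningful.

```python
def prompt_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Input cost of one query, assuming price scales linearly with tokens."""
    return tokens / 1_000_000 * usd_per_million_tokens


PRICE = 1.0                   # hypothetical price in USD per million input tokens
full_corpus_tokens = 200_000  # assumed: every document stuffed into the prompt
selected_tokens = 5 * 200     # assumed: top-5 chunks of about 200 tokens each

stuff = prompt_cost(full_corpus_tokens, PRICE)
select = prompt_cost(selected_tokens, PRICE)
ratio = full_corpus_tokens / selected_tokens

print(f"stuff everything: ${stuff:.4f} per query")
print(f"select top-5:     ${select:.4f} per query")
print(f"ratio: {ratio:.0f}x")
```

Under these assumptions the stuffed prompt costs 200 times more per query, and that multiplier applies to every query, not once.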

The first principles it rests on

A language model answers from frozen weights; it cannot natively know private, recent, or unmemorized facts.
Grounded and fabricated outputs are indistinguishable from the text alone — the model is equally fluent either way.
RAG biases toward supplied context; it does not guarantee the model uses it.
Retrieval = returning relevant external text at query time. Embeddings are optional; keyword search counts.
Embeddings match on usage context, letting a question find an answer it shares no words with.
Vector similarity tracks aboutness, not answer-bearing relevance — hence top-k, not top-1.
Chunk to avoid semantic dilution: one vector should mean one thing.
Retrieval beats fine-tuning when data changes or is private (update a row, not the weights); fine-tuning beats retrieval for behavior and style.
Token cost and finite context make selection cheaper and sharper than feeding everything.