All writing

RAG Implementation

RAG Implementation Playbook — A Decision Path, Not a Tutorial

One line: Build retrieval as hybrid (dense + lexical) → fuse → rerank → generate, prove every choice on a 50–200 pair labeled set, and treat the latency budget as the constraint that vetoes half the "advanced" tricks.

What this build optimizes for: correctness you can measure, under a tight (<300 ms) latency budget, on a mixed corpus (structured PDFs + chat logs + DB rows), near-real-time freshness, API calls permitted with a DPA. Where your goals differ, each section has a decision table so you can re-pick.

Model and library names below are accurate as of June 2026 and move quarterly. Wrap every model behind one interface (see §3) so swapping is a config change, not a refactor.


The binding constraint: read this first

Most RAG tutorials ignore latency until production. With a <300 ms end-to-end budget, the math vetoes choices before you make them. A realistic hot-path breakdown:

Stage Cost Notes
Embed the query (API) 50–120 ms One short query, but it's a network round-trip. The hidden tax.
Vector search (HNSW) 5–25 ms Scales with ef_search and corpus size
BM25 / lexical search 5–20 ms Run in parallel with vector
RRF fusion <1 ms Negligible
Rerank (cross-encoder, top-20) 80–150 ms Cohere Rerank 4 p50; fewer candidates = less time
Subtotal before generation ~140–315 ms Already at/over budget
LLM generation 200 ms–2 s The elephant; usually streamed

Conclusion that shapes the whole build: you cannot have an API query-embedding and a cross-encoder rerank and the LLM all inside 300 ms. Something gives. The three escape hatches:

  1. Self-host query embedding (BGE-M3 on a warm GPU: ~5–15 ms vs 50–120 ms API). Biggest single win.
  2. Rerank fewer candidates (top-20, not top-100) or use a lite reranker (Voyage rerank-2.5-lite).
  3. Define the budget as retrieval-only (exclude streamed generation). Most teams quietly do this — be explicit about it.

If "<300 ms" includes generation, you're effectively choosing #1 + #2. The playbook assumes that.


§1 — Chunking: how do you cut documents into retrievable units?

Q: One strategy for everything? A: No — strategy follows source type. Fixed-size token splitting is the dumb baseline that forces you to paper over lost context with overlap. Match the cut to the structure.

Source Strategy Size / overlap Why
Structured PDF Recursive/heading-aware + contextual retrieval section or 512 tok / 0% Respects boundaries; context blurb restores what splitting destroys
Chat logs Window by turn or session per-turn / 0% Token-splitting severs question from answer
DB rows One row = one chunk, templated to a sentence 1 row / n/a Rows are already atomic; never token-split them
Unstructured prose Semantic (embedding-boundary) ~512 tok / 0–10% Splits at topic shifts, not arbitrary token counts

Contextual Retrieval (Anthropic, the highest-leverage move for PDFs): prepend an LLM-generated 1–2 sentence "where this chunk sits" blurb to each chunk before embedding and BM25-indexing. Measured: contextual embeddings alone cut failed retrievals ~35%; with contextual BM25, ~49%; with reranking on top, ~67% (5.7% → 1.9% failure rate). One-time cost with prompt caching ≈ $1.02 per million document tokens.

# Painful way: fixed 512-token windows, 50-token overlap to "preserve context"
# -> duplicated tokens, duplicate hits, and a chunk that still says
#    "revenue grew 3%" with no idea whose revenue or which quarter.
chunks = [text[i:i+512] for i in range(0, len(text), 512-50)]

# Clean way: structural split + a context blurb generated once per chunk.
context = llm(f"Give a 1-sentence locator for this chunk within the doc:\n{doc}\n\nChunk:\n{chunk}")
embeddable = f"{context}\n\n{chunk}"   # embed AND bm25-index this enriched string

Overlap is a band-aid, not a default. It only helps fixed-size splitting. With heading-aware or contextual chunking, 0% overlap is correct — extra overlap just inflates your index and creates duplicate retrievals.


§2 — Chunk size: 256, 512, or 1024 tokens?

Q: Which size? A: EMPIRICAL — it trades precision vs context and only your eval set decides. Default 512 / 10% for prose, then sweep.

  • Small (256): sharp embeddings, but a chunk may omit the context that answers the question. More chunks → bigger index, more embedding spend.
  • Large (1024): carries context, but dilutes the embedding (one vector for many ideas) and triggers "lost in the middle" when stuffed into the LLM.
  • Cost coupling: halving chunk size ≈ doubles vector count ≈ doubles index RAM and embed cost.

Sweep {256, 512, 1024} and read nDCG@10 and recall@5 off the harness in §11. Don't guess; the right size is corpus-specific.


§3 — Embedding model: API or self-host?

Q: Which model, given API is allowed? A: API is permitted (your fork A), so start with a strong API model for quality and zero ops. But your <300 ms budget makes query-side embedding latency a real cost — consider a small self-hosted model for queries even if you embed documents via API.

Model Type Strengths $/M tokens Pick when
OpenAI text-embedding-3-large API easiest, stable, strong ~$0.13 Default first build
Gemini Embedding API tops MTEB-v2 English (~68.3), Matryoshka ~$0.15 Want top quality + dim truncation
Cohere embed-v4 API multilingual, 128K context, multimodal ~$0.12 Long docs / many languages
OpenAI text-embedding-3-small API best value $0.02 Cost-sensitive, decent quality
BGE-M3 OSS MIT; emits dense + sparse + ColBERT in one pass (free hybrid); fast self-host $0 + GPU Privacy, or low-latency query embedding
Qwen3-Embedding-8B OSS #1 MMTEB multilingual (~70.6), Apache-2.0, 32K ctx $0 + GPU Multilingual, self-host quality ceiling

The lock-in to respect: changing embedding model means re-embedding the entire corpus. Abstract it from day one:

class Embedder:                       # swap implementations, never call sites
    def embed_docs(self, texts: list[str]) -> list[list[float]]: ...
    def embed_query(self, text: str)  -> list[float]: ...
# Reference build: docs via OpenAI 3-large (quality), queries via local BGE-M3 (latency).

Pros/cons by axis: API = linear $/token, 50–200 ms/call, no ops, but data leaves your VPC and you re-pay on every re-embed. Self-host = amortized GPU cost, batchable, in-VPC, but you run the GPU and own the uptime.


§4 — Dimension & quantization: how big is each vector?

Q: Full dimension or truncated? Quantize? A: TRADEOFF you size with arithmetic, plus an EMPIRICAL recall check on quantization.

chunks  ≈ (corpus_tokens / chunk_size) × (1 + overlap)
RAM_HNSW ≈ chunks × dim × 4 bytes × ~1.8     # float32 + graph overhead
# 1M chunks × 1536-d × 4B × 1.8 ≈ 11 GB; at 3072-d ≈ 22 GB

Two levers, each with a recall cost you must measure:

Lever Effect Recall cost Use when
Matryoshka truncation (3072 → 1024 / 768) storage ↓ proportional graceful, small Model supports it (Gemini, Voyage 4, Cohere v4, OpenAI 3-*)
int8 quantization ~4× smaller usually <2 nDCG pts >1M vectors
binary quantization ~32× smaller large; needs rerank-rescoring huge corpora, RAM-bound

Production bands: small <100K, mid 1–10M, large 50M+. Default: full float32 under ~500K chunks; 1024-d Matryoshka + int8 above ~1M. Then verify recall@10 didn't drop more than ~2 points (§13).


§5 — Index type: how is the search structured?

Q: Exact or approximate? A: SETTLED — HNSW for production. Keep a flat index on a sample to measure how much recall your approximation gives up.

Index Recall Latency Memory Use when
Flat (brute force) 100% O(N), slow low <~100K vectors; ground truth for measuring ANN recall
HNSW (graph) ~99% ~4–25 ms high (3–4× IVFFlat) default for online RAG
IVF / IVFPQ ~95–98% fast low (PQ compresses) RAM-constrained
DiskANN high higher disk-resident billion-scale, small RAM

HNSW knobs: M (graph degree, 16–64) and ef_construction (build quality, 100–400) set at build time; ef_search is your runtime recall/latency dial — raise for recall, lower for speed. Under <300 ms, ef_search is where you buy back milliseconds.


§6 — Vector store: where do the vectors live?

Q: Greenfield — which store? A: TRADEOFF of ops-simplicity vs latency-at-scale. With near-real-time freshness and tight latency, a dedicated engine earns its keep; pgvector wins on simplicity if you'll live in Postgres anyway.

Store Profile Latency Hybrid Watch-out
pgvector Postgres extension; free joins/txns/RBAC p95 ~80–140 ms at 5M manual (compose w/ Postgres FTS) slows past ~10M vectors
Qdrant Rust, self-host or cloud ~4 ms p50 native + quantization smaller ecosystem
Weaviate hybrid is first-class (one query, alpha knob) competitive native + rerank modules heavier than pgvector
Milvus billion-scale, GPU, most index options ~6 ms p50 native heaviest ops
Pinecone fully managed, zero ops <30 ms at scale added bill grows; data leaves VPC

Decision rule: under 10M vectors and you want one system → pgvector. Tight latency + near-real-time + open to dedicated (your case) → Qdrant (4 ms, native hybrid, native quantization) is the strongest reference pick. Want zero ops and API is fine → Pinecone.


§7 — Retrieval mode: dense, lexical, or both?

Q: Dense-only or hybrid? A: SETTLED — hybrid (dense + BM25) fused with RRF, because of your corpus. DB rows carry IDs/codes and PDFs carry part numbers and citations; dense embeddings fumble exact tokens, BM25 nails them. This isn't fashion — it's your data.

# RRF: robust, tuning-free fusion of dense + lexical rankings.
def rrf(rankings, k=60):              # k=60 is the standard constant
    scores = {}
    for ranked in rankings:          # [dense_ids, bm25_ids]
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
Mode Catches Misses Use when
Dense-only paraphrase, meaning exact codes, SKUs, rare terms corpus is pure prose
Lexical-only (BM25) exact tokens paraphrase, synonyms keyword-heavy, no semantics
Hybrid + RRF both little anything with identifiers — your build

Tip: BGE-M3 (§3) emits dense and sparse vectors from one model, giving you hybrid without a second system.


§8 — Reranking: do you re-sort the candidates?

Q: Add a reranker? A: SETTLED yes — but the law of rerankers governs how: a reranker fixes ORDER, never RECALL. In benchmarks, no reranker pushed Hit@10 above ~88% because the missing 12% never entered the top-100 candidates. So reranking pays off only when recall is already high and precision is low — invest in the retriever first.

Pattern: retrieve top-20 (hybrid) → cross-encoder rerank → keep top-5 → LLM. Under <300 ms, rerank top-20, not top-100, and pick for speed.

Reranker License Latency add Pick when
Cohere Rerank 4 / 4 Pro API ~80–150 ms p50 strongest managed default
Voyage rerank-2.5-lite API ~half of full tight latency budget — your case
BGE-reranker-v2-m3 Apache-2.0 infra-amortized self-host / privacy
Zerank-2 CC-BY-NC blocks commercial use — avoid in products
# Retrieve wide-ish, rerank narrow, keep tight. Always wrap with a fallback.
candidates = hybrid_search(query, top_n=20)
try:
    ranked = rerank(query, candidates, model="rerank-2.5-lite")[:5]
except (TimeoutError, APIError):
    ranked = candidates[:5]          # degrade to retrieval order, never 500

The eval triangle (score with reranker on vs off, same queries): nDCG@k after rerank (did precision rise?), recall@k preserved (rerank can't add recall — confirm it didn't reorder good chunks out of top-k), and p95/p99 latency (did the cost pay for itself?).


§9 — Query transformation: rewrite, HyDE, multi-query?

Q: Add query transforms? A: EMPIRICAL, default OFF — and under <300 ms, an extra LLM call before retrieval blows the budget outright. This is the field's #1 cargo-cult trap: bolting on HyDE and multi-query because "advanced RAG" diagrams show them, with zero measured lift, often hurting keyword queries.

Transform What Cost Worth it when
None use the query as-is 0 well-formed queries (default)
History-aware rewrite resolve "what about its price?" against prior turns 1 LLM call chat path — the one near-certain win
HyDE embed a hypothetical answer 1 LLM call sparse corpora, exploratory queries; can hurt factual/keyword
Multi-query fan out N variants, union N embeds recall-starved, latency-tolerant
Decomposition split multi-hop into sub-questions several calls genuine multi-hop QA

Reference decision: ship without transforms. Add only history-aware rewriting on the chat path, and accept it's incompatible with <300 ms unless you use a tiny fast rewrite model. Measure before adding anything else.


§10 — Freshness: how do new documents enter the index?

Q: Batch, incremental, or streaming? A: SETTLED for you — near-real-time (seconds) → incremental/streaming upsert, per source, because your sources change at different rates.

Pattern Staleness Complexity Use when
Batch rebuild hours–days low static corpora, nightly is fine
Incremental upsert seconds–minutes medium changed docs only, keyed on content hash / updated_at
Streaming seconds high queue → embed → upsert pipeline
# Re-embed only what changed; idempotent on a stable doc id + content hash.
h = sha256(chunk_text).hexdigest()
if store.get_hash(doc_id) != h:
    store.upsert(doc_id, embed(chunk_text), meta={"hash": h, "ts": now()})

Per-source policy: PDFs rarely change (incremental on file hash); DB rows change often (CDC / updated_at); chat logs append (stream new turns).


§11 — Evaluation: the harness that makes every other choice checkable

Q: How do I know any of this works? A: SETTLED — you can't, without a small labeled set. This is the keystone: every EMPIRICAL branch above is unresolvable until this exists. Build 50–200 (query → relevant chunk_ids) pairs.

Two ways to get the set:

  1. Real queries — collect 50 representative questions, run retrieval, hand-label which returned chunks are actually relevant.
  2. Synthetic — for each of 50 chunks, ask an LLM "what question does this chunk answer?"; the source chunk is the gold label. Fast, slightly optimistic, good enough to start.
# Minimal retrieval-eval harness. No framework needed.
import math

labeled = [
    {"q": "What is the refund window?", "relevant": {"chunk_42"}},
    # ... 50-200 of these ...
]

def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & relevant) / max(1, len(relevant))

def mrr(retrieved, relevant):
    for i, c in enumerate(retrieved, 1):
        if c in relevant:
            return 1 / i
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    dcg = sum(1 / math.log2(i + 1) for i, c in enumerate(retrieved[:k], 1) if c in relevant)
    idcg = sum(1 / math.log2(i + 1) for i in range(1, min(len(relevant), k) + 1))
    return dcg / idcg if idcg else 0.0

def evaluate(pipeline, dataset, k=5):
    R = N = M = 0
    for ex in dataset:
        got = pipeline(ex["q"])              # returns ranked chunk_ids
        R += recall_at_k(got, ex["relevant"], k)
        N += ndcg_at_k(got, ex["relevant"], k)
        M += mrr(got, ex["relevant"])
    n = len(dataset)
    return {"recall@%d" % k: R/n, "nDCG@%d" % k: N/n, "MRR": M/n}

# Compare any two configs by swapping `pipeline`:
print(evaluate(hybrid_only, labeled))
print(evaluate(hybrid_plus_rerank, labeled))   # rerank should lift nDCG, hold recall

Metric meanings: recall@k = did the right chunk make the top-k (retriever's ceiling). nDCG@k = is it ranked high (what reranking improves). MRR = rank of the first hit (matters for chatbots that act on the top result). For generation quality, add faithfulness (answer grounded in retrieved context, no hallucination) and answer-correctness via RAGAS or an LLM-judge — slower, needs the full pipeline.


§12 — Reference architecture (your fork answers, wired together)

Stack: Qdrant (HNSW + native hybrid + int8 quant), OpenAI text-embedding-3-large for documents, BGE-M3 local for query embedding (latency), Voyage rerank-2.5-lite (fits the budget), RRF fusion, contextual retrieval on PDFs, incremental upsert on content hash.


§13 — What to measure on your own data (the EMPIRICAL experiments)

Each resolves a branch that no benchmark can decide for you. Run them against the §11 harness.

  1. Chunk size sweep — index at 256 / 512 / 1024; compare recall@5 and nDCG@10. Pick the knee, not the max.
  2. Contextual-retrieval ROI — index with and without the context blurb; if failed-retrieval rate doesn't drop meaningfully on your corpus, skip the LLM pass and its cost.
  3. Embedding bake-offtext-embedding-3-large vs Gemini vs BGE-M3 on the same labeled set; quality gaps are corpus-specific. Don't trust the leaderboard over your data.
  4. Quantization recall — measure recall@10 at float32 vs int8 vs binary; accept the smallest that stays within ~2 points.
  5. ef_search curve — plot recall vs latency as you raise ef_search; pick the point that hits your 300 ms budget at acceptable recall.
  6. Reranker on/off (eval triangle) — nDCG@5 lift, recall preservation, p95 latency. If nDCG doesn't move on your corpus, drop the reranker and reclaim ~100 ms.
  7. Query-rewrite lift (chat only) — recall@5 with vs without history-aware rewrite on real multi-turn queries.

§14 — Assumptions, and how each one breaks

Every recommendation above rests on an assumption. These are the ones that bite.

  • "Hybrid always helps." Breaks if your corpus is pure prose with no identifiers — BM25 then adds latency for ~no recall gain. Detector: run dense-only vs hybrid on the labeled set; if recall@5 is within noise, drop BM25.
  • "The reranker improves results." Breaks when retrieval recall is low — reranking can't surface what was never retrieved, and reorders a bad list into a differently bad list. Detector: if recall@20 is poor, fix the retriever before touching the reranker.
  • "Contextual retrieval is worth it." Breaks on corpora where chunks are already self-contained (e.g., well-formed DB rows) — you pay the LLM pass for little gain. Detector: experiment #2.
  • "<300 ms is achievable with all the bells on." Breaks the moment you add an API query-embed + cross-encoder + an LLM rewrite. Detector: the §0 budget table — sum your real p95s before committing.
  • "512 tokens is a good default." Breaks for Q&A where answers span sections (too small) or for dense tables (too large). Detector: the chunk-size sweep.
  • "int8 quantization is free." Breaks on corpora with many near-duplicate vectors, where small distance errors flip rankings. Detector: recall@10 delta after quantizing.
  • "The synthetic eval set is representative." Breaks because LLM-generated questions are cleaner and easier than real user queries — your offline numbers will be optimistic. Detector: once you have real traffic, re-label 50 real queries and compare.
  • "API embeddings stay cheap." Breaks at scale (cost is linear in tokens) and on every model upgrade (you re-embed the whole corpus). Detector: project monthly token volume × price before choosing API over self-host.
  • "Model names in this doc are current." Breaks every quarter. Detector: re-check MTEB-retrieval and a neutral reranker leaderboard before each new build; keep the swap behind the §3 interface.

Build order: get hybrid + RRF + a labeled set working first (that's 80% of the quality). Add contextual retrieval and reranking second, measured. Add query transforms last, only if the eval says so.


§15 — Glossary (every short form used above)

Core & general

  • RAG — Retrieval-Augmented Generation
  • LLM — Large Language Model
  • API — Application Programming Interface
  • DPA — Data Processing Agreement (contract letting a vendor process your data)
  • VPC — Virtual Private Cloud (your isolated cloud network)
  • OSS — Open-Source Software
  • PDF — Portable Document Format
  • DB — Database
  • ID — Identifier; SKU — Stock Keeping Unit (a product code)

Retrieval & ranking

  • BM25 — Best Matching 25 (the Okapi BM25 lexical ranking function; scores keyword overlap)
  • RRF — Reciprocal Rank Fusion (merges several ranked lists; k=60 is the standard smoothing constant)
  • ANN — Approximate Nearest Neighbor (fast, slightly-inexact vector search)
  • HyDE — Hypothetical Document Embeddings (embed an LLM-generated hypothetical answer instead of the raw query)
  • ColBERT — Contextualized Late Interaction over BERT (token-level "late interaction" matching)
  • BERT — Bidirectional Encoder Representations from Transformers (the underlying encoder family)
  • top-k / top-20 / top-5 — the k highest-scoring results kept at that stage

Index & storage

  • HNSW — Hierarchical Navigable Small World (graph-based ANN index; the production default)
    • M — graph degree (neighbors per node, build-time)
    • ef_construction — build-time search width (sets index quality)
    • ef_search — query-time search width (your runtime recall/latency dial)
  • IVF — Inverted File (clustering-based ANN index); IVFFlat — IVF storing full vectors
  • PQ — Product Quantization (compresses vectors); IVFPQ — IVF + Product Quantization
  • DiskANN — Disk-based Approximate Nearest Neighbor (graph index kept on disk for huge corpora)
  • RAM — Random-Access Memory; GPU — Graphics Processing Unit
  • RBAC — Role-Based Access Control
  • FTS — Full-Text Search (Postgres' built-in lexical search)
  • CDC — Change Data Capture (detecting row changes in a database)
  • float32 / int8 / binary — 32-bit float / 8-bit integer / 1-bit-per-dimension vector storage (smaller = cheaper, lossier)
  • MRL — Matryoshka Representation Learning (truncate a long vector to fewer dimensions with graceful, not cliff-edge, quality loss)
  • sha256 — Secure Hash Algorithm, 256-bit (used here to detect changed content)

Metrics & evaluation

  • recall@k — fraction of relevant items that appear in the top-k results (the retriever's ceiling)
  • nDCG — normalized Discounted Cumulative Gain (rank quality; rewards relevant items near the top)
    • DCG — Discounted Cumulative Gain; IDCG — Ideal DCG (best possible, used to normalize)
  • MRR — Mean Reciprocal Rank (1 / rank of the first relevant item, averaged over queries)
  • p50 / p95 / p99 — 50th / 95th / 99th percentile latency (p95 = 95% of requests are faster than this)
  • RAGAS — Retrieval-Augmented Generation Assessment (eval library for faithfulness, answer correctness, etc.)

Benchmarks

  • MTEB — Massive Text Embedding Benchmark
  • MMTEB — Massive Multilingual Text Embedding Benchmark

Licenses

  • MIT — MIT License (permissive; commercial use allowed)
  • Apache-2.0 — Apache License 2.0 (permissive + patent grant; commercial use allowed)
  • CC-BY-NC — Creative Commons Attribution-NonCommercial (no commercial use — a trap for products)

Tools & models named

  • pgvector — vector-search extension for PostgreSQL
  • Qdrant / Weaviate / Milvus / Pinecone — dedicated vector databases
  • BGE-M3 — BAAI General Embedding; M3 = Multi-lingual, Multi-functional (dense + sparse + ColBERT), Multi-granularity. BAAI = Beijing Academy of Artificial Intelligence
  • Qwen3-Embedding — Alibaba's open-weight embedding model family
  • OpenAI text-embedding-3-large / -small — OpenAI embedding models
  • Gemini Embedding — Google's embedding model
  • Cohere embed-v4 / Cohere Rerank 4 — Cohere's embedding / reranking models
  • Voyage rerank-2.5 (-lite) — Voyage AI's reranker (lite = faster, slightly lower quality)
  • Unitsms = milliseconds; tok = tokens; $/M = US dollars per million tokens