RAG Implementation
RAG Implementation Playbook — A Decision Path, Not a Tutorial
One line: Build retrieval as hybrid (dense + lexical) → fuse → rerank → generate, prove every choice on a 50–200 pair labeled set, and treat the latency budget as the constraint that vetoes half the "advanced" tricks.
What this build optimizes for: correctness you can measure, under a tight (<300 ms) latency budget, on a mixed corpus (structured PDFs + chat logs + DB rows), near-real-time freshness, API calls permitted with a DPA. Where your goals differ, each section has a decision table so you can re-pick.
Model and library names below are accurate as of June 2026 and move quarterly. Wrap every model behind one interface (see §3) so swapping is a config change, not a refactor.
The binding constraint: read this first
Most RAG tutorials ignore latency until production. With a <300 ms end-to-end budget, the math vetoes choices before you make them. A realistic hot-path breakdown:
| Stage | Cost | Notes |
|---|---|---|
| Embed the query (API) | 50–120 ms | One short query, but it's a network round-trip. The hidden tax. |
| Vector search (HNSW) | 5–25 ms | Scales with ef_search and corpus size |
| BM25 / lexical search | 5–20 ms | Run in parallel with vector |
| RRF fusion | <1 ms | Negligible |
| Rerank (cross-encoder, top-20) | 80–150 ms | Cohere Rerank 4 p50; fewer candidates = less time |
| Subtotal before generation | ~140–315 ms | Already at/over budget |
| LLM generation | 200 ms–2 s | The elephant; usually streamed |
Conclusion that shapes the whole build: you cannot have an API query-embedding and a cross-encoder rerank and the LLM all inside 300 ms. Something gives. The three escape hatches:
- Self-host query embedding (BGE-M3 on a warm GPU: ~5–15 ms vs 50–120 ms API). Biggest single win.
- Rerank fewer candidates (top-20, not top-100) or use a lite reranker (Voyage rerank-2.5-lite).
- Define the budget as retrieval-only (exclude streamed generation). Most teams quietly do this — be explicit about it.
If "<300 ms" includes generation, you're effectively choosing #1 + #2. The playbook assumes that.
§1 — Chunking: how do you cut documents into retrievable units?
Q: One strategy for everything? A: No — strategy follows source type. Fixed-size token splitting is the dumb baseline that forces you to paper over lost context with overlap. Match the cut to the structure.
| Source | Strategy | Size / overlap | Why |
|---|---|---|---|
| Structured PDF | Recursive/heading-aware + contextual retrieval | section or 512 tok / 0% | Respects boundaries; context blurb restores what splitting destroys |
| Chat logs | Window by turn or session | per-turn / 0% | Token-splitting severs question from answer |
| DB rows | One row = one chunk, templated to a sentence | 1 row / n/a | Rows are already atomic; never token-split them |
| Unstructured prose | Semantic (embedding-boundary) | ~512 tok / 0–10% | Splits at topic shifts, not arbitrary token counts |
Contextual Retrieval (Anthropic, the highest-leverage move for PDFs): prepend an LLM-generated 1–2 sentence "where this chunk sits" blurb to each chunk before embedding and BM25-indexing. Measured: contextual embeddings alone cut failed retrievals ~35%; with contextual BM25, ~49%; with reranking on top, ~67% (5.7% → 1.9% failure rate). One-time cost with prompt caching ≈ $1.02 per million document tokens.
# Painful way: fixed 512-token windows, 50-token overlap to "preserve context"
# -> duplicated tokens, duplicate hits, and a chunk that still says
# "revenue grew 3%" with no idea whose revenue or which quarter.
chunks = [text[i:i+512] for i in range(0, len(text), 512-50)]
# Clean way: structural split + a context blurb generated once per chunk.
context = llm(f"Give a 1-sentence locator for this chunk within the doc:\n{doc}\n\nChunk:\n{chunk}")
embeddable = f"{context}\n\n{chunk}" # embed AND bm25-index this enriched string
Overlap is a band-aid, not a default. It only helps fixed-size splitting. With heading-aware or contextual chunking, 0% overlap is correct — extra overlap just inflates your index and creates duplicate retrievals.
§2 — Chunk size: 256, 512, or 1024 tokens?
Q: Which size? A: EMPIRICAL — it trades precision vs context and only your eval set decides. Default 512 / 10% for prose, then sweep.
- Small (256): sharp embeddings, but a chunk may omit the context that answers the question. More chunks → bigger index, more embedding spend.
- Large (1024): carries context, but dilutes the embedding (one vector for many ideas) and triggers "lost in the middle" when stuffed into the LLM.
- Cost coupling: halving chunk size ≈ doubles vector count ≈ doubles index RAM and embed cost.
Sweep {256, 512, 1024} and read nDCG@10 and recall@5 off the harness in §11. Don't guess; the right size is corpus-specific.
§3 — Embedding model: API or self-host?
Q: Which model, given API is allowed? A: API is permitted (your fork A), so start with a strong API model for quality and zero ops. But your <300 ms budget makes query-side embedding latency a real cost — consider a small self-hosted model for queries even if you embed documents via API.
| Model | Type | Strengths | $/M tokens | Pick when |
|---|---|---|---|---|
OpenAI text-embedding-3-large |
API | easiest, stable, strong | ~$0.13 | Default first build |
| Gemini Embedding | API | tops MTEB-v2 English (~68.3), Matryoshka | ~$0.15 | Want top quality + dim truncation |
Cohere embed-v4 |
API | multilingual, 128K context, multimodal | ~$0.12 | Long docs / many languages |
OpenAI text-embedding-3-small |
API | best value | $0.02 | Cost-sensitive, decent quality |
| BGE-M3 | OSS | MIT; emits dense + sparse + ColBERT in one pass (free hybrid); fast self-host | $0 + GPU | Privacy, or low-latency query embedding |
| Qwen3-Embedding-8B | OSS | #1 MMTEB multilingual (~70.6), Apache-2.0, 32K ctx | $0 + GPU | Multilingual, self-host quality ceiling |
The lock-in to respect: changing embedding model means re-embedding the entire corpus. Abstract it from day one:
class Embedder: # swap implementations, never call sites
def embed_docs(self, texts: list[str]) -> list[list[float]]: ...
def embed_query(self, text: str) -> list[float]: ...
# Reference build: docs via OpenAI 3-large (quality), queries via local BGE-M3 (latency).
Pros/cons by axis: API = linear $/token, 50–200 ms/call, no ops, but data leaves your VPC and you re-pay on every re-embed. Self-host = amortized GPU cost, batchable, in-VPC, but you run the GPU and own the uptime.
§4 — Dimension & quantization: how big is each vector?
Q: Full dimension or truncated? Quantize? A: TRADEOFF you size with arithmetic, plus an EMPIRICAL recall check on quantization.
chunks ≈ (corpus_tokens / chunk_size) × (1 + overlap)
RAM_HNSW ≈ chunks × dim × 4 bytes × ~1.8 # float32 + graph overhead
# 1M chunks × 1536-d × 4B × 1.8 ≈ 11 GB; at 3072-d ≈ 22 GB
Two levers, each with a recall cost you must measure:
| Lever | Effect | Recall cost | Use when |
|---|---|---|---|
| Matryoshka truncation (3072 → 1024 / 768) | storage ↓ proportional | graceful, small | Model supports it (Gemini, Voyage 4, Cohere v4, OpenAI 3-*) |
| int8 quantization | ~4× smaller | usually <2 nDCG pts | >1M vectors |
| binary quantization | ~32× smaller | large; needs rerank-rescoring | huge corpora, RAM-bound |
Production bands: small <100K, mid 1–10M, large 50M+. Default: full float32 under ~500K chunks; 1024-d Matryoshka + int8 above ~1M. Then verify recall@10 didn't drop more than ~2 points (§13).
§5 — Index type: how is the search structured?
Q: Exact or approximate? A: SETTLED — HNSW for production. Keep a flat index on a sample to measure how much recall your approximation gives up.
| Index | Recall | Latency | Memory | Use when |
|---|---|---|---|---|
| Flat (brute force) | 100% | O(N), slow | low | <~100K vectors; ground truth for measuring ANN recall |
| HNSW (graph) | ~99% | ~4–25 ms | high (3–4× IVFFlat) | default for online RAG |
| IVF / IVFPQ | ~95–98% | fast | low (PQ compresses) | RAM-constrained |
| DiskANN | high | higher | disk-resident | billion-scale, small RAM |
HNSW knobs: M (graph degree, 16–64) and ef_construction (build quality, 100–400) set at build time; ef_search is your runtime recall/latency dial — raise for recall, lower for speed. Under <300 ms, ef_search is where you buy back milliseconds.
§6 — Vector store: where do the vectors live?
Q: Greenfield — which store? A: TRADEOFF of ops-simplicity vs latency-at-scale. With near-real-time freshness and tight latency, a dedicated engine earns its keep; pgvector wins on simplicity if you'll live in Postgres anyway.
| Store | Profile | Latency | Hybrid | Watch-out |
|---|---|---|---|---|
| pgvector | Postgres extension; free joins/txns/RBAC | p95 ~80–140 ms at 5M | manual (compose w/ Postgres FTS) | slows past ~10M vectors |
| Qdrant | Rust, self-host or cloud | ~4 ms p50 | native + quantization | smaller ecosystem |
| Weaviate | hybrid is first-class (one query, alpha knob) |
competitive | native + rerank modules | heavier than pgvector |
| Milvus | billion-scale, GPU, most index options | ~6 ms p50 | native | heaviest ops |
| Pinecone | fully managed, zero ops | <30 ms at scale | added | bill grows; data leaves VPC |
Decision rule: under 10M vectors and you want one system → pgvector. Tight latency + near-real-time + open to dedicated (your case) → Qdrant (4 ms, native hybrid, native quantization) is the strongest reference pick. Want zero ops and API is fine → Pinecone.
§7 — Retrieval mode: dense, lexical, or both?
Q: Dense-only or hybrid? A: SETTLED — hybrid (dense + BM25) fused with RRF, because of your corpus. DB rows carry IDs/codes and PDFs carry part numbers and citations; dense embeddings fumble exact tokens, BM25 nails them. This isn't fashion — it's your data.
# RRF: robust, tuning-free fusion of dense + lexical rankings.
def rrf(rankings, k=60): # k=60 is the standard constant
scores = {}
for ranked in rankings: # [dense_ids, bm25_ids]
for rank, doc_id in enumerate(ranked):
scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
return sorted(scores, key=scores.get, reverse=True)
| Mode | Catches | Misses | Use when |
|---|---|---|---|
| Dense-only | paraphrase, meaning | exact codes, SKUs, rare terms | corpus is pure prose |
| Lexical-only (BM25) | exact tokens | paraphrase, synonyms | keyword-heavy, no semantics |
| Hybrid + RRF | both | little | anything with identifiers — your build |
Tip: BGE-M3 (§3) emits dense and sparse vectors from one model, giving you hybrid without a second system.
§8 — Reranking: do you re-sort the candidates?
Q: Add a reranker? A: SETTLED yes — but the law of rerankers governs how: a reranker fixes ORDER, never RECALL. In benchmarks, no reranker pushed Hit@10 above ~88% because the missing 12% never entered the top-100 candidates. So reranking pays off only when recall is already high and precision is low — invest in the retriever first.
Pattern: retrieve top-20 (hybrid) → cross-encoder rerank → keep top-5 → LLM. Under <300 ms, rerank top-20, not top-100, and pick for speed.
| Reranker | License | Latency add | Pick when |
|---|---|---|---|
| Cohere Rerank 4 / 4 Pro | API | ~80–150 ms p50 | strongest managed default |
| Voyage rerank-2.5-lite | API | ~half of full | tight latency budget — your case |
| BGE-reranker-v2-m3 | Apache-2.0 | infra-amortized | self-host / privacy |
| Zerank-2 | CC-BY-NC | — | blocks commercial use — avoid in products |
# Retrieve wide-ish, rerank narrow, keep tight. Always wrap with a fallback.
candidates = hybrid_search(query, top_n=20)
try:
ranked = rerank(query, candidates, model="rerank-2.5-lite")[:5]
except (TimeoutError, APIError):
ranked = candidates[:5] # degrade to retrieval order, never 500
The eval triangle (score with reranker on vs off, same queries): nDCG@k after rerank (did precision rise?), recall@k preserved (rerank can't add recall — confirm it didn't reorder good chunks out of top-k), and p95/p99 latency (did the cost pay for itself?).
§9 — Query transformation: rewrite, HyDE, multi-query?
Q: Add query transforms? A: EMPIRICAL, default OFF — and under <300 ms, an extra LLM call before retrieval blows the budget outright. This is the field's #1 cargo-cult trap: bolting on HyDE and multi-query because "advanced RAG" diagrams show them, with zero measured lift, often hurting keyword queries.
| Transform | What | Cost | Worth it when |
|---|---|---|---|
| None | use the query as-is | 0 | well-formed queries (default) |
| History-aware rewrite | resolve "what about its price?" against prior turns | 1 LLM call | chat path — the one near-certain win |
| HyDE | embed a hypothetical answer | 1 LLM call | sparse corpora, exploratory queries; can hurt factual/keyword |
| Multi-query | fan out N variants, union | N embeds | recall-starved, latency-tolerant |
| Decomposition | split multi-hop into sub-questions | several calls | genuine multi-hop QA |
Reference decision: ship without transforms. Add only history-aware rewriting on the chat path, and accept it's incompatible with <300 ms unless you use a tiny fast rewrite model. Measure before adding anything else.
§10 — Freshness: how do new documents enter the index?
Q: Batch, incremental, or streaming? A: SETTLED for you — near-real-time (seconds) → incremental/streaming upsert, per source, because your sources change at different rates.
| Pattern | Staleness | Complexity | Use when |
|---|---|---|---|
| Batch rebuild | hours–days | low | static corpora, nightly is fine |
| Incremental upsert | seconds–minutes | medium | changed docs only, keyed on content hash / updated_at |
| Streaming | seconds | high | queue → embed → upsert pipeline |
# Re-embed only what changed; idempotent on a stable doc id + content hash.
h = sha256(chunk_text).hexdigest()
if store.get_hash(doc_id) != h:
store.upsert(doc_id, embed(chunk_text), meta={"hash": h, "ts": now()})
Per-source policy: PDFs rarely change (incremental on file hash); DB rows change often (CDC / updated_at); chat logs append (stream new turns).
§11 — Evaluation: the harness that makes every other choice checkable
Q: How do I know any of this works?
A: SETTLED — you can't, without a small labeled set. This is the keystone: every EMPIRICAL branch above is unresolvable until this exists. Build 50–200 (query → relevant chunk_ids) pairs.
Two ways to get the set:
- Real queries — collect 50 representative questions, run retrieval, hand-label which returned chunks are actually relevant.
- Synthetic — for each of 50 chunks, ask an LLM "what question does this chunk answer?"; the source chunk is the gold label. Fast, slightly optimistic, good enough to start.
# Minimal retrieval-eval harness. No framework needed.
import math
labeled = [
{"q": "What is the refund window?", "relevant": {"chunk_42"}},
# ... 50-200 of these ...
]
def recall_at_k(retrieved, relevant, k):
return len(set(retrieved[:k]) & relevant) / max(1, len(relevant))
def mrr(retrieved, relevant):
for i, c in enumerate(retrieved, 1):
if c in relevant:
return 1 / i
return 0.0
def ndcg_at_k(retrieved, relevant, k):
dcg = sum(1 / math.log2(i + 1) for i, c in enumerate(retrieved[:k], 1) if c in relevant)
idcg = sum(1 / math.log2(i + 1) for i in range(1, min(len(relevant), k) + 1))
return dcg / idcg if idcg else 0.0
def evaluate(pipeline, dataset, k=5):
R = N = M = 0
for ex in dataset:
got = pipeline(ex["q"]) # returns ranked chunk_ids
R += recall_at_k(got, ex["relevant"], k)
N += ndcg_at_k(got, ex["relevant"], k)
M += mrr(got, ex["relevant"])
n = len(dataset)
return {"recall@%d" % k: R/n, "nDCG@%d" % k: N/n, "MRR": M/n}
# Compare any two configs by swapping `pipeline`:
print(evaluate(hybrid_only, labeled))
print(evaluate(hybrid_plus_rerank, labeled)) # rerank should lift nDCG, hold recall
Metric meanings: recall@k = did the right chunk make the top-k (retriever's ceiling). nDCG@k = is it ranked high (what reranking improves). MRR = rank of the first hit (matters for chatbots that act on the top result). For generation quality, add faithfulness (answer grounded in retrieved context, no hallucination) and answer-correctness via RAGAS or an LLM-judge — slower, needs the full pipeline.
§12 — Reference architecture (your fork answers, wired together)
Stack: Qdrant (HNSW + native hybrid + int8 quant), OpenAI text-embedding-3-large for documents, BGE-M3 local for query embedding (latency), Voyage rerank-2.5-lite (fits the budget), RRF fusion, contextual retrieval on PDFs, incremental upsert on content hash.
§13 — What to measure on your own data (the EMPIRICAL experiments)
Each resolves a branch that no benchmark can decide for you. Run them against the §11 harness.
- Chunk size sweep — index at 256 / 512 / 1024; compare recall@5 and nDCG@10. Pick the knee, not the max.
- Contextual-retrieval ROI — index with and without the context blurb; if failed-retrieval rate doesn't drop meaningfully on your corpus, skip the LLM pass and its cost.
- Embedding bake-off —
text-embedding-3-largevs Gemini vs BGE-M3 on the same labeled set; quality gaps are corpus-specific. Don't trust the leaderboard over your data. - Quantization recall — measure recall@10 at float32 vs int8 vs binary; accept the smallest that stays within ~2 points.
ef_searchcurve — plot recall vs latency as you raiseef_search; pick the point that hits your 300 ms budget at acceptable recall.- Reranker on/off (eval triangle) — nDCG@5 lift, recall preservation, p95 latency. If nDCG doesn't move on your corpus, drop the reranker and reclaim ~100 ms.
- Query-rewrite lift (chat only) — recall@5 with vs without history-aware rewrite on real multi-turn queries.
§14 — Assumptions, and how each one breaks
Every recommendation above rests on an assumption. These are the ones that bite.
- "Hybrid always helps." Breaks if your corpus is pure prose with no identifiers — BM25 then adds latency for ~no recall gain. Detector: run dense-only vs hybrid on the labeled set; if recall@5 is within noise, drop BM25.
- "The reranker improves results." Breaks when retrieval recall is low — reranking can't surface what was never retrieved, and reorders a bad list into a differently bad list. Detector: if recall@20 is poor, fix the retriever before touching the reranker.
- "Contextual retrieval is worth it." Breaks on corpora where chunks are already self-contained (e.g., well-formed DB rows) — you pay the LLM pass for little gain. Detector: experiment #2.
- "<300 ms is achievable with all the bells on." Breaks the moment you add an API query-embed + cross-encoder + an LLM rewrite. Detector: the §0 budget table — sum your real p95s before committing.
- "512 tokens is a good default." Breaks for Q&A where answers span sections (too small) or for dense tables (too large). Detector: the chunk-size sweep.
- "int8 quantization is free." Breaks on corpora with many near-duplicate vectors, where small distance errors flip rankings. Detector: recall@10 delta after quantizing.
- "The synthetic eval set is representative." Breaks because LLM-generated questions are cleaner and easier than real user queries — your offline numbers will be optimistic. Detector: once you have real traffic, re-label 50 real queries and compare.
- "API embeddings stay cheap." Breaks at scale (cost is linear in tokens) and on every model upgrade (you re-embed the whole corpus). Detector: project monthly token volume × price before choosing API over self-host.
- "Model names in this doc are current." Breaks every quarter. Detector: re-check MTEB-retrieval and a neutral reranker leaderboard before each new build; keep the swap behind the §3 interface.
Build order: get hybrid + RRF + a labeled set working first (that's 80% of the quality). Add contextual retrieval and reranking second, measured. Add query transforms last, only if the eval says so.
§15 — Glossary (every short form used above)
Core & general
- RAG — Retrieval-Augmented Generation
- LLM — Large Language Model
- API — Application Programming Interface
- DPA — Data Processing Agreement (contract letting a vendor process your data)
- VPC — Virtual Private Cloud (your isolated cloud network)
- OSS — Open-Source Software
- PDF — Portable Document Format
- DB — Database
- ID — Identifier; SKU — Stock Keeping Unit (a product code)
Retrieval & ranking
- BM25 — Best Matching 25 (the Okapi BM25 lexical ranking function; scores keyword overlap)
- RRF — Reciprocal Rank Fusion (merges several ranked lists;
k=60is the standard smoothing constant) - ANN — Approximate Nearest Neighbor (fast, slightly-inexact vector search)
- HyDE — Hypothetical Document Embeddings (embed an LLM-generated hypothetical answer instead of the raw query)
- ColBERT — Contextualized Late Interaction over BERT (token-level "late interaction" matching)
- BERT — Bidirectional Encoder Representations from Transformers (the underlying encoder family)
- top-k / top-20 / top-5 — the k highest-scoring results kept at that stage
Index & storage
- HNSW — Hierarchical Navigable Small World (graph-based ANN index; the production default)
- M — graph degree (neighbors per node, build-time)
- ef_construction — build-time search width (sets index quality)
- ef_search — query-time search width (your runtime recall/latency dial)
- IVF — Inverted File (clustering-based ANN index); IVFFlat — IVF storing full vectors
- PQ — Product Quantization (compresses vectors); IVFPQ — IVF + Product Quantization
- DiskANN — Disk-based Approximate Nearest Neighbor (graph index kept on disk for huge corpora)
- RAM — Random-Access Memory; GPU — Graphics Processing Unit
- RBAC — Role-Based Access Control
- FTS — Full-Text Search (Postgres' built-in lexical search)
- CDC — Change Data Capture (detecting row changes in a database)
- float32 / int8 / binary — 32-bit float / 8-bit integer / 1-bit-per-dimension vector storage (smaller = cheaper, lossier)
- MRL — Matryoshka Representation Learning (truncate a long vector to fewer dimensions with graceful, not cliff-edge, quality loss)
- sha256 — Secure Hash Algorithm, 256-bit (used here to detect changed content)
Metrics & evaluation
- recall@k — fraction of relevant items that appear in the top-k results (the retriever's ceiling)
- nDCG — normalized Discounted Cumulative Gain (rank quality; rewards relevant items near the top)
- DCG — Discounted Cumulative Gain; IDCG — Ideal DCG (best possible, used to normalize)
- MRR — Mean Reciprocal Rank (1 / rank of the first relevant item, averaged over queries)
- p50 / p95 / p99 — 50th / 95th / 99th percentile latency (p95 = 95% of requests are faster than this)
- RAGAS — Retrieval-Augmented Generation Assessment (eval library for faithfulness, answer correctness, etc.)
Benchmarks
- MTEB — Massive Text Embedding Benchmark
- MMTEB — Massive Multilingual Text Embedding Benchmark
Licenses
- MIT — MIT License (permissive; commercial use allowed)
- Apache-2.0 — Apache License 2.0 (permissive + patent grant; commercial use allowed)
- CC-BY-NC — Creative Commons Attribution-NonCommercial (no commercial use — a trap for products)
Tools & models named
- pgvector — vector-search extension for PostgreSQL
- Qdrant / Weaviate / Milvus / Pinecone — dedicated vector databases
- BGE-M3 — BAAI General Embedding; M3 = Multi-lingual, Multi-functional (dense + sparse + ColBERT), Multi-granularity. BAAI = Beijing Academy of Artificial Intelligence
- Qwen3-Embedding — Alibaba's open-weight embedding model family
- OpenAI text-embedding-3-large / -small — OpenAI embedding models
- Gemini Embedding — Google's embedding model
- Cohere embed-v4 / Cohere Rerank 4 — Cohere's embedding / reranking models
- Voyage rerank-2.5 (-lite) — Voyage AI's reranker (lite = faster, slightly lower quality)
- Units — ms = milliseconds; tok = tokens; $/M = US dollars per million tokens