Representation of Meaning: Chunking + Embedding in RAG
One-line summary: To retrieve text by meaning instead of keywords, we cut documents into chunks and turn each chunk into a vector whose position was deliberately trained so that "close in space" means "close in meaning."
Why this exists / what it solves: Keyword search matches strings. Ask it for "car" and it misses "automobile." A retrieval system needs to find text that means the same thing even when it says it differently — and it needs to find the relevant span, not dump a whole document. Chunking + embedding is the machinery that makes both possible. Everything below is the reasoning that machinery rests on, rebuilt as the chain of questions that actually forces each piece into existence.
What were we doing before embeddings?
Lexical search — match the query's words against the document's words (BM25, TF-IDF). It's fast and exact.
What problem did that cause?
It matches vocabulary, not meaning. Synonyms, paraphrases, and rewordings slip through.
# The lexical failure: same meaning, zero shared keywords → no match
query = "how to fix a flat tyre"
doc = "repairing a punctured wheel"
# BM25 sees no overlapping content tokens (only the stop word "a") → score ≈ 0, even though they're the same request.
We need a representation where "flat tyre" and "punctured wheel" land near each other. That representation is an embedding.
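The lexical failure can be checked by hand. This is a minimal sketch, assuming a naive lowercase whitespace tokenizer and a tiny made-up stop-word list; `content_tokens` and `token_overlap` are illustrative helpers, not functions from any BM25 library.

```python
STOP_WORDS = {"a", "an", "the", "to", "how"}  # tiny illustrative list (assumption)

def content_tokens(text):
    """Lowercase, split on whitespace, drop stop words."""
    return {tok for tok in text.lower().split() if tok not in STOP_WORDS}

def token_overlap(query, doc):
    """Content tokens that the query and the document share."""
    return content_tokens(query) & content_tokens(doc)

query = "how to fix a flat tyre"
doc = "repairing a punctured wheel"
shared = token_overlap(query, doc)
print(shared)  # set() -- nothing for a lexical scorer to match on
```

With no shared content tokens, any scorer built on term overlap has nothing to reward, however the term weights are tuned.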
What exactly is "meaning" here?
Not meaning in the philosophical sense. Operationally, meaning = the distributional context a piece of text tends to appear in, as learned from the model's training data. That's the honest floor: the model knows nothing beyond the patterns it was trained on. We're not claiming to capture truth or intent — only learned context.
What is an embedding, concretely?
A fixed-length list of numbers (a vector) produced by a trained model. Same length for every input, regardless of input length.
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2") # outputs 384-dim vectors
v = model.encode("repairing a punctured wheel", normalize_embeddings=True)
print(v.shape) # (384,) -- a paragraph and a word both become one fixed-size vector
Why should nearness in this number-space equal nearness in meaning?
This is the load-bearing claim, and it's not because of the distance metric. Cosine/dot-product is just a ruler — laying a ruler over random numbers tells you nothing.
It's true because the model was trained with a contrastive objective: the loss function itself is geometric. It pulls the vectors of related text pairs together and pushes unrelated pairs apart. Proximity isn't a happy accident we measure after the fact — proximity is the exact thing training optimized for.
# The shape of what training does (conceptual):
# loss makes sim(anchor, positive) HIGH and sim(anchor, negative) LOW
loss = -log( exp(sim(anchor, positive)) /
(exp(sim(anchor, positive)) + sum(exp(sim(anchor, neg)) for neg in negatives)) )
So a random projection with the same 384 dimensions gives garbage: it has dimensions and you can run cosine on it, but no training shaped its geometry. The mechanism is the loss, not the metric. The metric only reads back the geometry the loss built.
a = model.encode("how to fix a flat tyre", normalize_embeddings=True)
b = model.encode("repairing a punctured wheel", normalize_embeddings=True)
c = model.encode("the stock market closed higher", normalize_embeddings=True)
print(util.cos_sim(a, b)) # high -- same meaning, different words
print(util.cos_sim(a, c)) # low -- unrelated
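To make the contrastive loss above concrete, here is a small numeric sketch using only the standard library. The similarity values are made up for illustration, and the `temperature` parameter is an assumption that does not appear in the conceptual formula (at its default of 1.0 the two agree).

```python
import math

def info_nce(sim_pos, sim_negs, temperature=1.0):
    """Contrastive loss for one anchor: -log softmax of the positive's similarity."""
    pos = math.exp(sim_pos / temperature)
    negs = sum(math.exp(s / temperature) for s in sim_negs)
    return -math.log(pos / (pos + negs))

# Well-shaped geometry: the positive is close, the negatives are far.
good = info_nce(sim_pos=0.9, sim_negs=[0.1, 0.0, -0.2])
# Badly shaped geometry: one negative is closer than the positive.
bad = info_nce(sim_pos=0.1, sim_negs=[0.9, 0.0, -0.2])
print(good < bad)  # True
```

The loss is lower when the positive is the nearest neighbour, so minimizing it is what moves related texts together in the space.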
Why not just embed the whole document as one vector?
Granularity. One embedded unit = one retrievable address. If the whole document is a single vector, the smallest thing you can retrieve is the whole document — all-or-nothing. You can't point the model at the one paragraph that answers the query.
# Whole doc = 1 vector -> retrieval returns the entire 40-page PDF or nothing.
# Chunks = N vectors -> retrieval returns the 1 paragraph that matched.
index_unit = "document" # coarse: you get everything or nothing
index_unit = "chunk" # addressable: you get the relevant span
This — not context-window size — is the real reason to chunk. Even with an infinite context window, you'd still chunk, because retrieval needs something smaller than the document to point at.
Then why not embed single sentences — even finer?
Too little context. A lone sentence vector can miss what the sentence refers to ("It doubled last year" — what doubled?). Single sentences are precise but semantically thin; whole docs are rich but unaddressable. A chunk is the deliberate middle: big enough to carry context, small enough to retrieve.
How do we chunk without destroying meaning?
The naive way splits blindly on a character count and slices sentences in half. The clean way respects structure (paragraph → line → sentence → word) and overlaps chunks so a thought straddling a boundary survives.
# ✗ Painful: fixed slice -- cuts words and sentences mid-thought
chunks = [text[i:i+500] for i in range(0, len(text), 500)]
# ✓ Clean: structure-aware, with overlap to preserve cross-boundary meaning
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=80, # carry context across cut points
separators=["\n\n", "\n", ". ", " ", ""], # try paragraphs first, chars last
)
chunks = splitter.split_text(text)
What chunk size is "right"?
There's no a-priori answer; it depends on the corpus. Dense technical text tends to want smaller chunks; narrative text tolerates larger ones. You don't reason it out; you measure it. Try 256 / 512 / 1024 tokens and compare.
Measure against what?
Retrieval quality metrics: recall@k (did the right chunk make the top-k?) and nDCG@k (is it ranked high, not just present?).
# Build an index for one chunking, score it, then rebuild and rescore for each chunk size.
import faiss, numpy as np
emb = model.encode(chunks, normalize_embeddings=True) # (N, 384), float32
index = faiss.IndexFlatIP(emb.shape[1]) # inner product on normalized vectors = cosine
index.add(np.asarray(emb, dtype="float32"))
q = model.encode(["how do I patch a tyre?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(q, dtype="float32"), k=5) # ids[0] = top-5 chunk positions
def recall_at_k(retrieved_ids, relevant_ids, k):
    # fraction of the relevant chunks that made the top-k
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)
# Sweep: pick the chunk_size that maximizes recall@k / nDCG@k on YOUR corpus.
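recall@k has a one-line definition, but nDCG@k is easier to trust once written out. The sketch below assumes binary relevance (a chunk is either relevant or not) and unique ids in the retrieved list; `dcg_at_k` and `ndcg_at_k` are illustrative names, not a library API.

```python
import math

def dcg_at_k(retrieved_ids, relevant_ids, k):
    """Discounted cumulative gain with binary relevance (1 if relevant, else 0)."""
    relevant = set(relevant_ids)
    return sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved_ids[:k], start=1)
        if doc_id in relevant
    )

def ndcg_at_k(retrieved_ids, relevant_ids, k):
    """DCG divided by the best DCG achievable for this query."""
    ideal_hits = min(len(set(relevant_ids)), k)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg_at_k(retrieved_ids, relevant_ids, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([7, 3, 9], relevant_ids=[7], k=3))  # 1.0 -- right chunk ranked first
print(ndcg_at_k([3, 9, 7], relevant_ids=[7], k=3))  # 0.5 -- right chunk ranked third
```

Both rankings have recall@3 = 1.0. Only nDCG separates them, which is why the two metrics are reported together.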
But those metrics need a "correct" answer. Where does that come from?
From human relevance judgments (qrels): a person decides "this chunk answers this query." That's the bedrock — and it bottoms out at a definition, not a deeper fact. In information retrieval, "relevant" is defined by human information need. There is no ground beneath it to dig toward.
# qrels: the human-defined ground truth every metric is scored against
qrels = {
"how do I patch a tyre?": ["chunk_0042"], # a human marked this as the answer
}
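With qrels in hand, the chunk-size sweep reduces to a short loop. This is a sketch under loud assumptions: the `runs` dictionary is made-up retrieval output, not a measurement, and it pretends the judged chunk id stays valid across chunk sizes. In practice, re-chunking changes chunk ids, so judgments usually need to be mapped onto each chunking.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def mean_recall(run, qrels, k):
    """Average recall@k over every query that has a human judgment."""
    scores = [recall_at_k(run.get(q, []), rel, k) for q, rel in qrels.items()]
    return sum(scores) / len(scores)

qrels = {"how do I patch a tyre?": ["chunk_0042"]}

# Hypothetical retrieval output per chunk size: query -> ranked chunk ids.
runs = {
    256: {"how do I patch a tyre?": ["chunk_0042", "chunk_0007"]},
    1024: {"how do I patch a tyre?": ["chunk_0311", "chunk_0007"]},
}

results = {size: mean_recall(run, qrels, k=2) for size, run in runs.items()}
best_size = max(results, key=results.get)
print(results, best_size)  # {256: 1.0, 1024: 0.0} 256
```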
Can we ever be certain the right chunk was retrieved?
No — and we don't pretend to. That uncertainty is exactly why we retrieve top-k instead of top-1: give the generator several candidates and let it use the ones that help. Top-k is the hedge against the fact that vector proximity is a strong proxy for meaning, never a proof of it.
Breadth pass: the menus hidden behind "it depends"
The chain above contracts toward bedrock. But two branches closed on "it depends — tune it" (chunk size; model choice), and an "it depends" is always hiding a menu. This section expands those menus. Use the chain to understand why; use this to decide which.
Chunking strategies — which, when
Default first: recursive splitting at ~512 tokens with 10–20% overlap is the right starting point for most systems: fast, cheap, and at or near the top on end-to-end accuracy in recent benchmarks. Everything else is an upgrade bought against a specific bottleneck your metrics revealed.
| Strategy | What it does | Reach for it when | Cost |
|---|---|---|---|
| Fixed-size | Slices every N tokens, blind to structure | Never for real RAG — baseline only | ~free |
| Recursive | Splits on paragraph→sentence→word hierarchy | Default. Mixed docs, most corpora | low |
| Sentence-based | One/few sentences per chunk | Short Q&A, FAQ; cheap precision | low |
| Document-structure | Splits on markdown headers / pages | Strong native boundaries; paginated PDFs | low |
| Semantic | Embeds sentences, cuts where meaning shifts | Retrieval precision is the bottleneck | ~14× slower |
| Parent-document (small→large) | Retrieve small chunks, return their larger parent | Match is right but LLM needs more surrounding context | low–med |
| LLM-based (LumberChunker) | An LLM picks boundaries | Subtle boundaries in narrative; cost no object | high |
| Late chunking | Embed the whole doc first, then split token embeddings | Long docs with cross-references (entity named once, used throughout) | embed-only |
| Contextual retrieval (Anthropic) | Prepend an LLM-written context blurb to each chunk before embedding | Chunks lose meaning out of context; LLM cost acceptable | high |
Three triggers worth memorizing:
- Precision is the bottleneck → semantic chunking (wins point-accuracy, but ~14× slower to index).
- Cross-references break (pronouns/entities pointing across boundaries) → late chunking (embed-only, cheaper) or contextual retrieval (LLM cost, often higher quality).
- Retrieval is right but the LLM lacks context → parent-document: match on small chunks, hand back the big parent.
One overturned rule: overlap is no longer universal — it still helps dense retrieval but gives no measurable benefit with sparse (SPLADE) retrieval. Test it; don't assume it.
# Default
from langchain_text_splitters import RecursiveCharacterTextSplitter
# chunk_size counts characters by default; use from_tiktoken_encoder(...) to count tokens
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=80)
# Semantic — cut where meaning shifts, not where a counter hits 512
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings
semantic = SemanticChunker(HuggingFaceEmbeddings(model_name="all-mpnet-base-v2"))
# Contextual retrieval — make each chunk self-contained BEFORE embedding
context = llm(f"One sentence situating this chunk in the document:\n{doc}\n{chunk}") # llm: placeholder for your text-generation call
embedding_input = context + "\n" + chunk # embed THIS, not the bare chunk
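The parent-document (small→large) strategy from the table has no snippet above, so here is a minimal sketch of its bookkeeping. It assumes children are cut by word count and that some vector search has already returned ranked child ids; `build_parent_index` and `parents_for_hits` are hypothetical helper names, not a framework API.

```python
def build_parent_index(parents, child_size):
    """Split each parent into small children and remember where each came from."""
    children, child_to_parent = [], []
    for parent_id, parent_text in enumerate(parents):
        words = parent_text.split()
        for start in range(0, len(words), child_size):
            children.append(" ".join(words[start:start + child_size]))
            child_to_parent.append(parent_id)
    return children, child_to_parent

def parents_for_hits(hit_child_ids, child_to_parent, parents):
    """Map matched children back to their parents, de-duplicated, rank order kept."""
    seen, out = set(), []
    for child_id in hit_child_ids:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:
            seen.add(parent_id)
            out.append(parents[parent_id])
    return out

parents = ["alpha beta gamma delta", "epsilon zeta eta theta"]
children, child_to_parent = build_parent_index(parents, child_size=2)
# children == ["alpha beta", "gamma delta", "epsilon zeta", "eta theta"]
# Suppose vector search over the children returned ids [3, 2, 0]:
print(parents_for_hits([3, 2, 0], child_to_parent, parents))
```

Only the small children are embedded and searched. The parent text is what gets handed to the generator, so two hits in the same parent cost one context slot instead of two.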
Embedding — the choices the chain compressed
"An embedding is a vector from a trained model" is the bedrock. In practice that one line hides several decisions that change retrieval quality more than chunk size does.
1. Which model? Shortlist from the MTEB leaderboard's retrieval sub-task, then test on your own corpus — the leaderboard is a filter, not a verdict. Rough 2026 landscape:
| Model | Why pick it |
|---|---|
| Qwen3-Embedding-8B | MTEB leader (~70.6); best if you can self-host |
| Gemini Embedding | Strongest API for retrieval quality (~68) |
| Voyage-3-large | Retrieval-focused API; beats OpenAI on retrieval |
| OpenAI text-embedding-3-large | Mid-pack quality (~64.6) but unmatched ecosystem/SDK support |
| BGE-M3 | Open-source all-rounder: dense + sparse + multi-vector in one model |
| Cohere Embed v4 | Multimodal (text + images), long context |
| Nomic Embed / all-MiniLM-L6-v2 | Lightweight; run on a laptop CPU / edge |
2. Dense vs sparse vs hybrid: the gotcha the chain hides. Dense embeddings blur exact tokens: part numbers, SKUs, error codes, rare proper nouns. Sparse retrieval (BM25 / SPLADE) nails exact terms but is weaker on paraphrase (BM25 misses it entirely; SPLADE's learned term expansion recovers some). The production answer is usually hybrid: run both, then fuse the two rankings with Reciprocal Rank Fusion.
# Dense alone fails on identifiers:
# query "error E-4021" ~ chunk "exception code E-4019" -> looks similar, WRONG
# Sparse alone fails on paraphrase:
# query "flat tyre" vs chunk "punctured wheel" -> zero overlap, WRONG
# Hybrid = dense (meaning) + sparse (exact terms), scores fused via RRF.
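Reciprocal Rank Fusion is small enough to write out. The sketch below assumes each retriever returns a ranked list of chunk ids and uses k=60, the constant conventionally used with RRF; the ids are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["c7", "c2", "c9"]    # hypothetical ids from the vector index
sparse_hits = ["c2", "c4", "c7"]   # hypothetical ids from BM25
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
# ['c2', 'c7', 'c4', 'c9']
```

RRF uses ranks only, never raw scores, so cosine similarities and BM25 scores need no calibration against each other before fusing.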
3. Reranking beats "a better model" more often than you'd think. A bi-encoder embeds query and chunk separately (fast, scales). A cross-encoder reads query+chunk together (slow, sharp). Retrieve top-50 with the bi-encoder, rerank to top-5 with a cross-encoder. If the right passage is in your top-50 but not your top-5, reranking helps more than swapping embedding models.
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")
pairs = [(query, c) for c in candidate_chunks] # the top-50 from vector search
ranked = sorted(zip(candidate_chunks, reranker.predict(pairs)),
key=lambda x: x[1], reverse=True)[:5]
4. Query/document asymmetry — a silent quality killer. Many models (E5, BGE, EmbeddingGemma) were trained with prompts and expect "query:" / "passage:" prefixes or a task instruction. Skip them and retrieval quietly degrades — no error, just worse results.
# Wrong: same raw text for both. Right: use the prefixes the model card prescribes.
# (Applies to E5-style models such as intfloat/e5-base-v2; all-MiniLM-L6-v2 uses no prefixes.)
q = model.encode("query: how do I patch a tyre?", normalize_embeddings=True)
d = model.encode("passage: repairing a punctured wheel", normalize_embeddings=True)
5. Dimensions are a direct cost multiplier — and now adjustable. Vector dimension multiplies storage, memory, and search latency. Matryoshka models let you truncate (e.g. 3072 → 256) for a ~12× storage cut at ~85% quality. Tune dimension like you tune chunk size.
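# Matryoshka: ask for fewer dims at encode time, no retraining.
# Needs an MRL-trained model; if your sentence-transformers release rejects truncate_dim here, pass it to the SentenceTransformer constructor instead.
emb = model.encode(chunks, truncate_dim=256, normalize_embeddings=True)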
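The cost claim is plain arithmetic, worked through in this sketch. It assumes a flat float32 index of one million vectors (an illustrative count) and ignores any index overhead; `truncate_and_renormalize` shows the step that manual truncation needs before cosine search.

```python
import math

def index_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw size of a flat float32 index (no compression, no graph overhead)."""
    return num_vectors * dim * bytes_per_value

def truncate_and_renormalize(vec, dim):
    """Keep the first `dim` values, then rescale to unit length for cosine."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

full = index_bytes(1_000_000, 3072)   # 12_288_000_000 bytes
small = index_bytes(1_000_000, 256)   # 1_024_000_000 bytes
print(full / small)                   # 12.0

v = truncate_and_renormalize([3.0, 4.0, 12.0], dim=2)
print(v)  # [0.6, 0.8]
```

Truncating a unit vector leaves it shorter than unit length, so it has to be renormalized before an inner-product index will behave like cosine similarity again.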
6. Mind the model's max sequence length. Every embedding model has a token limit; a chunk longer than it is silently truncated — the tail never gets embedded. Your chunk size must fit inside the model's window, or retrieval misses the cut-off content.
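A guard against silent truncation can run before indexing. The sketch below takes the token counter as a parameter; `whitespace_count` is a crude stand-in (an assumption) that undercounts relative to real subword tokenizers, so in practice the model's own tokenizer should be plugged in. The 256 limit is only an example value.

```python
def find_overlong_chunks(chunks, max_tokens, count_tokens):
    """Return (index, token_count) for every chunk that exceeds the model limit."""
    overlong = []
    for i, chunk in enumerate(chunks):
        n = count_tokens(chunk)
        if n > max_tokens:
            overlong.append((i, n))
    return overlong

def whitespace_count(text):
    """Crude stand-in for a tokenizer; real subword counts are usually higher."""
    return len(text.split())

chunks = ["short chunk", "word " * 300]
print(find_overlong_chunks(chunks, max_tokens=256, count_tokens=whitespace_count))
# [(1, 300)]
```

With sentence-transformers, the limit is exposed as `model.max_seq_length`, which can be passed as `max_tokens` here.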
The first principles it rests on
Verify these on revisit — everything above is built from them:
1. Meaning is operationally defined, not metaphysical. In this system, "meaning" = the distributional context learned from training data. The model knows nothing else. (definition)
2. Vector proximity tracks meaning because training made it so. Embeddings are produced by a contrastive objective that pulls related pairs together and pushes unrelated pairs apart. The geometry is the trained objective; the distance metric only reads it. (mechanism / definition of the loss)
3. One embedded unit = one retrievable address. Chunking is forced by retrieval granularity, not by context-window size. (logical consequence of how a vector index works)
4. A chunk is a deliberate context-vs-addressability tradeoff. Sentences are precise but thin; documents are rich but unaddressable; chunks sit in between. (consequence of #3)
5. Chunk size has no a-priori answer. It's tuned empirically per corpus against recall@k / nDCG@k. (convention grounded in measurement)
6. Retrieval metrics rest on human relevance judgments. "Relevant" is defined by human information need; there is no deeper ground. (definition)
7. Retrieval is probabilistic, never certain. Top-k exists because vector proximity is a strong proxy for meaning, not a guarantee. (direct consequence of #1 and #2)
8. Dense embeddings blur exact tokens. They encode meaning, so identifiers, codes, and rare terms need sparse retrieval alongside, hence hybrid search. (consequence of #1: meaning ≠ surface form)
9. Reranking and retrieval are different operations. A bi-encoder scores query and chunk apart (scalable); a cross-encoder scores them together (sharper). Fixing rank order ≠ fixing recall. (definition)
10. Vector dimension is a direct cost multiplier. Storage, memory, and latency scale with it; Matryoshka models let you trade dimension for quality deliberately. (arithmetic / definition of ANN cost)
11. An embedding never sees text beyond the model's max sequence length. Over-long chunks are silently truncated. (definition of the model's input limit)
Principles 1–7 are the depth chain's bedrock; 8–11 are what the breadth pass rests on. The same rule governs both: meaning is learned and geometric, surface form is not — every practical choice falls out of that split.
Glossary
| Acronym | Full form |
|---|---|
| ANN | Approximate Nearest Neighbor |
| API | Application Programming Interface |
| BM25 | Best Matching 25 |
| CPU | Central Processing Unit |
| FAQ | Frequently Asked Questions |
| IP | Inner Product (as in IndexFlatIP) |
| LLM | Large Language Model |
| MRL | Matryoshka Representation Learning |
| MTEB | Massive Text Embedding Benchmark |
| nDCG | normalized Discounted Cumulative Gain |
| PDF | Portable Document Format |
| RAG | Retrieval-Augmented Generation |
| RRF | Reciprocal Rank Fusion |
| SDK | Software Development Kit |
| SKU | Stock Keeping Unit |
| SPLADE | Sparse Lexical and Expansion model |
| TF-IDF | Term Frequency–Inverse Document Frequency |