TL;DR
We ran the same 15 documents through four RAG configurations: same documents, same queries, same gold question set, completely different setup for each. Recall@K ranged from 64% to 92%. The biggest gap was not driven by the embedding model; it was driven by chunking strategy and retrieval mode. The cheapest model in the experiment finished second. And the biggest surprise was not which configuration won, but how wrong our assumptions were about what actually drives retrieval quality.
What Most Teams Do
Most teams pick a RAG setup the same way:
- choose an embedding model from benchmarks
- pick a chunking strategy based on instinct or convention
- maybe add hybrid search if they've heard it helps
- ship
If results are not great, they tweak one thing, re-embed, and test again. There is usually no systematic way to know what actually made things better or worse.
RAG configurations get decided by whoever set them up first and how much time they had. Not by data. We wanted to know: how much does it actually matter?
The Experiment
We took a small corpus (15 technical documents, mixed content types) and built a gold evaluation set by generating questions directly from the document content. Each question had a known source chunk as the expected answer.
Then we ran four distinct configurations against the same gold set. Everything except the strategy was held constant: same documents, same questions, same evaluation metrics.
| Strategy | Embedding | Chunking | Retrieval |
|---|---|---|---|
| A | text-embedding-3-small | Recursive (256 chars) | Vector |
| B | text-embedding-3-large | Semantic (512 chars) | Vector |
| C | text-embedding-3-large | Late chunking | Hybrid (α=0.65) + rerank |
| D | gte-large | Page-based | Hybrid + rerank |
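Concretely, the evaluation for each strategy reduces to a few lines. Assuming each gold question records the ID of its expected source chunk, Recall@K and MRR fall out of the ranked retrieval results (a minimal sketch; the function and data shapes here are illustrative, not RAG Lab internals):

```python
def evaluate(results, gold, k=5):
    """Compute Recall@K and MRR@K for one strategy.

    results: {question_id: [chunk_id, ...]} ranked retrieval output
    gold:    {question_id: chunk_id} expected source chunk per question
    """
    hits, rr_sum = 0, 0.0
    for qid, expected in gold.items():
        ranked = results.get(qid, [])[:k]
        if expected in ranked:
            hits += 1
            rr_sum += 1.0 / (ranked.index(expected) + 1)  # reciprocal rank
    n = len(gold)
    return {"recall_at_k": hits / n, "mrr": rr_sum / n}
```

Running the same `gold` dict against all four strategies is what makes the numbers below comparable.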
The Results
| Strategy | Recall@K | MRR | Avg Latency | Embed Cost |
|---|---|---|---|---|
| A: 3-small · recursive · vector | 64% | 0.57 | 28ms | $0.02/1M |
| B: 3-large · semantic · vector | 78% | 0.71 | 44ms | $0.13/1M |
| C: 3-large · late · hybrid + rerank | 92% | 0.89 | 118ms | $0.13/1M |
| D: gte-large · page · hybrid + rerank | 88% | 0.84 | 203ms | $0.01/1M |
Run on a 15-document technical corpus with 40 gold questions. Your numbers will vary. This is the point.
What Surprised Us
1. The gap was huge
We expected to see meaningful differences. We did not expect 28 percentage points between the worst and best configurations.
A 64% Recall@K means that on roughly 1 in 3 queries, the correct chunk did not appear in the top results at all. The LLM was working with the wrong context nearly a third of the time, and nothing in the logs would tell you.
At 92%, that drops to fewer than 1 in 12. That is not a marginal improvement. That is a different product.
2. Chunking mattered more than the model
This was the clearest unexpected result.
Going from Strategy A to B improved Recall@K by 14 points, but that combines two changes: upgrading the embedding model and switching chunking methods. Going from B to C, which held the embedding model constant while switching to late chunking and adding hybrid search with reranking, added another 14 points.
The intuition most people have is that the embedding model is the primary driver of retrieval quality. Our data suggests chunking strategy and retrieval mode together have an equal or larger effect. And chunking is free to change; the model swap costs 6.5x more per token.
The embedding model gets most of the attention because it's the visible decision. Chunking is treated like plumbing. Our data suggests the plumbing matters at least as much as the model.
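To make the chunking difference concrete, here is what a recursive character splitter like the one in Strategy A does: it tries separators in order of structural significance and falls back to finer ones only when a piece is still too large. This is a simplified sketch of the general technique, not the exact splitter we used:

```python
def recursive_chunk(text, max_len=256, seps=("\n\n", "\n", ". ", " ")):
    """Split at the most structural separator available, recursing
    only on pieces that still exceed max_len."""
    if len(text) <= max_len:
        return [text]
    for sep in seps:
        if sep in text:
            chunks = []
            for part in text.split(sep):
                chunks.extend(recursive_chunk(part, max_len, seps))
            return [c for c in chunks if c.strip()]
    # no separator left: hard cut at the character limit
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Note what this cannot do: a cross-reference split across a paragraph boundary lands in two chunks that each embed without the other's context. Late chunking, which embeds the full document before cutting it, is one way around exactly that failure mode.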
3. Hybrid search plus reranking was consistently better
Strategies A and B used pure vector search. Strategies C and D used hybrid search (blending dense vector similarity with TF-IDF keyword matching) plus a second-pass reranking step.
Both C and D outperformed both A and B by a significant margin, regardless of the embedding model used. The latency cost was real (118ms vs 44ms for the vector-only configs), but the accuracy gain was consistent.
This matters particularly for technical documentation and structured content, where queries often contain specific terminology that semantic similarity underweights. Keyword signal fills in where dense retrieval misses.
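The blending itself is just a weighted sum. Assuming both score sets are normalized to [0, 1], an α of 0.65 weights dense similarity at 65% and keyword overlap at 35%; the blended top candidates then go to the reranker. A sketch of the general technique (not our exact implementation):

```python
def hybrid_scores(dense, sparse, alpha=0.65):
    """Blend normalized dense (vector) and sparse (TF-IDF) scores per chunk.
    alpha=1.0 is pure vector search; alpha=0.0 is pure keyword search."""
    ids = set(dense) | set(sparse)
    return {
        cid: alpha * dense.get(cid, 0.0) + (1 - alpha) * sparse.get(cid, 0.0)
        for cid in ids
    }

def rerank_candidates(scores, k=10):
    """Top-k blended results, handed to the second-pass reranker."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A chunk that scores zero on dense similarity can still surface through the sparse term, which is exactly how exact-terminology queries get rescued.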
4. The cheap model nearly won
Strategy D used gte-large, an open model at $0.01/1M tokens, 13x cheaper than the OpenAI large model in Strategy C. It finished 4 points behind on Recall@K and 5 points behind on MRR.
Whether that gap is worth 13x the embedding cost depends entirely on your use case and scale. But the fact that it is even close was not the result we expected.
Open models on third-party providers can have inconsistent availability. The gte-large result above reflects a working configuration, but we have seen this model return degraded embeddings under certain API conditions. Always validate open models against your actual data before committing to them in production.
What This Means
RAG performance is not something you can intuit. The configuration decisions teams make quickly (chunk size, chunking method, embedding model, retrieval mode) interact in ways that compound.
Most teams are making these decisions without measuring them on their own data. The gap between a guessed configuration and a measured one is typically 15 to 30 accuracy points on structured domains. At any production scale, that gap shows up as user-facing quality degradation: answers that are plausible but wrong, context that is adjacent but not quite right. Nothing in the logs tells you why.
A Better Workflow
- Take a small sample: 10 to 20 representative documents from your corpus
- Generate a gold question set: derive test questions directly from your document content
- Run multiple strategies in parallel: different models, chunking methods, retrieval modes against the same eval set
- Compare retrieval quality directly: Recall@K, MRR, latency, cost across strategies
- Pick the winner: commit to the configuration with the best tradeoff for your use case
- Embed at scale, once, with confidence
This takes minutes, not days. And it tells you something the MTEB leaderboard cannot: how a given configuration performs on your documents, against your queries.
MTEB tells you how models perform on benchmark datasets. Your users are not asking MTEB questions. The only meaningful signal is a controlled experiment on your own data.
Upload your documents, pick strategies, and see your own Recall@K and MRR numbers
Open RAG Lab

The Practical Tradeoffs
Every result above involves a real tradeoff. Here is how to read them.
Maximum accuracy, cost not a constraint
Strategy C (large model, late chunking, hybrid + rerank) performed best. Late chunking preserves context across chunk boundaries, which matters for structured documents with cross-references. The 118ms retrieval latency is acceptable for most applications.
Cost is a primary constraint
Strategy D (gte-large, page chunking, hybrid + rerank) delivered 88% accuracy at 13x lower embedding cost. Validate it on your data before committing, but the cost profile is hard to ignore at scale.
Latency is the constraint
Strategy B (large model, semantic chunking, vector-only) hit 78% accuracy at 44ms average latency. Pure vector search is fast. You give up accuracy, but it is a real tradeoff, not just a worse config.
The config you should avoid
Strategy A is the default that most teams land on. It is cheap, fast to set up, and fine for prototyping. It is also 28 points behind the best configuration on our data. Shipping it to production without measuring it is a coin flip.
Running This on Your Own Data
Upload your sample documents to RAG Lab. The system extracts text and auto-generates a gold question set using Claude against your actual content: up to 40 questions derived from your specific corpus, tagged by difficulty, with source chunks recorded as expected context. No manual labeling.
Select the strategies you want to compare, run them in parallel, and get Recall@K, MRR, latency, and cost back for each. Strategies that share an embedding model and chunking config reuse embeddings. Comparing retrieval modes on the same base setup does not double your costs.
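The reuse works because embeddings depend only on the embedding model and the chunking config; retrieval mode and reranking happen after embedding. A sketch of the caching idea (the key structure is an assumption, not RAG Lab's actual internals):

```python
_embedding_cache = {}

def embed_corpus(model, chunking, documents, embed_fn):
    """Strategies that share (model, chunking) share one embedding pass,
    even if their retrieval modes differ."""
    key = (model, chunking)
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(model, chunking, documents)
    return _embedding_cache[key]
```

Comparing vector-only against hybrid+rerank on the same base setup therefore costs one embedding run, not two.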
When you have found the configuration that fits your tradeoffs, save it. That strategy becomes available in the SDK:
```python
from decompressed_sdk import DecompressedClient

dc = DecompressedClient(api_key="dck_...")

# The strategy you measured and validated in the lab
result = dc.lab.embed(
    texts=production_chunks,
    preset_id="scholar"  # or your saved custom strategy ID
)
```

The Takeaway
RAG systems do not fail because the LLM is bad.
They fail because the retrieval setup is wrong, and nobody measured it before shipping.
The 28-point gap between our worst and best configuration existed on the same 15 documents, the same 40 questions, the same gold set. The difference was setup. A different chunk boundary, a hybrid weight, a reranking step. Nothing that would show up in a code review.
That gap is measurable. The experiment takes minutes. Most teams just never run it.
Upload a sample corpus to RAG Lab. It generates the gold questions, runs your chosen strategies in parallel, and returns Recall@K, MRR, latency, and cost before you commit to a production index.
Run your first experiment