TL;DR
We ran the same 15 documents through four RAG configurations: same documents, same queries, same gold question set, completely different setup for each. Recall@K ranged from 64% to 92%. The biggest gap was not driven by the embedding model; it was driven by chunking strategy and retrieval mode. The cheapest model in the experiment finished second. And the biggest surprise was not which configuration won, but how wrong our assumptions were about what actually drives retrieval quality.
What Most Teams Do
Most teams pick a RAG setup the same way:
- choose an embedding model from benchmarks
- pick a chunking strategy based on instinct or convention
- maybe add hybrid search if they've heard it helps
- ship
If results are not great, they tweak one thing, re-embed, and test again. There is usually no systematic way to know what actually made things better or worse.
RAG configurations get decided by whoever set them up first and how much time they had. Not by data. We wanted to know: how much does it actually matter?
The Experiment
We took a small corpus (15 technical documents, mixed content types) and built a gold evaluation set by generating questions directly from the document content. Each question had a known source chunk as the expected answer.
Then we ran four distinct configurations against the same gold set. Everything except the strategy was held constant: same documents, same questions, same evaluation metrics.
| Strategy | Embedding | Chunking | Retrieval |
|---|---|---|---|
| A | text-embedding-3-small | Recursive (256 chars) | Vector |
| B | text-embedding-3-large | Semantic (512 chars) | Vector |
| C | text-embedding-3-large | Late chunking | Hybrid (α=0.65) + rerank |
| D | gte-large | Page-based | Hybrid + rerank |
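Concretely, the evaluation for each strategy reduces to a few lines. Assuming each gold question records the ID of its expected source chunk, Recall@K and MRR fall out of the ranked retrieval results (a minimal sketch; the function and data shapes here are illustrative, not RAG Lab internals):

```python
def evaluate(results, gold, k=5):
    """Compute Recall@K and MRR@K for one strategy.

    results: {question_id: [chunk_id, ...]} ranked retrieval output
    gold:    {question_id: chunk_id} expected source chunk per question
    """
    hits, rr_sum = 0, 0.0
    for qid, expected in gold.items():
        ranked = results.get(qid, [])[:k]
        if expected in ranked:
            hits += 1
            rr_sum += 1.0 / (ranked.index(expected) + 1)  # reciprocal rank
    n = len(gold)
    return {"recall_at_k": hits / n, "mrr": rr_sum / n}
```

Running the same `gold` dict against all four strategies is what makes the numbers below comparable.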
The Results
| Strategy | Recall@K | MRR | Avg Latency | Embed Cost |
|---|---|---|---|---|
| A: 3-small · recursive · vector | 64% | 0.57 | 28ms | $0.02/1M |
| B: 3-large · semantic · vector | 78% | 0.71 | 44ms | $0.13/1M |
| C: 3-large · late · hybrid + rerank | 92% | 0.89 | 118ms | $0.13/1M |
| D: gte-large · page · hybrid + rerank | 88% | 0.84 | 203ms | $0.01/1M |
Run on a 15-document technical corpus with 40 gold questions. Your numbers will vary. This is the point.
What Surprised Us
1. The gap was huge
We expected to see meaningful differences. We did not expect 28 percentage points between the worst and best configurations.
A 64% Recall@K means that on roughly 1 in 3 queries, the correct chunk did not appear in the top results at all. The LLM was working with the wrong context nearly a third of the time, and nothing in the logs would tell you.
At 92%, that drops to fewer than 1 in 12. That is not a marginal improvement. That is a different product.
2. Chunking mattered more than the model
This was the clearest unexpected result.
Going from Strategy A to B improved Recall@K by 14 points, but that combines two changes: upgrading the embedding model and switching chunking methods. Going from B to C, which held the embedding model constant while switching to late chunking and adding hybrid search with reranking, added another 14 points.
The intuition most people have is that the embedding model is the primary driver of retrieval quality. Our data suggests chunking strategy and retrieval mode together have an equal or larger effect. And chunking is free to change; the model swap costs 6.5x more per token.
The embedding model gets most of the attention because it's the visible decision. Chunking is treated like plumbing. Our data suggests the plumbing matters at least as much as the model.
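To make the chunking difference concrete, here is what a recursive character splitter like the one in Strategy A does: it tries separators in order of structural significance and falls back to finer ones only when a piece is still too large. This is a simplified sketch of the general technique, not the exact splitter we used:

```python
def recursive_chunk(text, max_len=256, seps=("\n\n", "\n", ". ", " ")):
    """Split at the most structural separator available, recursing
    only on pieces that still exceed max_len."""
    if len(text) <= max_len:
        return [text]
    for sep in seps:
        if sep in text:
            chunks = []
            for part in text.split(sep):
                chunks.extend(recursive_chunk(part, max_len, seps))
            return [c for c in chunks if c.strip()]
    # no separator left: hard cut at the character limit
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Note what this cannot do: a cross-reference split across a paragraph boundary lands in two chunks that each embed without the other's context. Late chunking, which embeds the full document before cutting it, is one way around exactly that failure mode.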
3. Hybrid search plus reranking was consistently better
Strategies A and B used pure vector search. Strategies C and D used hybrid search (blending dense vector similarity with TF-IDF keyword matching) plus a second-pass reranking step.
Both C and D outperformed both A and B by a significant margin, regardless of the embedding model used. The latency cost was real (118ms vs 44ms for the vector-only configs), but the accuracy gain was consistent.
This matters particularly for technical documentation and structured content, where queries often contain specific terminology that semantic similarity underweights. Keyword signal fills in where dense retrieval misses.
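The blending itself is just a weighted sum. Assuming both score sets are normalized to [0, 1], an α of 0.65 weights dense similarity at 65% and keyword overlap at 35%; the blended top candidates then go to the reranker. A sketch of the general technique (not our exact implementation):

```python
def hybrid_scores(dense, sparse, alpha=0.65):
    """Blend normalized dense (vector) and sparse (TF-IDF) scores per chunk.
    alpha=1.0 is pure vector search; alpha=0.0 is pure keyword search."""
    ids = set(dense) | set(sparse)
    return {
        cid: alpha * dense.get(cid, 0.0) + (1 - alpha) * sparse.get(cid, 0.0)
        for cid in ids
    }

def rerank_candidates(scores, k=10):
    """Top-k blended results, handed to the second-pass reranker."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A chunk that scores zero on dense similarity can still surface through the sparse term, which is exactly how exact-terminology queries get rescued.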
4. The cheap model nearly won
Strategy D used gte-large, an open model at $0.01/1M tokens, 13x cheaper than the OpenAI large model in Strategy C. It finished 4 points behind on Recall@K and 5 points behind on MRR.
Whether that gap is worth 13x the embedding cost depends entirely on your use case and scale. But the fact that it is even close was not the result we expected.
Open models on third-party providers can have inconsistent availability. The gte-large result above reflects a working configuration, but we have seen this model return degraded embeddings under certain API conditions. Always validate open models against your actual data before committing to them in production.
What This Means
RAG performance is not something you can intuit. The configuration decisions teams make quickly (chunk size, chunking method, embedding model, retrieval mode) interact in ways that compound.
Most teams are making these decisions without measuring them on their own data. The gap between a guessed configuration and a measured one is typically 15 to 30 accuracy points on structured domains. At any production scale, that gap shows up as user-facing quality degradation: answers that are plausible but wrong, context that is adjacent but not quite right. Nothing in the logs tells you why.
A Better Workflow
- Take a small sample: 10 to 20 representative documents from your corpus
- Generate a gold question set: derive test questions directly from your document content
- Run multiple strategies in parallel: different models, chunking methods, retrieval modes against the same eval set
- Compare retrieval quality directly: Recall@K, MRR, latency, cost across strategies
- Pick the winner: commit to the configuration with the best tradeoff for your use case
- Embed at scale, once, with confidence
This takes minutes, not days. And it tells you something the MTEB leaderboard cannot: how a given configuration performs on your documents, against your queries.
MTEB tells you how models perform on benchmark datasets. Your users are not asking MTEB questions. The only meaningful signal is a controlled experiment on your own data.
Upload your documents, pick strategies, and see your own Recall@K and MRR numbers
Open RAG Lab

The Practical Tradeoffs
Every result above involves a real tradeoff. Here is how to read them.
Maximum accuracy, cost not a constraint
Strategy C (large model, late chunking, hybrid + rerank) performed best. Late chunking preserves context across chunk boundaries, which matters for structured documents with cross-references. The 118ms retrieval latency is acceptable for most applications.
Cost is a primary constraint
Strategy D (gte-large, page chunking, hybrid + rerank) delivered 88% accuracy at 13x lower embedding cost. Validate it on your data before committing, but the cost profile is hard to ignore at scale.
Latency is the constraint
Strategy B (large model, semantic chunking, vector-only) hit 78% accuracy at 44ms average latency. Pure vector search is fast. You give up accuracy, but it is a real tradeoff, not just a worse config.
The config you should avoid
Strategy A is the default that most teams land on. It is cheap, fast to set up, and fine for prototyping. It is also 28 points behind the best configuration on our data. Shipping it to production without measuring it is a coin flip.
Running This on Your Own Data
Upload your sample documents to RAG Lab. The system extracts text and auto-generates a gold question set using Claude against your actual content: up to 40 questions derived from your specific corpus, tagged by difficulty, with source chunks recorded as expected context. No manual labeling.
Select the strategies you want to compare, run them in parallel, and get Recall@K, MRR, latency, and cost back for each. Strategies that share an embedding model and chunking config reuse embeddings. Comparing retrieval modes on the same base setup does not double your costs.
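The reuse works because embeddings depend only on the embedding model and the chunking config; retrieval mode and reranking happen after embedding. A sketch of the caching idea (the key structure is an assumption, not RAG Lab's actual internals):

```python
_embedding_cache = {}

def embed_corpus(model, chunking, documents, embed_fn):
    """Strategies that share (model, chunking) share one embedding pass,
    even if their retrieval modes differ."""
    key = (model, chunking)
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(model, chunking, documents)
    return _embedding_cache[key]
```

Comparing vector-only against hybrid+rerank on the same base setup therefore costs one embedding run, not two.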
When you have found the configuration that fits your tradeoffs, save it. That strategy becomes available in the SDK:
```python
from decompressed_sdk import DecompressedClient

dc = DecompressedClient(api_key="dck_...")

# The strategy you measured and validated in the lab
result = dc.lab.embed(
    texts=production_chunks,
    preset_id="scholar"  # or your saved custom strategy ID
)
```

The Takeaway
RAG systems do not fail because the LLM is bad.
They fail because the retrieval setup is wrong, and nobody measured it before shipping.
The 28-point gap between our worst and best configuration existed on the same 15 documents, the same 40 questions, the same gold set. The difference was setup. A different chunk boundary, a hybrid weight, a reranking step. Nothing that would show up in a code review.
That gap is measurable. The experiment takes minutes. Most teams just never run it.
Upload a sample corpus to RAG Lab. It generates the gold questions, runs your chosen strategies in parallel, and returns Recall@K, MRR, latency, and cost before you commit to a production index.
Run your first experiment