TL;DR
Embedding your full corpus before testing your retrieval strategy is a waste of money and time. The iteration cost compounds fast: every chunk size change, model swap, or retrieval config tweak means re-embedding everything. The fix is sample-first RAG design. Upload 5 to 20 representative documents, auto-generate a gold question set, compare strategies with real metrics (Recall@K, MRR, latency, cost), then embed at scale only when you know what works.
You have 50,000 documents. You pick text-embedding-3-large because it benchmarks well on MTEB. You write a chunking function with 512-character windows and 50-character overlaps. You embed everything. It costs $47 and takes four hours.
You run your first real queries. Results are mediocre. You suspect the chunk size is wrong. Maybe 256 works better for your document structure. You re-embed. Another $47. Another four hours.
Then someone asks if you've tried hybrid search. You add BM25 on top of dense retrieval, but now you're not sure the embedding model is even right for hybrid. You test text-embedding-3-small: a different space, incomparable cosine distances, so you need a clean full re-embed. About $7 this time, but three more hours.
Six iterations later, you've spent roughly $200 and two full days on embedding experiments alone, before writing a single line of production retrieval code.
This is the default RAG workflow. Almost everyone follows it. And it is almost entirely waste.
The costs are not just the API bills. Every hour your engineers spend waiting on embedding jobs is an hour not spent on evaluation, query rewriting, or the actual retrieval logic that determines whether your product works. And the hidden cost is worse: every time you run queries against a suboptimal index, you make inferences about your architecture that won't hold once the config improves. You build on false baselines.
There's another dimension people rarely talk about: lock-in. Once you've committed 50,000 documents to a model's embedding space, migrating to a better model means re-embedding the entire corpus. Once your chunking pipeline is in production, changing it invalidates every stored vector. The cost of the initial decision compounds into every future decision.
The rule is simple: you should never embed your full dataset until you've tested on a sample. It sounds obvious stated plainly. But the tooling makes the expensive path the default, so almost no one follows it.
Why the Iteration Cost Compounds
The problem is not a single re-embedding run. It's the number of variables you're iterating over simultaneously and the way they interact.
A typical RAG configuration has at least four dimensions to tune:
| Variable | Common Options | Interaction Effects |
|---|---|---|
| Chunk size | 128, 256, 512, 1024+ chars | Affects semantic density, overlap behavior, token count |
| Chunking method | Fixed, recursive, semantic, page | Changes boundary placement, context continuity |
| Embedding model | 3-small, 3-large, open models | Different spaces, incomparable cosine distances |
| Retrieval mode | Vector, keyword, hybrid | Performance varies dramatically by domain and query type |
These are not additive: they interact. The optimal chunk size for semantic chunking is different from the optimal size for fixed-window chunking. Hybrid search with alpha=0.65 might outperform pure vector search on technical documentation but lose to pure vector on narrative content. Two options per dimension gives you 16 combinations. On 50,000 documents with text-embedding-3-large at $0.13/1M tokens, each full embedding run costs roughly $47, as in the opening scenario. Sixteen naive experiments run past $400 even if half of them use the cheaper model, plus the engineering time to set up each evaluation.
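The grid arithmetic is worth making concrete. A minimal sketch under stated assumptions: the per-token prices come from this article, and the ~7,000 tokens per document is a back-of-envelope figure chosen to match the ~$47-per-run cost of the opening scenario.

```python
from itertools import product

# Two options per dimension, as in the 16-combination example above
chunk_sizes = [256, 512]
chunk_methods = ["fixed", "semantic"]
model_prices = {"text-embedding-3-small": 0.02, "text-embedding-3-large": 0.13}  # $/1M tokens
retrieval_modes = ["vector", "hybrid"]

# Assumption: 50,000 documents at ~7,000 tokens each, matching the ~$47/run
# figure for text-embedding-3-large from the opening scenario
corpus_tokens = 50_000 * 7_000

total = 0.0
runs = list(product(chunk_sizes, chunk_methods, model_prices, retrieval_modes))
for _size, _method, model, _mode in runs:
    total += corpus_tokens / 1e6 * model_prices[model]

print(f"{len(runs)} naive re-embedding runs, about ${total:,.0f} in API spend")
```

Retrieval mode alone never forces a re-embed, which is exactly the kind of saving a sample-first workflow exploits; naive iteration re-runs everything anyway.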
Most teams do not run 16 systematic experiments. They pick one config, embed, test, feel uncertain, tweak one parameter, embed again, and repeat until something feels good enough. Feeling good enough is not a retrieval strategy.
What Makes RAG Hard to Tune
Here's the non-obvious part: small changes to your configuration do not produce proportional changes to retrieval quality. They produce step changes, and sometimes inversions.
Chunk Size Is Not a Smooth Dial
The common mental model: larger chunks equal more context and better answers; smaller chunks mean higher precision and better recall. Neither is reliably true.
A chunk needs to be semantically complete, meaning a reader could understand what it's about without the surrounding text. Too small and you lose that completeness, fragmenting context the embedding needs to produce a useful vector. Too large and you dilute the signal: the embedding tries to represent too much at once, so semantically distinct sections from different concepts end up closer together than they should be.
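A minimal fixed-window chunker, like the one in the opening scenario (512-character windows, 50-character overlap), makes the failure mode visible: the window boundary has no idea where sentences or sections end. Function name and defaults here are illustrative.

```python
def fixed_window_chunks(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows with overlap.

    Overlap softens boundary damage but cannot remove it: a window that
    cuts a sentence in half still embeds a fragment, and a window that
    spans two unrelated sections dilutes both.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than the window size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

Semantic and recursive chunkers exist precisely because this function is blind to document structure.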
The optimal size is document-structure dependent. Dense technical documentation with self-contained sections behaves differently from long-form narrative or Q&A threads. A 10% chunk size change can produce a 15 to 20 percentage point shift in retrieval accuracy on structured domains. That's not an edge case: it's why teams keep re-embedding.
Embedding Models Don't Live in the Same Space
Cosine similarity only means something within a single model's embedding space. You cannot compare a vector produced by text-embedding-3-small to one produced by text-embedding-3-large: they do not share a coordinate system. Running queries against a mixed-model index produces nonsense similarity scores, not a meaningful comparison.
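The incomparability is mechanical, not subtle. A quick sketch (the dimensions are the published default output sizes for the two OpenAI models; everything else is illustrative):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

small_vec = np.random.rand(1536)  # text-embedding-3-small output dimension
large_vec = np.random.rand(3072)  # text-embedding-3-large output dimension

# cosine(small_vec, large_vec) raises ValueError: the vectors don't even
# share a dimensionality. Truncating one to match the other doesn't help
# either: each model's axes mean different things, so cross-model scores
# would be noise, not similarity.
```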
The other dimension is cost vs. quality. text-embedding-3-small runs at $0.02/1M tokens. text-embedding-3-large runs at $0.13/1M tokens, 6.5x more expensive. Whether that quality difference justifies the cost is entirely domain-specific, and the only honest way to know is to measure it against your actual documents and queries.
Hybrid Search: How Much Does Keyword Weight Matter?
Pure vector search fails on keyword-heavy queries. If someone searches for “section 12.4.b of the compliance policy,” dense embeddings might retrieve semantically related sections while completely missing the one that literally matches the reference.
What's less discussed is that the hybrid weight (alpha, the blend between dense vector similarity and BM25 keyword matching) also requires tuning. An alpha=0.65 that works well on general-purpose knowledge bases might underperform on code repositories where exact function names matter more. An alpha=0.4 that handles technical documentation well might struggle with narrative content where keyword overlap is low.
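Weighted hybrid scoring is simple enough to sketch; the catch is that alpha is a real tuning parameter, not a constant. This sketch assumes both scores are already normalized to [0, 1] (raw BM25 scores are unbounded), and the example numbers are illustrative:

```python
def hybrid_score(dense: float, bm25: float, alpha: float = 0.65) -> float:
    """Blend normalized dense similarity with normalized BM25.

    alpha=1.0 is pure vector search; alpha=0.0 is pure keyword search.
    """
    return alpha * dense + (1 - alpha) * bm25

# "section 12.4.b of the compliance policy": the literal-match chunk has
# mediocre dense similarity but a strong keyword score.
literal = hybrid_score(dense=0.42, bm25=0.95, alpha=0.4)   # 0.738
related = hybrid_score(dense=0.71, bm25=0.10, alpha=0.4)   # 0.344
# At alpha=0.4 the literal match ranks first; at alpha=0.9 it would not.
```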
The interactions between model, chunk size, and retrieval mode are why the parameter space is large and why embedding everything before testing is so expensive. You're not iterating over one variable. You're iterating over a product space.
Sample-First RAG Design
The fix is not smarter iteration. It's a different sequence.
The core insight: you do not need the full corpus to evaluate a retrieval strategy. You need a representative sample and a labeled evaluation set. If your evaluation shows Strategy A scores 0.72 MRR and Strategy B scores 0.89 MRR on a 20-document sample with 40 test questions, you can be reasonably confident that difference will hold at scale. The relative ranking of strategies is usually stable even when absolute numbers shift.
The sample-first workflow:
- Upload 5 to 20 representative documents covering the range of content types, structures, and lengths in your corpus
- Generate a gold set automatically by deriving test questions directly from your documents
- Test multiple strategies simultaneously across different models, chunk sizes, and retrieval modes
- Measure with real metrics: Recall@K, MRR, retrieval latency, and embedding cost
- Pick the winner based on the tradeoff that fits your use case
- Embed at scale, once, with confidence
This is not a novel idea. It's how model evaluation works in standard ML practice. You do not train on all your data without a validation set. The same principle applies to RAG configuration: test before you invest.
“Subjective quality” is not a retrieval metric. If your evaluation process is “we ran some queries and the results looked reasonable,” you do not have a retrieval strategy. You have a guess. The gap between a guessed configuration and a measured one is typically 15 to 25 accuracy points on structured domains.
The Gold Set Problem
The hardest part of sample-first evaluation is building the evaluation set. Manually labeling questions and expected answers is time-consuming, and the queries you need are specific to your domain and your documents. You cannot import a generic benchmark dataset.
The solution is synthetic gold set generation.
Given a chunk of text from your documents, ask an LLM to generate realistic questions a user would ask if they were trying to retrieve that chunk. Use those questions as evaluation queries, with the source chunk as the expected context.
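A sketch of that generation step. The prompt wording and function names are illustrative, and the LLM call is injected as a plain prompt-in, text-out function so any model (the lab uses Claude 3 Haiku) can sit behind it:

```python
import json
from typing import Callable

PROMPT = """Given the passage below, write {n} realistic questions a user
would ask if they were trying to retrieve it. Return a JSON list of strings.

Passage:
{chunk}"""

def gold_pairs(chunk: str, llm: Callable[[str], str], n: int = 2) -> list[tuple[str, str]]:
    """Generate (question, expected_chunk) pairs for one chunk."""
    questions = json.loads(llm(PROMPT.format(n=n, chunk=chunk)))
    return [(q, chunk) for q in questions]
```

Run it over a sample of chunks and drop near-duplicate questions, and you have an evaluation set where every query carries its own ground truth, which is what makes retrieval scoring automatic.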
This is imperfect by design: the questions are generated from the documents themselves, so they will not cover questions the documents cannot answer. But for retrieval evaluation, that's exactly what you want. You are testing whether your pipeline can surface the right chunk given a realistic query, not whether the document contains the right information.
A well-constructed gold set for 20 documents gives you 30 to 40 test questions covering different difficulty levels: factual lookups, conceptual questions, and procedural queries. That's sufficient to differentiate retrieval strategies with statistical confidence.
If Strategy A scores 0.85 accuracy and Strategy B scores 0.68 accuracy across 40 questions, act on it. If the margin is 0.85 vs. 0.84, pick on secondary criteria like latency or cost, and move on.
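That decision rule can be written down directly; the margin threshold and field names here are illustrative:

```python
def pick_winner(results: list[dict], margin: float = 0.02) -> dict:
    """Pick by accuracy; within `margin` of the best, break ties on
    latency, then embedding cost."""
    best = max(r["accuracy"] for r in results)
    contenders = [r for r in results if best - r["accuracy"] <= margin]
    return min(contenders, key=lambda r: (r["latency_ms"], r["cost_per_1m"]))

results = [
    {"name": "A", "accuracy": 0.85, "latency_ms": 210, "cost_per_1m": 0.13},
    {"name": "B", "accuracy": 0.84, "latency_ms": 45, "cost_per_1m": 0.02},
]
print(pick_winner(results)["name"])  # B: within margin, so latency decides
```

Swap B's accuracy for 0.68 and A wins outright, exactly the two cases described above.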
How to Read the Metrics
If you're new to systematic RAG evaluation, the metrics can be confusing. Here's what each one actually tells you.
Accuracy (Recall@K)
For each gold question, did the expected source chunk appear in the top K retrieved results? Binary per question: the right chunk is either in the top 5 or it isn't. Averaged across your gold set, it tells you how often your pipeline surfaces the right content.
MRR (Mean Reciprocal Rank)
Similar to Recall@K, but rewards ranking the correct chunk higher. Rank 1 = 1.0, rank 2 = 0.5, rank 3 = 0.33, not found = 0. A strategy with 0.80 accuracy but 0.45 MRR is surfacing the right chunks but burying them below irrelevant results. That matters if your LLM only processes the top 2 to 3 retrieved chunks.
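Both metrics fall out of the same ranked result list; a minimal sketch matching the definitions above:

```python
def recall_at_k(ranked: list[str], expected: str, k: int = 5) -> float:
    """1.0 if the expected chunk is in the top k results, else 0.0."""
    return 1.0 if expected in ranked[:k] else 0.0

def reciprocal_rank(ranked: list[str], expected: str) -> float:
    """1/rank of the expected chunk (rank 1 -> 1.0, rank 2 -> 0.5); 0 if absent."""
    return 1.0 / (ranked.index(expected) + 1) if expected in ranked else 0.0

ranked = ["chunk_b", "chunk_a", "chunk_c"]
print(recall_at_k(ranked, "chunk_a"))      # 1.0 -- found, so recall is happy
print(reciprocal_rank(ranked, "chunk_a"))  # 0.5 -- but it's buried at rank 2
```

Averaging each across the gold set gives Recall@K and MRR; a gap between the two numbers is exactly the "right chunks, wrong order" failure mode.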
Latency and Cost
A strategy with 0.87 accuracy and 280ms average retrieval is not obviously better than one with 0.84 accuracy and 45ms retrieval for a real-time application. Similarly, the 6.5x price difference between text-embedding-3-small and text-embedding-3-large compounds at scale. These four metrics together tell you how to make the tradeoff. That is the point.
| Strategy | Config | Accuracy | MRR | Avg Latency | Embed Cost |
|---|---|---|---|---|---|
| Economy | 3-small · 256d · recursive · vector | 0.71 | 0.55 | 28ms | $0.02/1M |
| Balanced | 3-large · semantic · vector | 0.84 | 0.72 | 51ms | $0.13/1M |
| High Accuracy | 3-large · late · hybrid (α=0.65) + rerank | 0.91 | 0.83 | 210ms | $0.13/1M |
| Hybrid Search | gte-large · page · hybrid (α=0.4) + rerank | 0.87 | 0.78 | 195ms | $0.01/1M |
Illustrative comparison on a technical documentation corpus using the four built-in presets. Your numbers will vary: this is the point.
Putting It Into Practice with Decompressed RAG Lab
This is the workflow Decompressed RAG Lab is built around.
You upload your sample documents: PDFs, markdown files, code, plain text. The system extracts text, then auto-generates a gold set using Claude 3 Haiku against your actual chunks. Up to 20 questions derived from your specific content, tagged by difficulty (factual, conceptual, procedural), with the source chunk recorded as expected context. No manual labeling.
You select which strategies to compare. Supported embedding models include text-embedding-3-small and text-embedding-3-large (with Matryoshka dimension reduction), Perplexity's pplx-embed-v1-4b and pplx-embed-v1-0.6b, and open models including gte-large and e5-large-v2. Chunking methods span fixed, recursive, semantic, late chunking, and page-based segmentation. Retrieval modes cover vector, keyword (TF-IDF), and weighted hybrid.
The lab runs each strategy against your gold set and returns accuracy, MRR, latency, and cost scores. Strategies that share the same embedding model and chunking configuration reuse embeddings, so comparing hybrid vs. vector search on the same base config does not double your embedding spend.
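That reuse amounts to keying stored embeddings on the parts of the config that actually determine the vectors. A hedged sketch of the idea (not the lab's internals):

```python
import hashlib
import json

def embedding_cache_key(model: str, chunking: dict) -> str:
    """Key embeddings by what determines them: model + chunking config.

    Retrieval mode (vector, keyword, hybrid) and hybrid alpha are
    query-time choices; they never force a re-embed, so they stay out.
    """
    payload = json.dumps({"model": model, "chunking": chunking}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

vector_run = embedding_cache_key("text-embedding-3-large", {"method": "semantic"})
hybrid_run = embedding_cache_key("text-embedding-3-large", {"method": "semantic"})
assert vector_run == hybrid_run  # same base config: embeddings are reused
```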
When you've found a configuration that works, you save it. That saved strategy becomes available in the SDK, so you embed your production corpus with exactly the config you validated:
```python
from decompressed_sdk import DecompressedClient

dc = DecompressedClient(api_key="dck_...")

# Use a built-in preset by ID
result = dc.lab.embed(
    texts=production_chunks,
    preset_id="scholar",  # ghost · balanced · scholar · hybrid
)

# Or reference a strategy you saved in the lab by name
result = dc.lab.embed(
    texts=production_chunks,
    strategy="My Validated Strategy",
)
```

Built-in presets give you a tested starting point: Economy (recursive chunking, text-embedding-3-small at 256 dimensions, vector search), Balanced (semantic chunking, text-embedding-3-large, vector search), High Accuracy (late chunking, text-embedding-3-large, hybrid + rerank), and Hybrid Search (gte-large, page chunking, hybrid + rerank). But the point is to validate against your actual data before committing to any of them.
The Number That Should Bother You
Here's what makes RAG failure modes particularly insidious: they're quiet.
A poorly configured retrieval pipeline does not throw exceptions. It returns results. The results look plausible. Your LLM generates coherent answers from them. Your evaluations pass because your evaluations are often also querying the same suboptimal index with subjective human assessment. The quality degradation is invisible until you compare it against a properly tuned baseline.
The Recall@K gap between a naive default configuration and a measured, optimized one is typically 15 to 25 percentage points on structured domains. On a customer support product handling 10,000 daily queries, a 20-point accuracy gap means 2,000 queries per day surfacing the wrong context. At scale, that is not a benchmark metric. That is your product quality.
The embedding cost to test properly: a 15-document sample with 30 gold questions, five strategies compared, is under $0.05. The re-embedding cost after discovering your production config was wrong? The full corpus, again, from scratch.
The Punchline
RAG is not inherently expensive. The tooling just makes the costly path the obvious one: you have an embedding API and a vector database, so the natural thing is to fill both as fast as possible.
The real cost driver is embedding at scale before you have signal. Chunk size, model choice, and retrieval mode all interact in ways that only become visible when you measure them against your actual documents and your actual queries. There is no shortcut around that measurement, but there is a cheaper way to do it.
Test on a sample. Get the numbers. Then commit to the full corpus with a config you trust. That is the whole idea: sample, evaluate, commit, scale.
Upload 5 to 20 documents to RAG Lab. It generates the gold questions, runs your chosen strategies, and returns the numbers. The first experiment takes under five minutes and costs less than a cent to embed.
Run your first experiment