TL;DR
Pull 10 to 20 representative documents, auto-generate a gold question set, run text-embedding-3-small, text-embedding-3-large, gte-large, and a hybrid config in parallel, and compare Recall@K, MRR, latency, and cost. The whole process takes under an hour and costs less than $0.05. You end up with the right model for your data, not the one that benchmarks best on Wikipedia.
Most teams pick an embedding model by reading benchmark comparisons and making a judgment call. That works fine for a prototype. For production, it tends to cost you.
text-embedding-3-large ranks well on MTEB. Whether it works better than text-embedding-3-small on your support tickets, your legal documents, or your internal knowledge base is a separate question. The answer is often surprising, and it only takes one afternoon to know it with certainty.
Here's the exact process.
Without this process
- Embed full corpus per model: ~$47
- 3 models tested: ~$141
- Config changes, re-embeds: ~$60+
- Engineering time: 1 to 2 days
- $200+ before writing retrieval code
With this process
- 15 sample documents
- Auto-generated gold set
- 4 strategies compared in parallel
- First run: under an hour
- Under $0.05 to find the winner
Pick 10 to 20 documents. Not random. Representative.
If your corpus covers multiple content types (technical specs, user guides, legal terms, support threads), grab 4 to 6 documents from each type. If it's a uniform corpus (200 similar PDFs), 15 random ones work fine.
This is the only manual step. The goal is a small slice that looks like your real data. A sample that covers the range of your corpus will produce evaluation results that hold at scale. A biased sample will not.
Shorter documents are fine. You don't need full novels. A 5-page technical spec gives the chunker enough material to work with and keeps gold set generation fast.
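The sampling step above can be sketched in a few lines. This is a hypothetical helper, not part of any tool: it assumes your corpus is a list of `(doc_id, content_type)` pairs and falls back to a flat random sample when the corpus has only one content type.

```python
import random

def pick_sample(corpus, per_type=5, uniform_n=15, seed=7):
    """Stratified sample: a few documents per content type,
    or a flat random sample when the corpus is uniform.
    `corpus` is a list of (doc_id, content_type) pairs."""
    rng = random.Random(seed)
    types = {t for _, t in corpus}
    if len(types) == 1:
        # Uniform corpus: 15 random documents work fine.
        docs = [d for d, _ in corpus]
        return rng.sample(docs, min(uniform_n, len(docs)))
    sample = []
    for t in sorted(types):
        docs = [d for d, ct in corpus if ct == t]
        sample.extend(rng.sample(docs, min(per_type, len(docs))))
    return sample
```

The fixed seed just makes the sample reproducible so a second run evaluates against the same slice.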
A gold set is a list of test questions, each paired with the document chunk that should appear in the top retrieval results. It's the benchmark you need, built from your content, not sourced from a generic dataset.
Building one manually is slow and biased toward the queries you think of, not the ones your users actually ask. Automated generation takes about 60 seconds.
The process: your uploaded documents get chunked, each chunk goes to an LLM with a prompt asking it to write realistic questions a user would ask to find that content, and the source chunk is recorded as the expected result. For 15 documents you end up with 25 to 40 questions covering factual lookups, conceptual questions, and procedural queries.
25 to 40 questions is enough to differentiate strategies with real statistical signal. The accuracy gap between a good config and a bad one is usually 15 to 25 points. That shows up clearly at this scale.
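The generation loop is simple enough to sketch. This is an illustrative outline, not a specific tool's implementation: `ask_llm` is a placeholder for whatever callable sends a prompt to your LLM and returns a list of question strings, and the prompt wording is an assumption.

```python
def build_gold_set(chunks, ask_llm, questions_per_chunk=2):
    """For each chunk, ask an LLM for realistic user questions
    and record the source chunk as the expected retrieval result.
    `chunks` is a list of (chunk_id, text); `ask_llm` maps a
    prompt string to a list of question strings."""
    prompt_tmpl = (
        "Write {n} realistic questions a user would ask to find "
        "this content. One question per line.\n\n{chunk}"
    )
    gold = []
    for chunk_id, text in chunks:
        prompt = prompt_tmpl.format(n=questions_per_chunk, chunk=text)
        for question in ask_llm(prompt):
            gold.append({"question": question, "expected_chunk": chunk_id})
    return gold
```

With 15 documents chunked into a few pieces each and 2 questions per chunk, you land in the 25-to-40-question range the section describes.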
Select the strategies you want to compare. A strategy is an embedding model combined with a chunking method and a retrieval mode. Run at least four, matching the set from the TL;DR:

- text-embedding-3-small, vector search
- text-embedding-3-large, vector search
- gte-large, vector search
- a hybrid (vector + keyword) config
Each strategy runs your full gold set independently. Strategies that share the same model and chunking config reuse embeddings, so comparing vector vs. hybrid search on the same base doesn't cost you twice.
Total embedding cost for this run: under $0.05 for 15 documents.
Each strategy returns Recall@K, MRR, latency, and cost. Here's what actually matters.
| Metric | What it tells you | Act on it when |
|---|---|---|
| Recall@K | Did the right chunk appear in the top K? | Gap > 5 points between strategies |
| MRR | Is the right chunk ranked near the top? | High recall but low MRR means it's buried |
| Latency | Query time per request | Real-time product with < 200ms SLA |
| Cost / 1M | API cost per million tokens embedded | > 10M tokens/month where it compounds |
A 1 to 2 point difference between strategies is noise. A 10+ point gap is a real signal. If text-embedding-3-large scores 0.91 MRR and text-embedding-3-small scores 0.90, pick the cheaper one and move on. If the gap is 0.91 vs. 0.76, the larger model is worth the 6.5x price difference.
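Both ranking metrics are a few lines each, so it's worth seeing exactly what they measure. A minimal sketch, assuming each strategy's output is a dict mapping each gold question to a ranked list of chunk ids:

```python
def recall_at_k(results, gold, k=5):
    """Fraction of questions whose expected chunk appears in the
    top K results. results: {question: ranked chunk ids};
    gold: {question: expected chunk id}."""
    hits = sum(1 for q, expected in gold.items() if expected in results[q][:k])
    return hits / len(gold)

def mrr(results, gold):
    """Mean reciprocal rank: 1/rank of the expected chunk,
    averaged over questions (0 when it's missing entirely)."""
    total = 0.0
    for q, expected in gold.items():
        ranked = results[q]
        if expected in ranked:
            total += 1.0 / (ranked.index(expected) + 1)
    return total / len(gold)
```

High recall with low MRR is exactly the "it's buried" case from the table: the chunk is in the top K, but reciprocal rank punishes it for not being near position 1.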
Pick the strategy that wins on the metrics your use case weights most heavily. User-facing search: prioritize MRR and latency. Knowledge base chat: Recall@K. Cost-constrained at scale: the best model within your budget per query.
Then embed your full corpus. Once. With the configuration you validated on real data. Not the one that ranked best on a benchmark built from Wikipedia.
Every re-embed you skip by measuring first is real money and real time. A 100,000-document corpus at text-embedding-3-large costs $6.50 per run. Six iteration cycles without this process is $39 in API fees plus the engineering hours. That's the math on a small corpus. It scales from there.
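The $6.50 figure follows from token count times the published per-token price. The sketch below assumes an average of 500 tokens per document, which is the document length implied by the article's numbers, and text-embedding-3-large's $0.13 per 1M tokens:

```python
def embed_cost(n_docs, avg_tokens_per_doc, price_per_1m_tokens):
    """API cost of embedding a corpus once, in dollars."""
    return n_docs * avg_tokens_per_doc * price_per_1m_tokens / 1_000_000

# 100,000 documents at ~500 tokens each (assumed average),
# text-embedding-3-large at $0.13 per 1M tokens:
full_run = embed_cost(100_000, 500, 0.13)  # $6.50 per full re-embed
six_cycles = 6 * full_run                  # $39 across six iterations
```

Swap in your own document count and average length; the point is that each avoided re-embed is a line item you can compute in advance.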
Upload a sample, generate a gold set, compare models. Results in minutes.
What Good Results Look Like
After running your four strategies, you're looking for a clear winner, a trade-off decision, or confirmation that the cheap option is good enough.
A clear winner looks like: one strategy leads across all four metrics by a margin that matters. Pick it and ship.
A trade-off looks like: text-embedding-3-large scores 8 MRR points higher, but at 6.5x the cost. At your current query volume, that's an extra $800/month. Is an 8-point MRR improvement worth $800/month to your users? That's a product decision, not a technical one. You now have the data to make it.
Confirmation looks like: text-embedding-3-small scores within 2 points of text-embedding-3-large. Use the cheap one. You just saved yourself from a 6.5x cost premium that wasn't buying you anything.
The most common outcome: text-embedding-3-small is within 3 to 5 points of text-embedding-3-large on most general-purpose corpora, and the hybrid strategy outperforms pure vector search by more than the model difference. The retrieval mode matters more than the model in most cases.
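The three outcomes reduce to a simple decision rule. This toy function encodes the article's thresholds (a 1-to-2-point MRR gap is noise; anything larger is a trade-off to price against); the exact cutoff is an assumption you should tune:

```python
def interpret(mrr_cheap, mrr_expensive, cost_ratio=6.5, noise=0.02):
    """Toy decision rule: small gap -> take the cheap model;
    large gap -> a priced trade-off for the product team."""
    gap = mrr_expensive - mrr_cheap
    if gap <= noise:
        return "confirmation: use the cheap model"
    return f"trade-off: +{gap:.2f} MRR for {cost_ratio}x cost"
```

The 0.91-vs-0.90 example resolves to confirmation; 0.91-vs-0.76 resolves to a trade-off you can now put a dollar figure on.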
When to Repeat the Process
Run it again when: your corpus structure changes significantly, you add a new content type, you're considering a model upgrade, or your users start complaining about retrieval quality.
The second run is faster. Your sample process is already defined. Gold set generation takes 60 seconds. Model runs take a few minutes. Under an hour the first time, 30 minutes on subsequent runs.
The alternative is re-embedding your full corpus every time you're uncertain. That's the $39-to-$200 loop. The sample-first process breaks out of it permanently.
Upload your sample documents and RAG Lab handles the rest: gold set generation, parallel model runs, and side-by-side metric output. The whole process takes less time than reading another benchmark comparison that doesn't apply to your data.
Find your best model now