
MTEB Won't Tell You Which Embedding Model to Use

Leaderboard scores measure general performance on general data. Your corpus isn't general. Here's how to actually pick an embedding model: what the real variables are, when task type matters more than model choice, and how to measure it on your own documents.

8 min read · March 27, 2026 · Decompressed

TL;DR

MTEB scores don't predict retrieval quality on your corpus. The model choice is secondary to your task type, chunking config, and retrieval mode. Start with text-embedding-3-small, measure it against your actual documents and queries, and only pay for a larger or more specialized model if the numbers justify it.

text-embedding-3-large scores 64.6 on the MTEB leaderboard. text-embedding-3-small scores 62.3. That 2.3-point gap covers 56 tasks across Wikipedia articles, Common Crawl text, and MS MARCO passage retrieval.

Your corpus is legal contracts. Or technical support tickets. Or internal engineering documentation full of product-specific terminology that does not appear in any of those datasets.

The leaderboard does not know that. And the model you picked based on it might be the wrong one for your retrieval task. A model that ranks 4th on MTEB might rank 1st on dense technical documentation. The first-place model might underperform text-embedding-3-small on code retrieval. General benchmarks measure general performance. Your retrieval task is not general.

No public leaderboard can tell you which model works on your documents with your queries. That number can only come from your actual data.

What You Are Actually Choosing Between

Embedding model selection has four dimensions worth paying attention to. Leaderboard rank is not one of them.

Dimensionality

Most hosted models output vectors between 768 and 3072 dimensions. Higher dimensions can capture more semantic nuance, but the gains plateau quickly for standard text retrieval. Beyond a threshold, you are paying for storage cost and slower approximate nearest neighbor queries, not meaningfully better results.

Matryoshka representation learning changes this calculation. Models trained with MRL produce vectors you can truncate to a smaller size without proportional quality loss. Both text-embedding-3-small and text-embedding-3-large support MRL: you can cut text-embedding-3-small from 1536 to 512 dimensions and lose roughly 2 to 4 MRR points while saving two-thirds of your storage cost and speeding up ANN queries. Whether that tradeoff works for you depends on your accuracy requirements, which you need to measure.
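Truncation itself is mechanical: keep the leading dimensions and re-normalize so cosine comparisons stay valid. A minimal sketch with NumPy, using a random stand-in vector rather than a real API response (the OpenAI embeddings endpoint can also return shortened vectors directly via its `dimensions` parameter):

```python
import numpy as np

def truncate_embedding(vec, dims):
    """Truncate an MRL-trained embedding to `dims` and re-normalize.

    MRL training front-loads information into the leading dimensions,
    so keeping a prefix preserves most retrieval quality. Re-normalizing
    keeps cosine-similarity comparisons valid after truncation.
    """
    truncated = np.asarray(vec, dtype=np.float64)[:dims]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

# Cut a stand-in 1536-dim vector down to 512 dims.
full = np.random.default_rng(0).standard_normal(1536)
small = truncate_embedding(full, 512)
```

Naive truncation of a non-MRL model's vectors does not get this property; the quality loss there is much steeper, which is why the measurement step matters.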

Effective Context Window

The advertised context window is not the effective context window. text-embedding-3-small technically accepts 8191 tokens, but embedding quality degrades past 512 to 600 tokens per chunk. The model tries to represent too much at once, and the resulting vector loses the specificity that makes retrieval work.

If your chunking strategy produces 1024-token chunks, you are not getting 1024-token quality from the embedding. You are getting a diluted vector that retrieves less precisely. This is a chunking problem that switching to a larger model will not fix.
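One way to catch this before it costs you retrieval quality is to audit chunk sizes against the effective budget. A rough sketch below uses a words-to-tokens heuristic (the ~1.3 ratio is an English-text approximation and the 512 budget comes from the degradation point above; swap in a real tokenizer such as tiktoken for accurate counts):

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~1.3 tokens per whitespace-delimited word in English.
    # Use a real tokenizer (e.g. tiktoken) for production-grade accuracy.
    return int(len(text.split()) * 1.3)

def oversized_chunks(chunks, budget=512):
    """Return (index, approx_token_count) for chunks past the quality budget."""
    return [(i, approx_tokens(c)) for i, c in enumerate(chunks)
            if approx_tokens(c) > budget]

# A 100-word chunk fits; a 600-word chunk blows past the budget.
flagged = oversized_chunks(["word " * 100, "word " * 600])
```

If the audit flags a large fraction of your chunks, fix the chunker before spending anything on model comparisons.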

Cost

text-embedding-3-small costs $0.02 per million tokens. text-embedding-3-large costs $0.13 per million tokens: 6.5x more expensive. Open models like gte-large and e5-large cost $0 at the API level, but you take on infrastructure to run them. At 10 billion tokens per month, that gap is $200 vs. $1,300 vs. some fraction of a GPU instance.

This math only matters once you know which model actually performs better on your corpus. Making the cost decision before measuring performance is how teams end up paying 6.5x for a model that scores 1 MRR point higher than the cheap one.
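The arithmetic is simple enough to keep next to your measurements. A sketch using the per-million prices quoted above (the model names serve as dictionary keys; extend the table for any other model you test):

```python
PRICE_PER_MILLION = {
    "text-embedding-3-small": 0.02,  # dollars per 1M tokens
    "text-embedding-3-large": 0.13,
}

def monthly_cost(model: str, tokens_per_month: int) -> float:
    """Monthly embedding spend in dollars for a given token volume."""
    return tokens_per_month / 1_000_000 * PRICE_PER_MILLION[model]

# At 10 billion tokens/month: small ~= $200, large ~= $1,300.
small_cost = monthly_cost("text-embedding-3-small", 10_000_000_000)
large_cost = monthly_cost("text-embedding-3-large", 10_000_000_000)
```

Pair the dollar figures with the measured MRR delta, and the upgrade decision becomes a one-line comparison instead of a debate.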

Latency

Document embedding happens offline, so batch latency rarely matters. Query embedding is different: it runs on every request. For applications targeting sub-100ms total latency, the embedding API call is often the bottleneck. Self-hosted models add no network roundtrip. For high-throughput real-time applications, that matters more than the quality difference between small and large models.

The Models, Briefly

Here is a practical read on each major option, without the benchmark theater.

| Model | Cost / 1M tok | Dims | When to use it |
| --- | --- | --- | --- |
| text-embedding-3-small | $0.02 | 1536 (MRL) | Default. Fast, cheap, solid on general text. |
| text-embedding-3-large | $0.13 | 3072 (MRL) | Dense technical content where precision matters. |
| gte-large | $0 API | 1024 | Self-hosted at scale. Competitive quality, no per-token cost. |
| e5-large-v2 | $0 API | 1024 | Self-hosted. Strong on instruction-tuned retrieval. |
| pplx-embed models | varies | 2048+ | Long-context corpora where full-document embedding matters. |

Task Type Drives More Than Model Choice

This is the part that benchmark comparisons consistently miss. The embedding model you pick matters less than the retrieval task type, and confusing the two leads to expensive experiments that don't move the needle.

Semantic queries on general text

"Show me articles about reducing API latency." Dense vector retrieval works well here. The query and documents share semantic structure, and the embedding model is the primary quality driver. This is where model benchmarks are most predictive, and also where they're most likely to hold up on your data.

Keyword-heavy exact match

"Section 4.2(b) of the master services agreement." Your query embedding will be semantically similar to many contract sections. The literal string match is what you need. No embedding model fixes this. Hybrid search with BM25 keyword matching does. If this describes your queries, the model choice is nearly irrelevant compared to the retrieval mode.
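A common way to combine the two retrieval modes is reciprocal rank fusion, which merges a BM25 ranking and a dense ranking using only rank positions, so the scores never need to be calibrated against each other. A minimal sketch (the doc ids are hypothetical; `k=60` is the conventional smoothing constant):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids via RRF.

    Each ranking is a list of doc ids, best first. A doc earns
    1/(k + rank) per list it appears in, so anything ranked highly by
    either the keyword or the dense retriever rises in the fused list.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 nails the literal clause reference; dense search surfaces near-misses.
bm25_hits = ["sec-4.2b", "sec-7.3", "sec-4.1"]
dense_hits = ["sec-4.1", "sec-4.2b", "msa-intro"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

The exact-match hit tops the fused list because both retrievers rank it well, which is precisely the behavior a pure dense pipeline fails to deliver on these queries.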

Domain-specific terminology

Product-specific error codes, proprietary drug names, internal engineering jargon. These terms may not appear in any model's training distribution, which means the embedding cannot place them correctly in semantic space. General MTEB benchmarks won't predict domain-specific failure because they don't include your terminology. Measuring on your documents will.

Multi-lingual corpora

OpenAI models underperform specialized multi-lingual models on non-English text by significant margins. If your corpus includes French legal documents, Spanish customer support, or mixed-language technical content, e5-multilingual or Cohere's multilingual models are worth testing before defaulting to text-embedding-3-small.

Code retrieval

Function signatures, variable names, docstrings, and inline comments form a semantic structure that general-purpose models handle poorly. Models with training exposure to code repositories retrieve relevant functions more reliably than models trained on natural language. If code is a significant part of your corpus, test a code-specific model before committing to a general one.

Before testing models, classify your retrieval task. If it's keyword-heavy, add hybrid search first. If it's domain-specific, prepare for benchmark scores to not translate. If it's code, test a specialized model. The task type determines which experiments are worth running.

The Decision Process

Here is the shortest path to a defensible model choice.

Start with text-embedding-3-small. It is fast, cheap, and reliable on general English-language retrieval. For most teams, most of the time, it is the right answer. You do not need to justify picking the cheaper option when the quality is competitive.

Test text-embedding-3-large if precision matters and cost is secondary. Dense technical documentation, legal text, and highly specialized domains tend to show measurable quality differences between small and large models. Measure the difference on your data. If it's more than 3 to 4 MRR points, the 6.5x cost increase might be worth it. If it's 1 point, it isn't.

Consider open models when you're paying real money for API calls. At 50 billion tokens per month, text-embedding-3-large costs $6,500. gte-large costs the price of a GPU instance. The quality tradeoff is real but often smaller than the cost gap. Self-hosting also eliminates network latency for query-time embedding.

Use Matryoshka truncation when storage or ANN speed is a constraint. Reducing text-embedding-3-small from 1536 to 512 dimensions cuts storage by two-thirds and speeds up retrieval. Measure the accuracy penalty before assuming it's acceptable. On most corpora it's modest. On precision-demanding tasks it may not be.

The model selection that matters is the one that holds up on your corpus. Guidelines narrow the field. Your data decides. Any model choice made without measuring retrieval quality against your actual documents and queries is a guess, not a decision.

Compare models on your own documents

Select different embedding models and chunking strategies to see how they score on your corpus


Reading the Output

When you compare models, you get four numbers per strategy. Here is what each one tells you.

Recall@K answers the binary question: did the correct chunk appear in the top K results? Averaged across your gold question set, it tells you how often your pipeline surfaces the right content. A 10-point Recall@5 difference between models is meaningful. A 1-point difference is noise unless your gold set is large.

MRR (Mean Reciprocal Rank) rewards ranking the correct chunk higher. Rank 1 scores 1.0, rank 2 scores 0.5, rank 5 scores 0.2. A strategy with 0.80 Recall@5 but 0.45 MRR is surfacing the right chunk somewhere in the top 5, but burying it. That matters if your LLM only uses the top 2 to 3 retrieved chunks for generation.
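Both metrics take a few lines to compute once you have ranked results and a gold chunk id per question. A minimal sketch (the result lists and chunk ids are hypothetical):

```python
def recall_at_k(results, gold, k=5):
    """Fraction of queries whose gold chunk id appears in the top-k results."""
    hits = sum(1 for ranked, g in zip(results, gold) if g in ranked[:k])
    return hits / len(gold)

def mrr(results, gold):
    """Mean reciprocal rank of the gold chunk (contributes 0 if missing)."""
    total = 0.0
    for ranked, g in zip(results, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)

# Two gold questions: the first gold chunk is ranked 2nd, the second 1st.
results = [["c9", "c1", "c4"], ["c7", "c2"]]
gold = ["c1", "c7"]
# recall_at_k -> 1.0 (both in top 5); mrr -> (0.5 + 1.0) / 2 = 0.75
```

The example shows the gap the article describes: perfect Recall@5 alongside an MRR dragged down by a gold chunk buried at rank 2.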

Latency and cost close the comparison. A model that scores 0.87 MRR with 280ms average embedding latency is not straightforwardly better than a model with 0.84 MRR and 40ms latency for a real-time product. The 6.5x cost difference compounds at scale. These four numbers together tell you whether the quality difference justifies the price.

The Short Version

Pick text-embedding-3-small as the default. Identify your retrieval task type before testing models, because the task type determines whether model choice even moves the needle. Measure on a sample of your actual documents with realistic queries. Only upgrade to a larger or more specialized model when the numbers show it's worth it.

The benchmark you need is not on any leaderboard. It is sitting in your document store.

The benchmark is your data

Upload 10 to 20 representative documents. RAG Lab generates the gold questions, runs your chosen models side by side, and returns Recall@K, MRR, latency, and cost per strategy. The whole experiment costs under a cent to embed.
