
Why Your RAG Got Worse After Switching Embedding Models (And How to Fix It)

Switching embedding models rewrites your entire vector space. A model that benchmarks better on MTEB may retrieve worse on your documents. Here's how to diagnose what went wrong and run a controlled comparison before your next re-embed.

9 min read · March 28, 2026 · Decompressed

TL;DR

Switching embedding models rewrites your entire vector space. A model that benchmarks better on MTEB may retrieve worse on your documents. Before re-embedding your full corpus, pull 15 representative documents, auto-generate a gold question set, and run both models side by side. Compare Recall@K and MRR. Only commit to the new model if your data says it wins.

You upgraded from text-embedding-ada-002 to text-embedding-3-large. The new model scores higher on MTEB. You re-embedded your corpus. Retrieval got measurably worse. Users are noticing. You have no clear explanation for why a better model produced worse results.

This is one of the most common failure patterns in production RAG systems. It happens precisely because teams treat an embedding upgrade like a software dependency bump: swap the version, redeploy, expect improvement. Embeddings do not work that way.

Why the Upgrade Broke Things

Every embedding model maps text to a different geometric space. The vector for "refund policy" in ada-002's space has a specific position relative to other vectors. In text-embedding-3-large's space, it has a completely different position. The two spaces are not comparable. They are not compatible. They cannot be mixed.

When you re-embed your corpus with a new model, you are not improving your existing vector database. You are replacing it with an entirely different one built on different geometric assumptions. Queries that found the right chunks before are now searching in a space where proximity means something different.

Two models produce two incompatible vector spaces. Switching models means re-learning all distances.

Do not partially re-embed. If you switch models, every document and every query must use the same model. Mixed-model vector databases produce unpredictable retrieval because similarity scores are not comparable across spaces.
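One cheap way to enforce the same-model rule is to tag every stored vector with the model that produced it and reject anything else at write and query time. This is a minimal sketch, not the API of any specific vector database; the class and field names are illustrative.

```python
# Sketch: a store that accepts exactly one embedding model, and refuses
# vectors or queries produced by any other model. Names are illustrative.
from dataclasses import dataclass, field

@dataclass
class VectorStore:
    embedding_model: str  # the single model this store accepts
    vectors: dict = field(default_factory=dict)  # doc_id -> embedding

    def add(self, doc_id: str, embedding: list, model: str) -> None:
        if model != self.embedding_model:
            raise ValueError(
                f"store uses {self.embedding_model!r}, got a vector from {model!r}"
            )
        self.vectors[doc_id] = embedding

    def query(self, query_embedding: list, model: str):
        if model != self.embedding_model:
            raise ValueError("query must be embedded with the store's model")
        ...  # similarity search would go here
```

The guard costs one string comparison per call and turns a silent, unpredictable retrieval bug into a loud error at the boundary.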

The deeper problem is that "better on MTEB" does not mean "better on your corpus." MTEB measures performance across 56 tasks drawn from Wikipedia, Common Crawl, and academic datasets. If your corpus is legal contracts, internal engineering documentation, or domain-specific support tickets, the MTEB results have limited predictive power for your use case.

A model can rank first on MTEB and underperform text-embedding-3-small on your specific documents. The benchmark does not know your data exists.

Why the Common Fixes Do Not Work

Re-embedding again with an even newer model. This continues the same loop. Without measuring retrieval quality against your actual queries, you have no way to know whether the new embedding space is better for your content. You are spending money on blind experiments.

Rolling back to the old model. Rolling back stops the bleeding but tells you nothing. You have learned that the new model is worse on your corpus, but you do not know by how much, on which query types, or whether a different configuration of the new model might have worked.

Adjusting the similarity threshold. Threshold tuning is a symptom fix. If the model is placing relevant chunks farther from queries in the new space, lowering the threshold returns more results but does not fix precision. You get more noise, not better signal.

Switching to a hybrid search approach after the fact. Adding BM25 on top of a broken dense retrieval configuration can recover some quality through keyword matching, but it masks the underlying issue and adds complexity you may not need if you had measured first.

All of these fixes share the same flaw: they apply changes to a full production corpus without first measuring what actually went wrong. The correction comes before the diagnosis.
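The threshold point is easy to see with numbers. In this toy example the similarity scores and relevance labels are made up: in the new space the relevant chunks simply sit at lower similarity, so loosening the threshold admits more results without improving precision.

```python
# Toy illustration: lowering the threshold returns more results,
# but precision stays poor because relevance did not move with the scores.
results = [  # (similarity score, is_actually_relevant) -- fabricated values
    (0.82, False), (0.79, True), (0.77, False),
    (0.74, False), (0.71, True), (0.65, False),
]

def precision_at_threshold(results, threshold):
    kept = [rel for score, rel in results if score >= threshold]
    return sum(kept) / len(kept) if kept else 0.0

print(precision_at_threshold(results, 0.80))  # 0.0: one hit, irrelevant
print(precision_at_threshold(results, 0.70))  # 0.4: five hits, two relevant
```

More results, same weak signal: the fix has to happen in the embedding space, not at the cutoff.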

The Correct Workflow

Before any model switch, run a controlled comparison on a sample of your actual documents. This takes under an hour and costs less than a cent. It tells you whether the new model will improve retrieval on your corpus before you spend time and money on a full re-embed.

The safe upgrade workflow: measure on a sample before committing to a full re-embed.
1. Pull 15 representative documents from your corpus. Cover the content types your users actually query.

2. Auto-generate a gold question set. An LLM writes realistic questions for each chunk and records which chunk should surface. Takes 60 seconds.

3. Run your current model and the candidate model against the same gold set simultaneously.

4. Compare Recall@K and MRR. If the new model leads by more than 3 to 5 points on both metrics, the upgrade is worth it. If not, it is not.

5. Re-embed the full corpus once, with the model your data chose.
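The scoring in step 4 reduces to two short functions. This sketch assumes retrieval (embedding plus nearest-neighbor search) has already produced a ranked list of chunk IDs per gold question for each model; the rankings below are placeholders, not real model output.

```python
# Step 4 sketch: compare two models' ranked retrievals against a gold set.
def recall_at_k(rankings, gold, k=5):
    """Fraction of questions whose gold chunk appears in the top k."""
    hits = sum(1 for q, ranked in rankings.items() if gold[q] in ranked[:k])
    return hits / len(gold)

def mrr(rankings, gold):
    """Mean reciprocal rank of the gold chunk (0 if never retrieved)."""
    total = 0.0
    for q, ranked in rankings.items():
        if gold[q] in ranked:
            total += 1.0 / (ranked.index(gold[q]) + 1)
    return total / len(gold)

gold = {"q1": "c3", "q2": "c7"}                        # question -> correct chunk
old_model = {"q1": ["c3", "c1"], "q2": ["c2", "c7"]}   # placeholder rankings
new_model = {"q1": ["c1", "c3"], "q2": ["c9", "c2"]}

for name, rankings in [("old", old_model), ("new", new_model)]:
    print(name, recall_at_k(rankings, gold, k=2), mrr(rankings, gold))
```

With these toy rankings the old model scores Recall@2 of 1.0 and MRR of 0.75 against the new model's 0.5 and 0.25, which is exactly the kind of gap the sample comparison exists to surface.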

Step 3 is the one most teams skip, jumping straight to step 5. The sample comparison is cheap: a 15-document gold-set evaluation runs in minutes. The full re-embed it can prevent, if the model turns out to be the wrong choice, takes hours and costs real money.

Run the comparison with at least two chunking strategies per model, not just one. A chunk size that worked well with your old model may not be optimal for the new one. The model and the chunking config are evaluated together, not independently.

Reading the Comparison Results

After running both models against your gold set, you are looking at four numbers per strategy: Recall@K, MRR, latency, and cost.

| Signal | What it means | What to do |
| --- | --- | --- |
| New model wins by 5+ points | Real improvement on your corpus | Re-embed the full corpus with the new model |
| New model wins by 1 to 2 points | Noise, not a real signal | Keep the old model, save the re-embed cost |
| New model loses on MRR, wins on Recall | Finds the right chunk but ranks it lower | Consider adding a reranker before switching models |
| Old model wins across the board | The upgrade was a regression for your data | Stay on the old model, benchmark again before any future switch |

The most common outcome is not a clear winner. It is a 1 to 3 point difference that does not justify the cost and risk of a full re-embed. Knowing this before you re-embed is the whole point of the comparison.
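The decision rules above can be written down as a function, which keeps the call consistent across upgrades. The thresholds (a 5-point win, a 1-to-2-point noise band) follow the article; the point values are percentage points, and the exact cutoffs are a judgment call you may tune.

```python
# The comparison table as a decision function. Thresholds are illustrative.
def upgrade_decision(old, new, margin=5.0):
    """old/new: dicts with 'recall' and 'mrr' in percentage points."""
    d_recall = new["recall"] - old["recall"]
    d_mrr = new["mrr"] - old["mrr"]
    if d_recall >= margin and d_mrr >= margin:
        return "re-embed with the new model"
    if d_recall > 0 and d_mrr < 0:
        return "consider a reranker before switching"
    if d_recall <= 0 and d_mrr <= 0:
        return "stay on the old model"
    return "difference within noise: keep the old model"
```

Encoding the rule means the next model evaluation produces a decision, not a debate.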


What to Do If You Already Switched

If you have already re-embedded and retrieval is worse, the rollback question is straightforward: do you have the old embeddings stored, or do you need to re-generate them?

If you stored the old embeddings, re-evaluate the old model against your gold set before rolling back. You want confirmation that the old model actually performs better, not just an assumption based on memory of how things used to feel.

If you did not store the old embeddings, re-generate them on your 15-document sample first. Run the comparison. If the old model wins, re-embed the full corpus with it. You will spend the same money re-embedding either way. The comparison at least ensures you spend it on the right model.

Going forward, save your evaluation results alongside your embedding configs. A model choice without its Recall@K score attached is a guess. The score is what makes the decision defensible.
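One lightweight way to do this is to write the scores into the same file as the embedding config. The field names and values below are illustrative, not a required schema.

```python
# Sketch: persist the eval results next to the config they justified.
# All field names and numbers are illustrative.
import json

record = {
    "embedding_model": "text-embedding-3-large",
    "chunk_size": 512,
    "chunk_overlap": 64,
    "eval": {"recall_at_5": 0.81, "mrr": 0.67, "gold_set_size": 45},
    "decided_on": "2026-03-28",
}

with open("embedding_config.json", "w") as f:
    json.dump(record, f, indent=2)
```

Six months later, when someone proposes the next upgrade, the file answers "why this model?" with a number instead of a memory.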

Test before you re-embed

Upload 15 representative documents, generate a gold set automatically, and run your old and new models side by side. You get Recall@K, MRR, latency, and cost per strategy in minutes. The comparison costs less than a cent. The re-embed you avoid could save hours.
