
Our RAG Got Worse Without Any Code Changes

Retrieval quality can degrade with zero deployments. The cause is almost always the corpus: new documents shift your embedding distribution, old ones go stale, query patterns outgrow the content. Here's how to catch it before your users do.

10 min read · March 30, 2026 · Decompressed

TL;DR

RAG quality can degrade without a single line of code changing. The culprit is almost always the corpus: new documents shift the embedding distribution, old documents get stale, or query patterns outgrow the content that exists. Catching it manually requires calculating centroid distances, storing baseline versions, and tracking distribution divergence over time. Most teams skip all of that until users complain. There is a faster path.

You open your support inbox on a Tuesday and there are three tickets in a row: "the AI gave me the wrong answer," "search stopped working," "it used to find this instantly." You pull up your deployment history. Nothing went out in the last two weeks. You check the model. Same one. You check the API. No errors. You check the retrieval logs. Queries are completing.

The code did not change. The retrieval got worse. Those two things should not be able to coexist, but they do, regularly, in production RAG systems. Understanding why requires looking at what actually changed: not the code, but the data the code runs on.

What Actually Causes This

A RAG pipeline has two moving parts: the retrieval system and the corpus it searches. Most teams monitor the retrieval system obsessively and ignore the corpus entirely. The corpus is where most silent regressions live.

There are three common patterns:

Corpus drift

New documents were added over time and they represent a different type of content than what existed at launch. The embedding distribution shifted. Queries written against the old corpus now land in the wrong neighborhood.

Document staleness

Old documents contain information that is no longer accurate. The retrieval system still surfaces them because they are semantically close to the query. The content is technically retrieved correctly. It is just wrong now.

Query pattern shift

Users started asking questions your corpus was never designed to answer. The semantic gap between queries and available content widened. Recall dropped not because the retrieval broke but because the match was never there to make.

Three corpus changes that silently degrade retrieval quality without touching a line of code

All three of these happen gradually. There is no moment where something breaks. The quality slope is shallow enough that no single day looks like a problem, and by the time the aggregate degradation is obvious, it has been accumulating for weeks.

The most dangerous RAG failure mode is not a crash. It is a slow retrieval regression that no alert catches because nothing in your monitoring stack is watching the data, only the code.

The Manual Approach: What Catching This Actually Takes

If you want to detect corpus-driven quality regression before your users do, you need three things: a measurement of how your embedding distribution changes over time, a baseline to compare against, and a trigger that runs that comparison automatically.

Here is what building that yourself looks like.

Step 1: Calculate centroid distance

The centroid of your embedding space is the average vector across every document in your corpus. When you add a large batch of new content, the centroid moves. If it moves far enough, the retrieval geometry that worked before is now operating on different assumptions.

To measure this, you compute the mean embedding of your corpus at time A, compute it again at time B, and calculate the cosine distance between those two centroids. A small shift (under 0.05) is normal. A shift above 0.15 or 0.20 usually correlates with a measurable retrieval quality drop on queries tuned to the original corpus.

```python
import numpy as np

def centroid_distance(embeddings_v1, embeddings_v2):
    """Cosine distance between the mean embeddings of two corpus versions."""
    c1 = np.mean(embeddings_v1, axis=0)
    c2 = np.mean(embeddings_v2, axis=0)
    # cosine distance
    dot = np.dot(c1, c2)
    norm = np.linalg.norm(c1) * np.linalg.norm(c2)
    return 1.0 - (dot / norm)

# baseline_embeddings / current_embeddings: document vectors from the two versions
shift = centroid_distance(baseline_embeddings, current_embeddings)
if shift > 0.15:
    print(f"Centroid shifted {shift:.3f} — retrieval geometry changed")
```

The problem is that running this requires you to have stored your baseline embeddings somewhere. Most teams do not. When the shift happens, you have no reference point. You know something changed; you cannot measure by how much.

Step 2: Track norm shift

Centroid distance tells you where the average vector went. Norm shift tells you something different: whether the average magnitude of your vectors changed. This matters because if your documents are getting longer on average, or if you changed your preprocessing pipeline, the norms will drift even if the semantic direction of your corpus stays roughly the same.

High norm shift combined with low centroid shift usually means document length or preprocessing changed. High centroid shift with stable norms usually means the semantic content of your corpus changed. Both patterns are worth catching. They point at different problems.
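The norm-shift check itself is only a few lines on top of NumPy. A minimal sketch, assuming embeddings are the rows of a 2-D array; the function name and the relative-change formulation are illustrative, not a fixed standard:

```python
import numpy as np

def norm_shift(embeddings_v1, embeddings_v2):
    """Relative change in mean vector magnitude between two corpus versions."""
    n1 = np.mean(np.linalg.norm(embeddings_v1, axis=1))
    n2 = np.mean(np.linalg.norm(embeddings_v2, axis=1))
    return abs(n2 - n1) / n1
```

Reading this alongside centroid distance gives you the two-axis diagnosis described above: direction versus scale.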

Step 3: Measure distribution divergence

Jensen-Shannon divergence measures how different two probability distributions are. Applied to your embedding space, it tells you whether the overall shape of your corpus distribution changed, not just the center. Two corpora can have the same centroid but very different distributions: one clustered tightly around a single topic, the other spread across ten.

A JS divergence under 0.1 between versions indicates healthy continuity. Above 0.3 typically means the corpus added content that is semantically distant from what existed before. That is not always bad, but it needs to be caught so you can evaluate whether your retrieval config still makes sense.

Centroid shift, norm shift, and JS divergence each catch different failure modes. Centroid shift catches directional drift. Norm shift catches scale drift. JS divergence catches structural drift. You need all three to know your corpus is healthy.
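JS divergence is defined over probability distributions, so applying it to raw embeddings requires a discretization step. One sketch, under the assumption that per-dimension histograms with shared bins are an acceptable approximation (production systems often discretize via clustering instead):

```python
import numpy as np

def js_divergence(embeddings_v1, embeddings_v2, n_bins=32):
    """Approximate JS divergence (in bits) between two embedding distributions,
    averaged over per-dimension histograms with shared bin edges."""
    def _js(p, q):
        # Smooth to avoid log(0), then normalize counts to probabilities.
        p = (p + 1e-12) / (p + 1e-12).sum()
        q = (q + 1e-12) / (q + 1e-12).sum()
        m = 0.5 * (p + q)
        kl = lambda a, b: np.sum(a * np.log2(a / b))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    total = 0.0
    dims = embeddings_v1.shape[1]
    for d in range(dims):
        lo = min(embeddings_v1[:, d].min(), embeddings_v2[:, d].min())
        hi = max(embeddings_v1[:, d].max(), embeddings_v2[:, d].max())
        bins = np.linspace(lo, hi, n_bins + 1)
        p, _ = np.histogram(embeddings_v1[:, d], bins=bins)
        q, _ = np.histogram(embeddings_v2[:, d], bins=bins)
        total += _js(p.astype(float), q.astype(float))
    return total / dims
```

Identical corpora score near zero; a corpus whose distribution barely overlaps the baseline approaches the maximum of 1 bit per dimension.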

Step 4: Store versioned baselines

All of the above is only useful if you have something to compare against. That means every time you update your corpus, you need to snapshot your embeddings, store the metadata about that snapshot, and keep the history long enough to be useful.

In practice this means writing a versioning layer on top of your vector database, a storage strategy for the embedding snapshots (which can be large), a retrieval mechanism to pull any historical version on demand, and a comparison pipeline that runs the checks automatically.
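A minimal baseline store can be as simple as compressed NumPy snapshots plus a JSON metadata sidecar. A sketch under two loud assumptions: local disk and a flat `corpus_snapshots/` directory (a real system would likely use object storage and a proper manifest):

```python
import json
import time
from pathlib import Path
import numpy as np

SNAPSHOT_DIR = Path("corpus_snapshots")  # hypothetical storage location

def snapshot_corpus(embeddings, version_id, notes=""):
    """Persist a versioned embedding baseline plus metadata for later diffs."""
    SNAPSHOT_DIR.mkdir(exist_ok=True)
    path = SNAPSHOT_DIR / f"v{version_id}.npz"
    np.savez_compressed(path, embeddings=embeddings)
    meta = {
        "version": version_id,
        "created_at": time.time(),
        "doc_count": int(embeddings.shape[0]),
        "dim": int(embeddings.shape[1]),
        "notes": notes,
    }
    (SNAPSHOT_DIR / f"v{version_id}.json").write_text(json.dumps(meta))
    return path

def load_snapshot(version_id):
    """Pull a historical baseline back for comparison."""
    return np.load(SNAPSHOT_DIR / f"v{version_id}.npz")["embeddings"]
```

Even this toy version surfaces the real costs: snapshot size grows with the corpus, and someone has to decide retention policy.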

Most teams are not going to build this. The ones that do spend two to four weeks on it and end up with something brittle that they stop maintaining after the first major corpus migration. The problem is not that the approach is wrong. It is that building it from scratch is expensive and the maintenance burden compounds.

Step 5: Wire a trigger to run checks automatically

Even if you build all of the above, it only catches regressions if it actually runs. Manually kicking off a comparison after every corpus update requires someone to remember to do it. In practice, nobody does, especially not on a Friday when the document sync happens to run at 11pm.

You need the checks to run automatically on every version change, produce a clear pass or fail, and surface failures somewhere a human will actually see them before the next deployment goes out.
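The trigger itself is mostly glue: run the checks against the stored baseline, compare against thresholds, and hand failures to whatever alerting you already have. A sketch in which the threshold value and the `notify` hook are illustrative assumptions:

```python
import numpy as np

THRESHOLDS = {"centroid_shift": 0.15}  # illustrative default, tune per corpus

def run_quality_checks(baseline_embeddings, current_embeddings):
    """Return a per-check pass/fail report with raw metric values."""
    c1 = np.mean(baseline_embeddings, axis=0)
    c2 = np.mean(current_embeddings, axis=0)
    shift = 1.0 - np.dot(c1, c2) / (np.linalg.norm(c1) * np.linalg.norm(c2))
    report = {
        "centroid_shift": {
            "value": float(shift),
            "passed": bool(shift <= THRESHOLDS["centroid_shift"]),
        }
    }
    report["all_passed"] = all(
        c["passed"] for c in report.values() if isinstance(c, dict)
    )
    return report

def on_version_created(baseline_embeddings, new_embeddings, notify):
    """Call this from the sync job after each new corpus version is written."""
    report = run_quality_checks(baseline_embeddings, new_embeddings)
    if not report["all_passed"]:
        notify(f"Corpus quality check failed: {report}")
    return report
```

The point is not the code; it is that the call site lives inside the sync job, so the check runs whether or not anyone remembers it.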

A quality check that runs only when someone remembers to trigger it is not a quality check. It is a ritual that makes everyone feel better until the regression happens between rituals.

What the full monitoring pipeline looks like end to end: from corpus change to actionable alert

What This Costs to Build and Maintain

Building it yourself

  • Versioning layer on vector DB: 1 to 2 weeks
  • Baseline snapshot storage + retrieval: 3 to 5 days
  • Centroid / norm / JS divergence pipeline: 2 to 3 days
  • Auto-trigger on corpus change: 1 to 2 days
  • Dashboard to surface results: 3 to 5 days
  • 3 to 5 weeks before you catch your first regression

With Decompressed

  • Versioning: automatic on every sync
  • Baseline storage: every version retained
  • Centroid / norm / JS checks: built-in Quality tab
  • Auto-trigger on version create: one toggle
  • Pass / fail with metric values: visible immediately
  • First regression caught before the next deploy

How Decompressed Handles This

The approaches described above are the right ones. Centroid distance, norm shift, and JS divergence are the metrics that actually tell you whether your corpus changed in a way that affects retrieval. The problem is not the measurement. It is the infrastructure cost of running it continuously.

Decompressed is built around the assumption that your corpus is a versioned artifact, not a mutable blob. Every time you sync or upload documents, a new version is created automatically. You do not opt into versioning. It happens by default. That means the baseline always exists, going back to the first sync.

The Quality tab

The Quality tab runs the exact checks described above on your actual corpus. Centroid shift is measured against the previous version. Norm shift is compared version over version. JS divergence tells you how the distribution changed. Each check surfaces the raw metric value and the threshold it is measured against, so you can see not just pass or fail but by how much.

When a check fails, you see: what moved, how far it moved, and what version it moved from. That is the full diagnosis in one view. No SQL queries against your vector database. No custom scripts to pull embedding snapshots. The comparison is already done.

Centroid Shift

Healthy under 0.10

Cosine distance between the average embedding of the current version and the previous one. A shift above your threshold means the semantic center of your corpus moved.

Norm Shift

Healthy under 0.15

Change in average vector magnitude. Catches preprocessing changes and document length drift before they silently affect retrieval scoring.

JS Divergence

Healthy under 0.10

How different the overall embedding distribution is between versions. Catches structural corpus changes that centroid shift alone would miss.

Vector Validity

Zero invalid vectors

Checks for NaN, Inf, and all-zero vectors. Invalid vectors do not throw errors. They poison your index silently.
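For intuition, this kind of scan is a few lines of NumPy (a sketch of the check's logic, not Decompressed's implementation):

```python
import numpy as np

def invalid_vector_count(embeddings):
    """Count NaN, Inf, and all-zero vectors in an (n_docs, dim) array."""
    nan_or_inf = ~np.isfinite(embeddings).all(axis=1)
    all_zero = ~embeddings.any(axis=1)
    return int(np.sum(nan_or_inf | all_zero))
```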

Duplicate Detection

Under 5%

Rate of near-exact embedding copies. High duplicate rates skew search results and waste index storage.
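A brute-force version of this check looks like the following sketch. The pairwise similarity matrix is O(n²), which is fine for intuition or small corpora; at scale you would use approximate nearest-neighbor search instead:

```python
import numpy as np

def duplicate_rate(embeddings, threshold=0.999):
    """Fraction of vectors that are a near-exact copy of an earlier vector."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    # Upper triangle only, so each duplicate is counted once.
    upper = np.triu(sims, k=1)
    dup_mask = (upper > threshold).any(axis=0)
    return float(dup_mask.sum()) / len(embeddings)
```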

Auto-run on version create

Quality checks run automatically every time a new version of your corpus is created. You do not have to schedule them, trigger them manually, or remember to check the results. When the nightly document sync completes and a new version is written, the checks run against it. If something failed, you know before the next morning standup. Not when users start complaining two weeks later.

Version history as a testing surface

Because every version of your corpus is retained, you can test retrieval quality against any of them. If a user reports that search felt better three months ago, you can pull the corpus state from that period and run the same gold set evaluation against it. That comparison either confirms the regression or rules out the corpus as the cause. Either answer is useful.

This also means you can test configuration changes before committing them. Run the same evaluation on two different corpus versions with the same retrieval config. Or run two different configs against the same version. The version history gives you a stable reference point that does not move while you are experimenting.

Most RAG debugging happens without a baseline because teams never stored one. When every corpus state is retained by default, you always have a baseline. That changes the debugging question from "what changed?" to "when did it change and by how much?"

Connect your corpus

Decompressed versions your corpus automatically and runs centroid shift, norm shift, and JS divergence checks on every version without any setup. No baselines to store manually. No scripts to write. Connect your corpus and the Quality tab shows you the first check results in minutes.

See your corpus health

When You Get the Regression Alert

When a quality check fails, the first question is whether the shift is a problem or expected. A centroid shift of 0.08 after adding 200 new support articles is probably fine: you added content, the distribution moved slightly, and retrieval is still operating in the same general space. A centroid shift of 0.31 after what was supposed to be a routine re-index is not fine.

The threshold values are yours to set. The defaults are conservative starting points. For corpora that change frequently and intentionally (news, support articles, product documentation), you may raise the centroid shift threshold. For corpora that are supposed to stay stable (legal documents, compliance content, certified knowledge bases), you may tighten it.
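One way to encode that tradeoff is per-corpus threshold profiles. The profile names and values below are illustrative assumptions, not Decompressed defaults:

```python
# Looser limits for corpora that change often and intentionally;
# tighter limits for corpora that are supposed to stay stable.
THRESHOLD_PROFILES = {
    "fast_moving": {"centroid_shift": 0.20, "norm_shift": 0.20, "js_divergence": 0.20},
    "stable":      {"centroid_shift": 0.05, "norm_shift": 0.10, "js_divergence": 0.05},
}
```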

When a shift is genuinely unexpected, the version history is the starting point. Pull the two versions on either side of the failure. Identify which documents were added or removed. If a single batch introduced the shift, you can evaluate whether those documents belong in the index at all, or whether they need a separate retrieval config.

A failing quality check is not always a signal to roll back. Sometimes it is a signal to update your thresholds because the corpus genuinely evolved. The check's job is to surface the change. Your job is to decide whether the change is acceptable.

The Pattern That Keeps Failing

Teams discover RAG regressions one of two ways: from a quality check, or from users. The ones who find out from users spend the next several days in reactive mode: pulling logs, running ad hoc queries, trying to reproduce complaints without a clear baseline.

The ones who find out from a quality check spend twenty minutes looking at what shifted and deciding whether to act. The total calendar time is not that different. The experience is completely different. One is controlled. One is a fire.

Building the monitoring yourself is worth it if you have the engineering capacity and you want to own the instrumentation. If you do not, you will keep finding out from users, and the cost of that is not just the support tickets. It is the trust you spend every time the system behaves worse than expected without warning.

Stop finding out from users

Decompressed runs centroid shift, norm shift, and distribution checks automatically every time your corpus changes. You get a pass or fail before your next deployment. No instrumentation to write. No baseline to manually store. The history is already there.

See it on your data