RAG contamination is not a bug: it is a property of the system

Type: Article (interpretive risk)

Conceptual version: 1.0

Stabilization date: 2026-02-28

In RAG, corpus contamination is not a peripheral accident. It follows directly from the architecture: retrieval turns fragments into contextual authority.

Many teams talk about RAG contamination as if it were a bug: “a bad document slipped in,” “one chunk was poor,” “the index retrieved something absurd.” That reading is reassuring because it suggests a local fix. In reality, a RAG architecture deliberately inserts a retrieval chain that makes recalled fragments actionable inside the response.

RAG fails not only when it retrieves poorly. It fails when the system grants implicit authority to what it retrieves, especially when the corpus is heterogeneous, weakly bounded, or easy to contaminate.

RAG: an architecture that manufactures contextual authority

A RAG system works in two steps:

retrieve fragments from an index: documents, pages, chunks, metadata
generate a response that integrates those fragments as context.

That integration is not neutral. A retrieved fragment is not only “read”; it is often treated as relevant, sometimes as proof, and frequently as a basis for the answer itself. That is why contamination is systemic: the retrieval chain is also an authority chain.
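The two steps, and the way retrieved text enters the context with no mark of its standing, can be sketched as follows. This is a toy illustration under stated assumptions, not a real retriever: the `Fragment` type, the word-overlap `score`, the `build_prompt` helper, and the sample sources are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    source: str  # hypothetical document identifier
    text: str


def score(query: str, fragment: Fragment) -> int:
    """Toy relevance: number of distinct lowercase words shared with the query."""
    return len(set(query.lower().split()) & set(fragment.text.lower().split()))


def retrieve(query: str, index: list[Fragment], k: int = 2) -> list[Fragment]:
    """Step 1: return the k best-scoring fragments, regardless of their source."""
    return sorted(index, key=lambda f: score(query, f), reverse=True)[:k]


def build_prompt(query: str, fragments: list[Fragment]) -> str:
    """Step 2 (input side): paste fragment text into the generator's context."""
    context = "\n".join(f.text for f in fragments)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical corpus: one official policy, one forum post, one unrelated page.
index = [
    Fragment("policy-v3.pdf", "refunds are accepted within 30 days of purchase"),
    Fragment("forum-post-881", "refunds are accepted within 90 days of purchase i think"),
    Fragment("faq.html", "shipping takes five business days"),
]

query = "are refunds accepted within 90 days"
retrieved = retrieve(query, index)
prompt = build_prompt(query, retrieved)
print([f.source for f in retrieved])
print(prompt)
```

In this sketch the forum post outranks the policy document simply because it shares more words with the question, and once both are pasted into the prompt nothing distinguishes them: the source identifiers never reach the generator.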

Contamination: three dominant mechanisms

1) Reference drift. Non-canonical sources surface as if they were preferable simply because they match semantically, repeat more often, or chunk more cleanly. The system then cites, summarizes, or stabilizes references that should not carry authority.

2) Sticky fragments. Some fragments are unusually reusable: generic formulations, vague definitions, procedural language, disclaimers. They reappear across contexts and become a recurring bias.
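One way to make sticky fragments visible is to count how often each fragment is recalled across unrelated queries. The sketch below assumes a retrieval log is available as a mapping from query to retrieved fragment identifiers; the log format, the identifiers, and the 0.5 threshold are illustrative assumptions.

```python
from collections import Counter


def sticky_fragments(
    retrievals: dict[str, list[str]], threshold: float = 0.5
) -> list[tuple[str, float]]:
    """Return (fragment_id, share) for fragments recalled in at least
    `threshold` of the logged queries, most frequent first."""
    if not retrievals:
        return []
    counts: Counter[str] = Counter()
    for fragment_ids in retrievals.values():
        counts.update(set(fragment_ids))  # count each fragment once per query
    total = len(retrievals)
    shares = [(fid, n / total) for fid, n in counts.items() if n / total >= threshold]
    return sorted(shares, key=lambda item: (-item[1], item[0]))


# Hypothetical retrieval log: four unrelated queries.
log = {
    "refund window": ["disclaimer-1", "policy-refunds"],
    "shipping delay": ["disclaimer-1", "faq-shipping"],
    "warranty terms": ["disclaimer-1", "policy-warranty"],
    "account deletion": ["policy-privacy", "faq-account"],
}

print(sticky_fragments(log))
```

A fragment that appears in most unrelated queries, like the disclaimer here, is a candidate for review: it may be shaping answers it has no particular claim to shape.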

3) Recall instability. Slightly different prompts retrieve different fragments. The answer becomes variable or contradictory, not because the model suddenly hallucinates more, but because the recalled context is not stable.
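Recall instability can be estimated by sending paraphrases of the same question through retrieval and comparing the resulting fragment sets. The sketch below uses mean pairwise Jaccard similarity as one possible stability score; the metric choice and the sample identifiers are assumptions, not something the architecture prescribes.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two fragment sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def recall_stability(retrieved_sets: list[set[str]]) -> float:
    """Mean pairwise Jaccard similarity; 1.0 means every paraphrase
    recalled exactly the same fragments."""
    pairs = list(combinations(retrieved_sets, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical fragment ids recalled for three paraphrases of one question.
runs = [
    {"policy-refunds", "faq-returns"},
    {"policy-refunds", "forum-post-881"},
    {"forum-post-881", "blog-2019"},
]

print(round(recall_stability(runs), 3))
```

A low score on paraphrases that should be equivalent suggests that variation in the answers originates in the recalled context rather than in the generator.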

Why this cannot be solved by “a better filter”

Filtering helps against obvious pollutants. It does not solve the deeper issue: a retrieved fragment may look clean while still being illegitimate as an authority source. A fragment can be syntactically acceptable and doctrinally dangerous.

That is why contamination is not only a content problem. It is a governance problem: which documents may be recalled, under which conditions, with which rank, and with which abstention rule when retrieval is not sufficient.

Corpus governance: the real perimeter

In RAG, the real perimeter is not the model alone. It is the whole ingestion, indexing, retrieval, and ranking surface. Corpus governance therefore means:

declaring which sources are canonical, secondary, or excluded
ensuring retrieval does not flatten those ranks
detecting contradictions instead of hiding them inside synthesis
allowing legitimate non-response when recalled material is insufficient or conflicting.
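The four requirements above can be sketched as a small gate placed between retrieval and generation. Everything here is an assumption made for illustration: the `Rank` levels, the `Hit` type, the `govern` function, the similarity threshold, and above all the `claim` field, which stands in for whatever contradiction detection a real system would use.

```python
from dataclasses import dataclass
from enum import IntEnum


class Rank(IntEnum):
    EXCLUDED = 0
    SECONDARY = 1
    CANONICAL = 2


@dataclass(frozen=True)
class Hit:
    source: str        # hypothetical document identifier
    rank: Rank         # declared standing of the source
    similarity: float  # retriever score, assumed to lie in [0, 1]
    claim: str         # normalized statement from the fragment (assumed given)


def govern(hits: list[Hit], min_similarity: float = 0.5) -> dict:
    """Apply declared ranks before similarity; abstain when the recalled
    material is insufficient or when top-ranked sources disagree."""
    usable = [
        h for h in hits
        if h.rank != Rank.EXCLUDED and h.similarity >= min_similarity
    ]
    if not usable:
        return {"status": "abstain", "reason": "insufficient", "hits": []}
    top_rank = max(h.rank for h in usable)
    top = sorted(
        (h for h in usable if h.rank == top_rank),
        key=lambda h: h.similarity,
        reverse=True,
    )
    if len({h.claim for h in top}) > 1:
        # Surface the contradiction instead of blending it into a synthesis.
        return {"status": "abstain", "reason": "conflict", "hits": top}
    return {"status": "answer", "reason": "ok", "hits": top}


hits = [
    Hit("forum-post-881", Rank.SECONDARY, 0.92, "refund window: 90 days"),
    Hit("policy-v3.pdf", Rank.CANONICAL, 0.81, "refund window: 30 days"),
    Hit("scraped-mirror", Rank.EXCLUDED, 0.95, "refund window: 60 days"),
]
decision = govern(hits)
print(decision["status"], [h.source for h in decision["hits"]])
```

The design choice in this sketch is that rank is applied before similarity: the canonical policy is kept even though two other sources score higher, and the gate returns an explicit abstention rather than a weaker answer when nothing usable remains or when sources of the same rank disagree.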

RAG and interpretive risk

RAG becomes an interpretive-risk problem when the system starts answering with authority it does not actually possess. At that point, contamination is no longer a quality issue. It becomes a liability issue: a fragment recalled as context is treated as justification for a response that may later be used, quoted, or contested.

Conclusion

RAG contamination is not a side effect to be patched once in a while. It is a structural property of any architecture that turns recalled fragments into contextual authority. The real issue is therefore not only whether the corpus is “clean,” but whether the system is governed strongly enough to prevent retrieval from manufacturing illegitimate truth.