Type: Article (interpretive risk)

Conceptual version: 1.0

Stabilization date: 2026-02-28

Detecting injection, toxic content, or anomalies can improve security. It does not make a response legitimate. Legitimacy is a governance property: perimeter, source hierarchy, response conditions, and enforceable abstention.

In most AI-security discourse, defense is framed as a problem of detection: detect malicious prompts, detect toxicity, detect jailbreak attempts, detect suspicious content, then filter, block, or redirect. Those approaches are useful. But they only address part of the problem.

They reduce the visibility of certain patterns. They do not answer the most expensive question: when can a response be defended as legitimate, reconstructible, and enforceable when challenged?

Why detection is not enough

Detection often intervenes after confusion has already been installed: instruction versus data, source versus authority, context versus truth. Even when a filter blocks one class of attacks, the structural risk remains: the system can still produce a plausible answer outside defensible conditions.

A system can therefore be “clean” in the sense that nothing suspicious was detected, while still being illegitimate in the sense that no authorized basis, no hierarchy, no traceability, and no abstention rule support the output.

Detection: a symptom logic

Detection behaves like symptom medicine:

  • it observes patterns,
  • it triggers rules,
  • it reduces one class of visible behaviors.

But in an interpretive regime, the main exposure is not simply “a behavior to block.” It is the appearance of an assertion without legitimate grounding, which may then be reused, quoted, and treated as truth.

Legitimacy: a condition logic

Legitimacy is not a filter. It is an output contract. A response is legitimate only if minimum conditions are satisfied:

  • Admissibility: the topic falls within the authorized perimeter.
  • Authority: the response relies on admissible, ranked sources.
  • Traceability: the justification can be reconstructed, not merely intuited.
  • Proportional force: the level of assertion matches the level of proof.
  • Abstention: if the conditions are not met, the system can legitimately refrain from answering.
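The five conditions above can be read as a checkable output contract. The following is a minimal sketch, not an implementation of any real system: every name here (Candidate, AUTHORIZED_TOPICS, SOURCE_RANK, the rank thresholds) is a hypothetical placeholder chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical sketch of the output contract: a draft response is emitted
# only if every condition holds; otherwise the system abstains explicitly.

@dataclass
class Candidate:
    topic: str             # what the draft answer is about
    sources: list          # source kinds the draft relies on
    justification: str     # reconstructible chain, not an intuition
    assertion_level: int   # how strongly the draft asserts (0..3)
    proof_level: int       # how strong the evidence is (0..3)

AUTHORIZED_TOPICS = {"billing", "product-specs"}          # explicit perimeter (illustrative)
SOURCE_RANK = {"policy-doc": 0, "kb-article": 1, "forum-post": 99}  # lower = higher authority
ADMISSIBLE_RANK = 1  # anything ranked worse than this is not an admissible basis

def legitimacy_check(c: Candidate) -> tuple[bool, str]:
    # Admissibility: the topic falls within the authorized perimeter.
    if c.topic not in AUTHORIZED_TOPICS:
        return False, "abstain: topic outside authorized perimeter"
    # Authority: the response relies on admissible, ranked sources.
    if not c.sources or any(SOURCE_RANK.get(s, 99) > ADMISSIBLE_RANK for s in c.sources):
        return False, "abstain: no admissible, ranked source"
    # Traceability: the justification can be reconstructed.
    if not c.justification:
        return False, "abstain: justification not reconstructible"
    # Proportional force: the assertion must not outrun the proof.
    if c.assertion_level > c.proof_level:
        return False, "abstain: assertion exceeds proof"
    return True, "legitimate: all output-contract conditions satisfied"
```

Note that abstention is a first-class outcome of the contract, not an error path: a refusal produced by this check is itself a defensible, reconstructible response.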

The classic trap: filtering content instead of bounding authority

Teams often try to secure content while leaving authority implicit. They harden inputs, blacklist patterns, and screen outputs, but they never declare what the system is authorized to conclude, under which hierarchy, or when it must abstain. In that configuration, a filtered system may still produce illegitimate output with perfect fluency.

What filtering does not replace

Filtering does not replace source hierarchy, contradiction handling, explicit perimeters, legitimate non-response, or reconstructible justification. It is an auxiliary control, not a doctrine of answerability.
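One of those controls, contradiction handling under a source hierarchy, illustrates the gap concretely. A filter can only pass or block a claim; it cannot decide which of two conflicting admissible claims prevails, or recognize that no admissible basis exists at all. A minimal sketch, under assumed rank values (the RANK and ADMISSIBLE tables are hypothetical):

```python
# Hypothetical sketch: contradiction handling via an explicit source hierarchy.
# When admissible sources disagree, the higher-ranked source wins; when only
# inadmissible sources remain, the system returns a legitimate non-response
# instead of silently picking a side.

RANK = {"regulation": 0, "internal-policy": 1, "wiki": 2}  # lower = more authoritative
ADMISSIBLE = {"regulation", "internal-policy"}

def resolve(claims: list[tuple[str, str]]) -> str:
    """claims: list of (source_kind, claim_text) pairs for the same question."""
    admissible = [c for c in claims if c[0] in ADMISSIBLE]
    if not admissible:
        return "abstain: no admissible source for this claim"
    admissible.sort(key=lambda c: RANK[c[0]])
    return admissible[0][1]  # claim backed by the highest-ranked source
```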

The role of the Q-Layer

The Q-Layer matters because it makes response conditions explicit. It distinguishes “can the model say something?” from “is the system authorized to say this here, under these conditions, with this proof?” That distinction is what turns security from symptom control into governed legitimacy.
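The distinction can be sketched as two separate gates, capability and authorization, evaluated independently. This is an illustrative reading of the Q-Layer idea, not its actual interface; every function name here is an assumption.

```python
# Hypothetical sketch: "can the model say something?" (capability) is a
# different question from "is the system authorized to say this here?"
# (legitimacy). Fluency never substitutes for authorization.

def model_can_answer(question: str) -> bool:
    # Capability gate: a model produces something plausible for almost any input.
    return bool(question.strip())

def system_may_answer(topic: str, sources: list, authorized: set) -> bool:
    # Authorization gate: perimeter plus an admissible basis, checked
    # regardless of how fluent the draft answer would be.
    return topic in authorized and bool(sources)

def governed_response(question, topic, sources, authorized, generate):
    if not model_can_answer(question):
        return "no output"
    if not system_may_answer(topic, sources, authorized):
        return "abstention: conditions for a legitimate response are not met"
    return generate(question)
```

The point of keeping the gates separate is that the second can fail while the first succeeds: that failure mode is exactly the "illegitimate output with perfect fluency" described above.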

Conclusion

Detection can reduce visible danger. It cannot, by itself, establish legitimacy. A response becomes defensible only when the system is governed by explicit conditions: perimeter, authority, hierarchy, traceability, and the right to abstain.