
Why an XML sitemap is no longer enough: toward a site coherence map for AI agents

Ghost 404s do not always signal missing content. They can reveal a gap between the published structure and the logical paths inferred by agents.

Collection: Article
Type: Article
Category: semantic architecture
Published: 2026-03-26
Updated: 2026-03-26
Reading time: 9 min

Governance artifacts

Governance files brought into scope by this page

This page is anchored to published surfaces that declare identity, precedence, limits, and the corpus reading conditions. Their order below gives the recommended reading sequence.

  1. Canonical AI entrypoint
  2. Public AI manifest
  3. Entity graph
Entrypoint #01

Canonical AI entrypoint

/.well-known/ai-governance.json

Neutral entrypoint that declares the governance map, precedence chain, and the surfaces to read first.

Governs
Access order across surfaces and initial precedence.
Bounds
Free readings that bypass the canon or the published order.

Does not guarantee: This surface publishes a reading order; it does not force execution or obedience.

Entrypoint #02

Public AI manifest

/ai-manifest.json

Structured inventory of the surfaces, registries, and modules that extend the canonical entrypoint.

Governs
Access order across surfaces and initial precedence.
Bounds
Free readings that bypass the canon or the published order.

Does not guarantee: This surface publishes a reading order; it does not force execution or obedience.

Graph and authorities #03

Entity graph

/entity-graph.jsonld

Descriptive graph of entities, identifiers, and relational anchor points.

Governs
Admissible relations, receivable authorities, and conflict arbitration.
Bounds
Abusive merges, copied authority, and unqualified silent arbitration.

Does not guarantee: Describing a graph or registry does not make an exogenous source endogenous truth.

Complementary artifacts (2)

These surfaces extend the main block. They add context, discovery, routing, or observation depending on the topic.

Discovery and routing #04

Content inventory

/site-content-index.json

Machine-first inventory of the pages, articles, and surfaces published on the site.

Discovery and routing #05

LLMs.txt

/llms.txt

Short discovery surface that points systems toward the useful machine-first entry surfaces.

Classical SEO tools describe published pages, existing links, observed HTTP statuses, submitted sitemaps, and the technical quality of a site. They remain indispensable. But they do not always tell us what an agent expected to find between two pages that already exist.

That is where “ghost 404s” become interesting.

When an agent requests a URL that the site has never published, the error does not automatically mean that a page is missing. It can also reveal that a local path, logical from the agent’s point of view, is not explicit enough in the corpus. The problem is no longer only about exploration. It is about local coherence.

The misunderstanding to avoid

A classical 404 and a ghost 404 do not have the same status.

A classical 404 appears when a site breaks its own continuity: wrong internal link, missing redirect, deleted page still referenced, publication error. It is a real surface defect.

A ghost 404 appears when an agent formulates a URL hypothesis from what it has already understood about the site. The error does not come from the site itself. It comes from the logical graph projected by the agent.

This distinction is essential, because it changes the diagnosis.
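The distinction can be operationalized. As a minimal sketch, assume two sets extracted from the site itself: the URLs it has ever published, and the URLs its own pages link to. All names here are illustrative, not part of any published surface:

```python
# Hypothetical sketch: classifying a 404 as "classical" or "ghost".
# Assumed inputs (illustrative only):
# - published: every URL the site has ever served successfully
# - linked:    every URL referenced by an internal link or redirect

def classify_404(url: str, published: set[str], linked: set[str]) -> str:
    """Label a 404 observed in the access logs."""
    if url in linked:
        # The site itself points at this URL: broken continuity.
        return "classical"
    if url not in published:
        # Never published, never linked: the URL was inferred by the
        # client from the site's apparent structure.
        return "ghost"
    # Published before, now gone (e.g. deleted without a redirect).
    return "transient"

published = {"/doctrine", "/clarifications/a"}
linked = {"/doctrine", "/clarifications/a", "/old-page"}

print(classify_404("/old-page", published, linked))          # classical
print(classify_404("/clarifications/b", published, linked))  # ghost
```

The point of the sketch is the diagnostic split: a "classical" label sends you to the site's own continuity, a "ghost" label sends you to the agent's projected graph.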

Why agents infer plausible URLs

In an interpreted web, an agent does not merely enumerate published pages. It looks for relations.

From a group of slugs, a set of definitions, a hierarchy of hubs, or a set of governance files, it may deduce that a page “should” exist.

For example, if a corpus contains canonical definitions, frameworks, clarifications, and a highly regular architecture, the agent may project a plausible intermediate path simply because it completes the form of the system.

This behavior does not prove that the agent “forgets” the real URL. It rather shows that it is trying to reconstruct the most economical local neighborhood from the available signals.
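That kind of projection can be illustrated with a toy example, assuming slugs follow a regular category/topic pattern. This is purely hypothetical; real agents infer from far richer signals than slug regularity:

```python
from itertools import product

# Toy illustration: projecting "plausible" URLs from a regular slug pattern.
published = {
    "/definitions/a", "/definitions/b",
    "/frameworks/a",  # note: /frameworks/b is never published
}

categories = {url.split("/")[1] for url in published}  # {definitions, frameworks}
topics = {url.split("/")[2] for url in published}      # {a, b}

# Complete the grid: every category x topic pair the structure suggests.
projected = {f"/{c}/{t}" for c, t in product(categories, topics)}

ghost_candidates = sorted(projected - published)
print(ghost_candidates)  # ['/frameworks/b']
```

The "missing" URL is not forgotten content; it is the most economical completion of the pattern the published slugs expose.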

What the XML sitemap does, and what it does not do

An XML sitemap does its job very well: it declares which URLs exist and can be crawled.

But it says almost nothing about:

  • documentary dependencies between pages;
  • first-hop neighborhoods;
  • pages that should be read together;
  • clarifications required before certain doctrines;
  • FR/EN conceptual equivalences;
  • reading paths that reduce inference space.

In other words, the sitemap publishes the nodes, but not the fine logic of their transitions.

The real problem: topological coherence deficit

On a site that is already dense, already linked, and already governed, the main absence is not always missing text. It may be a deficit of topological readability.

That deficit appears when:

  • two pages should be neighbors, but their proximity is not made explicit;
  • an observation article implicitly points to a doctrine without passing through the required clarification;
  • a concept exists but its documentary dependency is not explicit enough;
  • published links support exploration without sufficiently governing the coherence of traversal.

In that regime, publishing more content can even mask the problem instead of solving it.

From sitemap to coherence map

The right answer is not to replace the XML sitemap with an exotic device.

The right answer is to add a complementary layer: a site coherence map.

That layer would publish, URL by URL, direct neighborhoods, minimum dependencies, recommended paths, and useful equivalences. It would say not only “here are the pages,” but also “here is how they should be read together.”
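To make this concrete, here is what one URL-by-URL record of such a layer might look like, serialized from Python. Every field name is an assumption chosen for illustration; no standard format exists yet for this kind of surface:

```python
import json

# One record of a hypothetical site coherence map.
# Field names are illustrative, not a published standard.
record = {
    "url": "/frameworks/example",
    # direct first-hop neighborhood: pages to read together
    "neighbors": ["/definitions/example", "/observations/example"],
    # minimum documentary dependency: read this first
    "requires_first": ["/clarifications/example"],
    # recommended traversal that reduces the inference space
    "recommended_path": [
        "/definitions/example",
        "/clarifications/example",
        "/frameworks/example",
    ],
    # cross-language conceptual equivalence
    "equivalents": {"fr": "/fr/cadres/exemple"},
}

print(json.dumps(record, indent=2))
```

A file of such records would sit alongside the XML sitemap rather than replacing it: the sitemap keeps declaring the nodes, this layer declares the transitions.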

Such a layer becomes especially useful when agents:

  • revisit the same governance surfaces frequently;
  • jump across multiple layers without stabilizing the right path;
  • generate plausible but unpublished slugs;
  • show signs of local hesitation in the logs.

Coherence linking is not exploration linking

This distinction deserves to be stated clearly.

Exploration linking helps an engine or a user find content. It follows a logic of discovery, navigation, and authority distribution.

Coherence linking answers another need: it helps an agent understand which pages form a direct reading environment, in what order, and for what reason.

A site can perform very well in exploration while remaining partially fragile in coherence.

That is why ghost 404s can be interesting. They often indicate less a lack of content than a lack of explicit neighborhood.

A new audit logic

From there, another form of audit becomes possible.

Instead of starting only from existing pages, one can start from the gap between the published site and the site reconstructed by the agent.

That implies observing:

  • the most frequent ghost URLs;
  • the slug families they imply;
  • revisit cycles between pages and governance files;
  • the zones where an agent is clearly searching for a shorter, more stable, or more logical path.

The audit no longer concerns only documentary quality. It concerns the agentic readability of the graph.
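Such an audit can start from ordinary access logs. A minimal sketch, assuming logs are available as (url, status) pairs and that the published URL set is known; both are assumptions, not a description of any particular stack:

```python
from collections import Counter

def ghost_404_report(log: list[tuple[str, int]], published: set[str]) -> dict:
    """Rank the most frequent ghost URLs and the slug families they imply."""
    # Ghost URLs: 404s on URLs the site never published.
    ghosts = Counter(
        url for url, status in log
        if status == 404 and url not in published
    )
    # Group by first path segment to surface the implied slug families,
    # weighting each family by how often its ghosts were requested.
    families: Counter = Counter()
    for url, hits in ghosts.items():
        families[url.split("/")[1]] += hits
    return {
        "top_ghosts": ghosts.most_common(3),
        "slug_families": families.most_common(3),
    }

log = [
    ("/clarifications/b", 404), ("/clarifications/b", 404),
    ("/clarifications/c", 404), ("/doctrine", 200),
]
report = ghost_404_report(log, published={"/doctrine"})
print(report["top_ghosts"][0])     # ('/clarifications/b', 2)
print(report["slug_families"][0])  # ('clarifications', 3)
```

Revisit cycles and hesitation patterns would need session-level data on top of this, but even a flat frequency report already exposes where agents project structure the site does not publish.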

What should be corrected first

The natural temptation is to immediately produce the supposedly missing pages.

That is sometimes the right response, but not always.

In many cases, the correct order is rather:

  1. verify whether the content already exists under another form;
  2. verify whether its local relations are explicit enough;
  3. reinforce documentary dependencies and clarification bridges;
  4. publish, if necessary, a complementary coherence surface;
  5. only then decide whether new content should be created.

This discipline avoids multiplying conceptual duplicates merely to satisfy an agent’s local projection.
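The ordering above can be encoded as a triage that only reaches "create new content" as a last resort. A schematic sketch: each predicate stands in for an editorial judgment, not an automated check:

```python
# Hypothetical triage for a ghost URL, mirroring the five steps above.
def triage(ghost_url: str,
           exists_under_other_form: bool,
           relations_explicit: bool,
           dependencies_reinforced: bool,
           coherence_surface_published: bool) -> str:
    if not exists_under_other_form:
        # Step 5: only when nothing answers the projection at all.
        return "consider creating new content"
    if not relations_explicit:
        return "make local relations explicit"              # step 2
    if not dependencies_reinforced:
        return "reinforce dependencies and bridges"         # step 3
    if not coherence_surface_published:
        return "publish a complementary coherence surface"  # step 4
    return "no action: the projection is already answerable"

print(triage("/clarifications/b", True, False, False, False))
# -> make local relations explicit
```

The shape of the function is the point: content creation sits behind three structural remedies, not in front of them.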

Why this becomes strategic

This issue goes beyond technique.

As systems read corpora, return to governance artifacts, and reconstruct local neighborhoods, the way a site articulates its pages becomes a factor of interpretive stability.

The site no longer publishes only documents. It publishes a reading environment.

If that environment is not explicit enough, agents fill in the topological gaps themselves. And when they do, they do not necessarily produce a spectacular error. They often produce something plausible, and therefore hard to detect without reading the logs.

Conclusion

An XML sitemap remains necessary. But it is no longer sufficient when one wants to govern the way agents connect pages, guess neighborhoods, and reconstruct the local coherence of a corpus.

That is why a site coherence map becomes relevant. It does not add a new center of truth. It adds a governance layer intended to make minimum interpretive paths explicit and reduce the need to produce fictional URLs.

In an interpreted web, the problem is not only to be crawled. The problem is to be traversed correctly.

