Identifier governance: multigraph disambiguation and machine-first anchoring
Entity collisions, neighborhood contamination, and interpretive capture almost always have a structural cause: identity is carried by signals, not by identifiers.
In an interpreted web, a name is not an identifier. A profile is not proof. A link is not a relation. This framework formalizes a discipline of persistent identity to stabilize an entity across multiple graphs (site, aggregators, databases, RAG, agents).
Operational definition
Identifier governance: a set of rules for defining, publishing, and maintaining persistent identifiers and disambiguation mappings across graphs, in order to reduce collisions, limit out-of-perimeter inference, and make identity auditable.
Why this is essential
- A name can be shared by multiple entities (homonymy).
- A single entity can have variants (spelling, language, branding).
- AI systems infer identity from neighboring context when they lack stable identifiers.
- RAG environments can merge entities if documents are not anchored to an entity identifier.
The goal is not to “make a model understand”. The goal is to anchor the entity persistently.
Application surfaces
- Open web: response engines, external databases, aggregators.
- RAG: chunking, routing, citations, vectors.
- Agentic: execution and decisions that depend on provable identity.
Identifier types
- Canonical on-site identifier: stable entity page URL + persistent @id.
- External identifiers: profiles, databases, directories, registries.
- Documentary identifiers (RAG): docId, version, source, author, date.
- Relation identifiers: parent/subsidiary, sameAs, isBasedOn, relatedTo.
Framework rules (GID-1 to GID-10)
GID-1: a unique canonical identifier
Each entity must have a stable canonical identifier (URL + @id).
GID-2: separation of name vs identity
The name can change. The identifier must remain stable.
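As an illustration of GID-1 and GID-2, here is a minimal Python sketch that emits a JSON-LD node whose identity is carried by its @id rather than by its name. The URL, the "#organization" fragment, and the entity names are placeholder assumptions, not values prescribed by this framework.

```python
import json

# Hypothetical canonical values; replace with the entity's real page URL.
ENTITY_URL = "https://example.com/about/"
ENTITY_ID = ENTITY_URL + "#organization"  # persistent @id, never reassigned


def entity_jsonld(name: str) -> dict:
    """Build a JSON-LD node identified by @id; the name is only a label."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "@id": ENTITY_ID,
        "url": ENTITY_URL,
        "name": name,
    }


before = entity_jsonld("Example Labs")
after = entity_jsonld("Example Group")  # rebrand: only the name changes

# The identifier survives the rename.
assert before["@id"] == after["@id"]
print(json.dumps(after, indent=2))
```

The point of the design is that a rebrand touches one field (name) while every graph that references the @id keeps resolving to the same entity.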
GID-3: explicit variant mapping
Declare variants (languages, acronyms, former names) as variants of the same entity.
GID-4: declared exclusions
Explicitly declare “what the entity is not” when homonymy is plausible.
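GID-3 and GID-4 can be read together as a lookup table with a deny list. The sketch below is one possible shape, assuming a single entity; the identifier, the variant spellings, and the excluded homonym are all invented for illustration.

```python
from typing import Optional

CANONICAL_ID = "https://example.com/about/#organization"  # hypothetical

# GID-3: every declared variant maps to the same identifier.
VARIANTS = {
    "example labs": CANONICAL_ID,
    "examplelabs": CANONICAL_ID,
    "laboratoires example": CANONICAL_ID,  # language variant
    "el": CANONICAL_ID,                    # acronym
}

# GID-4: plausible homonyms that must never resolve to this entity.
EXCLUSIONS = {
    "example labs ltd (uk)": "unrelated company with a similar name",
}


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so spelling noise does not matter."""
    return " ".join(name.lower().split())


def resolve(name: str) -> Optional[str]:
    """Return the canonical identifier, or None if excluded or unknown."""
    key = normalize(name)
    if key in EXCLUSIONS:
        return None
    return VARIANTS.get(key)
```

For example, resolve("Example   Labs") returns the canonical identifier, while resolve("Example Labs Ltd (UK)") returns None because the exclusion is checked before the variant table.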
GID-5: structured relations
Make relations explicit (subsidiary, founder, product, division) to prevent implicit fusions.
GID-6: endogenous coherence
Every page on the site must reference the entity by the same identifier (no internal contradictions).
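One way to check endogenous coherence is to scan the structured data already extracted from each page and flag any node that names the entity under a different @id. This is a sketch under assumptions: the input format (page URL mapped to a list of parsed JSON-LD nodes) and the function name are hypothetical, and extraction of the JSON-LD itself is out of scope.

```python
from typing import Dict, List, Optional, Set, Tuple


def find_id_conflicts(
    pages: Dict[str, List[dict]],
    entity_names: Set[str],
    canonical_id: str,
) -> List[Tuple[str, Optional[str]]]:
    """Return (page_url, found_id) for nodes naming the entity with a wrong or missing @id."""
    conflicts = []
    for url, nodes in pages.items():
        for node in nodes:
            if node.get("name") in entity_names and node.get("@id") != canonical_id:
                conflicts.append((url, node.get("@id")))
    return conflicts


# Hypothetical crawl result: three pages mention the entity.
pages = {
    "https://example.com/": [
        {"name": "Example Labs", "@id": "https://example.com/about/#organization"}
    ],
    "https://example.com/blog/": [
        {"name": "Example Labs", "@id": "https://example.com/blog/#org"}
    ],
    "https://example.com/contact/": [{"name": "Example Labs"}],
}

conflicts = find_id_conflicts(
    pages, {"Example Labs"}, "https://example.com/about/#organization"
)
for url, found in conflicts:
    print(url, "->", found)
```

A node with no @id at all is reported as a conflict as well, since an anonymous node leaves the identity to be inferred from the name.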
GID-7: exogenous coherence
Correct dominant external sources that use erroneous identifiers.
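Before external sources can be corrected, divergences have to be detected. The following sketch compares the identifier each external source is observed to use against an internal registry and assigns a status. Source names, URLs, and the three status labels are assumptions made for the example.

```python
from typing import Dict, Optional

# Hypothetical registry: the identifier each external source should use.
REGISTRY: Dict[str, str] = {
    "business-directory": "https://directory.example/org/12345",
    "open-database": "https://data.example/entity/Q-0001",
    "aggregator": "https://aggregator.example/c/example-labs",
}


def audit_external(observed: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Classify each registered source as 'ok', 'divergent', or 'missing'."""
    statuses = {}
    for source, expected in REGISTRY.items():
        found = observed.get(source)
        if found is None:
            statuses[source] = "missing"
        elif found == expected:
            statuses[source] = "ok"
        else:
            statuses[source] = "divergent"
    return statuses


# What was actually found during a (hypothetical) manual or scripted check.
observed = {
    "business-directory": "https://directory.example/org/12345",
    "open-database": "https://data.example/entity/Q-9999",  # wrong entity
}
statuses = audit_external(observed)
print(statuses)
```

The "divergent" entries are the ones GID-7 asks to correct at the source; "missing" entries are candidates for a new declaration.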
GID-8: RAG anchoring
Each chunked document must retain a source identifier, a version, and a relation to the entity.
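A minimal sketch of what RAG anchoring can look like at chunking time: every chunk carries the document identifier, the version, and the entity identifier alongside its text. The fixed-size splitter and the field names (doc_id, version, entity_id) are illustrative assumptions; a real pipeline would likely split on semantic boundaries.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    text: str
    doc_id: str      # source document identifier
    version: str     # document version the text was taken from
    entity_id: str   # canonical identifier of the entity the document is about
    position: int    # order of the chunk within the document


def chunk_document(
    text: str, doc_id: str, version: str, entity_id: str, size: int = 200
) -> List[Chunk]:
    """Split text into fixed-size chunks that all keep the same anchoring metadata."""
    return [
        Chunk(text[start:start + size], doc_id, version, entity_id, n)
        for n, start in enumerate(range(0, len(text), size))
    ]


chunks = chunk_document(
    "a" * 450,
    doc_id="doc-001",
    version="2",
    entity_id="https://example.com/about/#organization",
)
print(len(chunks), chunks[-1].position)
```

Because the metadata is copied onto each chunk rather than stored only on the parent document, a retrieved chunk can still be attributed to the right entity after it has been separated from its source.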
GID-9: identity proof
For critical attributes, require a fidelity proof that cites the entity identifier, not just matching text.
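The rule can be expressed as a gate on claims: a statement about a critical attribute is accepted only if its evidence names the canonical identifier and a source document. The claim structure, the list of critical attributes, and the function name below are hypothetical and only illustrate the principle.

```python
# Hypothetical list of attributes considered critical for this entity.
CRITICAL_ATTRIBUTES = {"legal_name", "headquarters", "founder"}


def is_provable(claim: dict, canonical_id: str) -> bool:
    """Accept a critical claim only if its evidence cites the identifier and a document."""
    if claim.get("attribute") not in CRITICAL_ATTRIBUTES:
        return True  # non-critical attributes are not gated in this sketch
    evidence = claim.get("evidence") or {}
    return evidence.get("entity_id") == canonical_id and bool(evidence.get("doc_id"))


CANONICAL_ID = "https://example.com/about/#organization"  # hypothetical

anchored = {
    "attribute": "founder",
    "value": "A. Person",
    "evidence": {"entity_id": CANONICAL_ID, "doc_id": "doc-001"},
}
text_only = {
    "attribute": "founder",
    "value": "A. Person",
    "evidence": {"quote": "Example Labs was founded by A. Person"},
}

print(is_provable(anchored, CANONICAL_ID))   # True
print(is_provable(text_only, CANONICAL_ID))  # False
```

The second claim is rejected even though its quote looks plausible: the text mentions a name, and a name is not an identifier.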
GID-10: monitoring and regression
Periodically test for collisions and verify that identifiers remain coherent after each release.
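A regression battery can be as simple as a list of queries with the identifier each one is expected to resolve to, replayed after every release. In this sketch the resolver is a stand-in dictionary lookup; in practice it would wrap whatever system is under test (site search, a RAG pipeline, an external engine). All names and identifiers are placeholders.

```python
from typing import Callable, List, Optional, Tuple

Case = Tuple[str, Optional[str]]  # (query, expected identifier or None)


def run_collision_tests(
    resolver: Callable[[str], Optional[str]], cases: List[Case]
) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Return (query, expected, got) for every case the resolver gets wrong."""
    failures = []
    for query, expected in cases:
        got = resolver(query)
        if got != expected:
            failures.append((query, expected, got))
    return failures


OURS = "https://example.com/about/#organization"  # hypothetical

# Stand-in for the system under test; it wrongly merges a homonym into our entity.
ANSWERS = {"example labs": OURS, "example labs ltd (uk)": OURS}


def toy_resolver(query: str) -> Optional[str]:
    return ANSWERS.get(query.lower())


cases = [
    ("Example Labs", OURS),           # must resolve to our entity
    ("Example Labs Ltd (UK)", None),  # homonym: must not resolve to us
]
failures = run_collision_tests(toy_resolver, cases)
print(failures)
```

Here the second case fails, which is the kind of collision the battery exists to surface; an empty failure list after a release means the tested identifiers are still coherent.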
Implementation process
- Define the entity and create its canonical identifier.
- Create an internal disambiguation page if necessary.
- Declare variants and exclusions.
- Structure relations (internal graph).
- Map external identifiers and correct divergences.
- In RAG, attach each document to the entity identifier.
- Test across multiple AI systems and monitor collisions.
Expected artifacts
- Identifier registry (canonical + external).
- Variant and exclusion table.
- Entity relation map (internal graph).
- Multigraph mapping (dominant sources, statuses).
- Test battery (collisions, substitutions, contaminations).
FAQ
Why is this not just “sameAs”?
Because governance includes exclusions, relations, versions, and RAG/agentic implementation.
What most frequently breaks identifiers?
URL migrations, rebrands, duplicate pages, and uncorrected aggregators.
What is the main benefit?
Fewer collisions, and an identity that is provable and therefore governable.