Sampling, representativeness, and comparison corpora

A doctrinal corpus that publishes comparisons soon runs into a question that is harder than it first appears: what counts as a good sample? As long as doctrine only illustrates a concept, the answer can remain vague. As soon as it publishes public benchmarks, formalized test cases, or comparative dossiers, sampling stops being a logistical issue. It becomes a problem of doctrinal scope.

A poorly built corpus may show something locally true and still lead to a misleading generalization. Conversely, a smaller but better-bounded corpus may reveal far more useful tensions: negative cases, version conflicts, dominant third-party surfaces, lagging translations, archived states, procedural exceptions, or legitimate forms of non-response.

This page builds on public benchmarks, applied observability, and doctrinal jurisprudence. It fixes a simple requirement: a publishable sample must make explicit what it covers, what it does not cover, and what type of generalization it genuinely authorizes.

1. Why sampling is a doctrinal question

In weak practice, sampling is treated as a technical detail. A few available cases, a few convincing examples, and a few legible screenshots are retained, and the resulting series is then allowed to support a conclusion that is too broad.

In a doctrinal corpus, that shortcut is dangerous. The problem is not only statistical bias. The problem is the gap between the scope of the corpus and the scope of the statement. A site may compare three cases and speak as if it had mapped a regime. It may observe a single language and speak as if it had described a general dynamic. It may retain clean cases and omit precisely the ones that make doctrine demanding.

Sampling therefore becomes doctrinal when it conditions the legitimate form of what may be said. A badly composed series is not merely less robust. It authorizes discourse that is too large for its actual perimeter.

2. Five objects that must be distinguished

a) Sample

The sample is the subset actually observed. By itself it says nothing about its quality. One still needs to know according to what rule it was built.

b) Comparison corpus

The comparison corpus is the bounded whole within which comparison is deemed valid. It often includes the observed cases, exclusions, variants, and reading conditions.

c) Stratum

A stratum designates a family of cases one wants to render visible in the corpus: multilingual states, exogenous third-party surfaces, multimodal surfaces, procedural exceptions, archive residues, non-response, version conflict, and so on.

d) Minimal fixture

A minimal fixture isolates a precise mechanism. It does not aim at overall representativeness, but at readability of a tension. It prepares the formalized test case.

e) Publishable series

The publishable series is the final set the site exposes. It need not be exhaustive, but it must be honest about its construction and scope.

Without those distinctions, a corpus easily moves from demonstrative to generalizing without declaring that shift.

3. What “representative” really means here

On a doctrinal site, representativeness is not necessarily a strong statistical ambition. It is first a requirement of covering relevant tensions.

A corpus may be relatively small and still well built if it covers several regimes that truly matter for doctrine: source variation, language variation, temporal variation, negative case, exception, active archive, dominant third-party surface, conflict between canon and reprise, or legitimate suspension.

Conversely, a large corpus may remain poor if it repeats the same type of case endlessly, in the same language, over the same perimeter, with the same level of documentary cleanliness.

In other words, a corpus is not more representative simply because it contains “more.” It is more representative when it makes visible the diversity of constraints the statement claims to cover.
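The contrast between size and coverage can be made concrete with a small sketch. It assumes each case is tagged with the strata it exhibits; the function name `stratum_coverage`, the stratum labels, and both example corpora are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter

def stratum_coverage(cases, claimed_strata):
    """Return (covered, missing, counts) for the strata a statement claims to cover."""
    # Count how many cases exhibit each stratum.
    counts = Counter(s for case in cases for s in case["strata"])
    covered = {s for s in claimed_strata if counts[s] > 0}
    missing = set(claimed_strata) - covered
    return covered, missing, counts

# Hypothetical corpora: a large homogeneous one and a small varied one.
large = [{"id": f"c{i}", "strata": {"source_variation"}} for i in range(50)]
small = [
    {"id": "a", "strata": {"source_variation", "language_variation"}},
    {"id": "b", "strata": {"negative_case"}},
    {"id": "c", "strata": {"active_archive", "temporal_variation"}},
]

# Strata the published statement claims to speak about (illustrative labels).
claimed = {"source_variation", "language_variation", "negative_case", "active_archive"}

_, missing_large, _ = stratum_coverage(large, claimed)
_, missing_small, _ = stratum_coverage(small, claimed)

print(sorted(missing_large))  # strata the 50-case corpus never shows
print(sorted(missing_small))  # strata the 3-case corpus never shows
```

Under these assumptions the fifty-case corpus leaves three of the four claimed strata uncovered, while the three-case corpus leaves none. The check says nothing about the quality of individual cases; it only compares what the corpus shows with what the statement claims.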

4. Classic drifts of a badly composed corpus

The first drift is cherry-picking: retaining only demonstrative cases, those where the architecture looks elegant, or those where the system fails spectacularly.

The second is misleading homogeneity: comparing only clean, textual, recent, well-linked surfaces and then concluding about terrains where archives, screenshots, translations, or third-party profiles play a major role.

The third is temporal flattening: mixing historical, current, and superseded states without declaring their status.

The fourth is erasing negative cases: excluding non-responses, legitimate refusals, partial outputs, undecidable cases, or unresolved contradictions.

The fifth is rhetorical overgeneralization: speaking about a system, a regime, or an entire family when the corpus in fact bears only on a narrow stratum.

5. Minimum conditions for a publishable comparison corpus

A publishable comparison corpus should at least make visible:

the inclusion rule for cases;
the retained strata and absent strata;
the presence of negative cases and forms of non-response;
the observation window and temporal status of objects;
the relation between the corpus and the statement it is meant to support;
the revision conditions: addition, withdrawal, rectification, supersession.

Those conditions do not aim at crushing publication under method. They simply prevent a very local corpus from being read as a general judgment on a broader regime.
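One possible way to keep those conditions visible is a small manifest published alongside the series. The sketch below is an assumption, not a prescribed format: the class `CorpusManifest`, its field names, and the helper `undeclared` are hypothetical. A field left as `None` means "not declared", which is different from an explicitly declared empty list (for instance, a corpus that states it contains no negative cases).

```python
from dataclasses import dataclass, fields
from typing import List, Optional

@dataclass
class CorpusManifest:
    # Each field mirrors one of the minimum conditions; None means "undeclared".
    inclusion_rule: Optional[str] = None
    retained_strata: Optional[List[str]] = None
    absent_strata: Optional[List[str]] = None
    negative_cases: Optional[List[str]] = None
    observation_window: Optional[str] = None
    supported_statement: Optional[str] = None
    revision_conditions: Optional[str] = None

def undeclared(manifest):
    """List the conditions the manifest has not yet made explicit."""
    return [f.name for f in fields(manifest) if getattr(manifest, f.name) is None]

# Hypothetical draft: only the inclusion rule and retained strata are declared.
draft = CorpusManifest(
    inclusion_rule="all cases observed on the listed surfaces during the window",
    retained_strata=["language_variation", "negative_case"],
)

print(undeclared(draft))
```

A draft that still returns a non-empty list is not necessarily unpublishable, but each remaining name marks something a reader cannot yet verify about the scope of the corpus.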

6. From singular case to benchmark: what counts as a healthy progression?

A doctrinally clean progression distinguishes several levels.

The limit case reveals a tension. The comparative dossier reconstructs it with its pieces and hierarchy. The minimal interpretive fixture isolates a mechanism. The benchmark or baseline integrates several objects into a series. Between those levels, sampling acts as the hinge. It decides which tensions become visible in publication.

That is why this page connects directly to archives and rectification. A good corpus does not only declare its cases. It also declares how it will revise the series when a state is withdrawn, superseded, or requalified.
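A declared revision rule can be as simple as a status attached to each entry, with a history that is kept rather than overwritten. The following is a minimal sketch under that assumption; `SeriesEntry`, `revise`, the status labels, the case identifier, and the date are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Illustrative status labels, taken from the revision situations named in the text.
STATUSES = {"current", "withdrawn", "superseded", "requalified"}

@dataclass
class SeriesEntry:
    case_id: str
    status: str = "current"
    # Each history item records (date, previous_status, new_status, note).
    history: List[Tuple[str, str, str, str]] = field(default_factory=list)

def revise(entry, date, new_status, note):
    """Change an entry's status while keeping a visible trace of the change."""
    if new_status not in STATUSES:
        raise ValueError(f"unknown status: {new_status}")
    entry.history.append((date, entry.status, new_status, note))
    entry.status = new_status
    return entry

# Hypothetical usage: a published case is later superseded by a newer state.
entry = SeriesEntry("case-01")
revise(entry, "2024-05-01", "superseded", "replaced by a newer observed state")

print(entry.status)
print(entry.history)
```

The design choice here is that a revision never deletes the entry: the series keeps showing that the case existed and why its status changed.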

7. Scope and limit

This page proposes neither a universal theory of sampling, nor a total benchmark, nor a promise of exhaustiveness. It fixes a more modest requirement: whenever a doctrinal corpus compares, illustrates, or observes, it must align the construction of its corpus with the actual level of generalization it claims.

A doctrinal site may publish weak, local, and partial series. It becomes fragile only when it forgets to say that they are.
