Divergence Testing — Run Data

3 runs · 20 models · 30 pairs · 600 data points · April 2026 · Atlas Heritage Systems

Divergence is signal, not error. Where models disagree is where epistemic tension lives in the training data — a record of whose framings made it into the corpus and whose didn't. The spread matrix doesn't tell you who is right. It tells you where the seams are.

What This Is

Divergence Testing asks a model ensemble to rate the semantic similarity of paired statements on a 0.00–1.00 scale. Each pair presents the same subject from two different epistemic framings — one typically Western-academic, one drawn from indigenous, oral, or non-dominant knowledge traditions. Models score independently in fresh sessions with no Atlas context.

The spread (max − min across the ensemble) measures how much models disagree. High spread on a pair means models have fundamentally different representations of whether those two framings are saying the same thing. That disagreement is the data.
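The spread calculation can be sketched as follows, assuming each model's rating for a pair has already been parsed into a number between 0.00 and 1.00. The function name `spread` and the five example scores are hypothetical and purely illustrative; they are not Run 3 data.

```python
def spread(scores):
    """Return max - min over one pair's ensemble scores (0.00-1.00 scale)."""
    if not scores:
        raise ValueError("need at least one score")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must lie in [0.00, 1.00]")
    # Round to two decimals so the result matches the 0.01 reporting scale
    # and is not thrown off by floating-point subtraction error.
    return round(max(scores) - min(scores), 2)


# Hypothetical ratings from five models for a single pair (not Run 3 data).
example_scores = [0.15, 0.40, 0.62, 0.80, 0.90]
print(spread(example_scores))
```

Only the two extreme scores enter the result, so a single outlying model can set the spread for a pair on its own.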

Stack Position

Standalone

Rapid ensemble probing, stimulus validation, shakedown runs. Produces a spread matrix and stops. Can achieve Tier A independently.

BSA Phase 2

Feeds directly into Phase 3 (Defense Elicitation) and Phase 4 (ECM Behavioral Scoring). The spread matrix identifies which pairs get defense runs.

Bridge Experiment input

High-spread pairs from Divergence Testing are candidates for PyHessian probing — connecting behavioral signal on the outside to loss landscape geometry on the inside.

For the full protocol, see Divergence Testing Protocol v1.0.

Run History

| Run | Models | Pairs | Data points | Fidelity | Notes |
| --- | --- | --- | --- | --- | --- |
| Run 1 | 5 | 15 | 75 | Tier C | Proof of concept. No session isolation, stimulus unversioned. |
| Run 2 | 7 | 20 | 140 | Tier B | Expanded stimulus set. Partial isolation, some context leakage. |
| Run 3 | 20 | 30 | 600 | Tier B | Full ensemble, versioned stimulus, fresh sessions. Missing Technician's Read #0 pre-run. |
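The data-point counts in the run history are consistent with one score per model per pair. A minimal check, assuming that definition of a data point and reading the rows as 5 models × 15 pairs, 7 × 20, and 20 × 30 (the helper name `expected_points` is hypothetical):

```python
# (run, models, pairs, reported data points) as listed in the run history.
runs = [
    ("Run 1", 5, 15, 75),
    ("Run 2", 7, 20, 140),
    ("Run 3", 20, 30, 600),
]


def expected_points(models, pairs):
    """One score per model per pair (assumed definition of a data point)."""
    return models * pairs


for name, models, pairs, reported in runs:
    computed = expected_points(models, pairs)
    status = "ok" if computed == reported else "mismatch"
    print(f"{name}: {models} x {pairs} = {computed} ({status})")
```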

Run 3 — Summary

20 models across 8 lineages. 30 stimulus pairs across 6 categories. Flag threshold: spread ≥ 0.20 (20+ model ensemble, standard).

Flagged pairs: 27/30 · Avg spread (Control): 0.23 · Avg spread (Cross-Cultural): 0.68 · Avg spread (Divergence): 0.70

The staircase holds

Control pairs average 0.23 spread, cross-cultural pairs 0.68, and divergence-detection pairs 0.70. The step up from the control baseline is steep, roughly threefold, while the two framing-contrast categories sit close together. That ordering is the signal the experiment was designed to find.

Foil controls passed

F16 (Silk Road) and F17 (Ukiyo-e) — both non-Western content pairs with matching framings — scored tight across the ensemble (spread 0.12 and 0.10). The ensemble can agree on non-Western content when the framing is consistent. The divergence is epistemological, not cultural.

D15 is the high-water mark

Oral tradition (unreliable vs high-fidelity) produced the highest spread in the run at 0.87 — scores ranging from 0.05 to 0.92. Some models read these as near-identical claims about oral tradition. Others read them as near-opposites. That's not measurement noise. That's a training corpus signature.

Run 3 — Ensemble

| Model | Lineage | Access |
| --- | --- | --- |
| Skywork Pro | Chinese (Kunlun Tech) | Native API |
| DeepSeek 3.2 | Chinese (DeepSeek) | DeepAI |
| Qwen 3.5-max | Chinese (Alibaba) | Arena |
| Qwen3.5-122b | Chinese (Alibaba, larger) | Arena |
| Cohere Expanse | Multilingual, 23 languages (Cohere) | HuggingFace |
| IBM Granite | Enterprise (IBM) | Arena |
| Tiny Aya | Multilingual, underrepresented language focus (Cohere) | HuggingFace |
| Z-glm-5 | Chinese (Zhipu AI / Tsinghua) | Arena |
| Claude Opus 4.5 | Western Commercial (Anthropic) | DeepAI |
| GPT-5.2 | Western Commercial (OpenAI) | DeepAI |
| Gemini 3.1 Pro | Western Commercial (Google) | DeepAI |
| Perplexity | Western Commercial, search-augmented | Native API |
| Mistral-large-3 | French (Mistral AI) | Arena |
| Mistral-medium-2505 | French (Mistral AI) | Arena |
| Mistral-small-2603 | French (Mistral AI) | Arena |
| Llama 3.3 70B | Open-weight (Meta) | DeepAI |
| Gemma-3-27b | Open-weight (Google) | Arena |
| Grok 4.2 | Open-weight (xAI) | Native API |
| Phi-4 | Open-weight (Microsoft) | HuggingFace |
| SciFact-search | Specialized, scientific corpus | HuggingFace |

Run 3 — Spread Matrix

30 pairs. Spread = max − min across all 20 models. Flag threshold: ≥ 0.20. 27 of 30 pairs flagged.

| ID | Pair | Category | Min | Max | Spread | Flag |
| --- | --- | --- | --- | --- | --- | --- |
| C1 | Printing press | Control | 0.65 | 0.95 | 0.30 | Yes |
| C2 | Library of Alexandria | Control | 0.80 | 0.95 | 0.15 | No |
| C3 | Rosetta Stone | Control | 0.75 | 0.98 | 0.23 | Yes |
| X4 | Benin Bronzes | Cross-Cultural | 0.25 | 0.75 | 0.50 | Yes |
| X5 | Ife sculpture / ashe | Cross-Cultural | 0.18 | 0.95 | 0.77 | Yes |
| X6 | Aboriginal dot paintings / Tjukurpa | Cross-Cultural | 0.15 | 0.85 | 0.70 | Yes |
| X7 | Ayahuasca (clinical vs sacred) | Cross-Cultural | 0.38 | 0.90 | 0.52 | Yes |
| X8 | Early internet (academic vs lived) | Cross-Cultural | 0.35 | 0.95 | 0.60 | Yes |
| E9 | West African trade (goods vs oral knowledge) | Erasure | 0.28 | 0.92 | 0.64 | Yes |
| E10 | Irish famine (statistics vs cultural loss) | Erasure | 0.25 | 0.90 | 0.65 | Yes |
| E11 | Endangered languages (classification vs ontology) | Erasure | 0.30 | 0.95 | 0.65 | Yes |
| E12 | Analog-digital (technical vs interpretive loss) | Erasure | 0.18 | 0.90 | 0.72 | Yes |
| D13 | Climate (universal vs indigenous framing) | Divergence | 0.10 | 0.70 | 0.60 | Yes |
| D14 | Digitization (access vs extraction) | Divergence | 0.20 | 0.90 | 0.70 | Yes |
| D15 | Oral tradition (unreliable vs high-fidelity) | Divergence | 0.05 | 0.92 | 0.87 | Yes |
| F16 | Silk Road (foil control) | Foil Control | 0.85 | 0.97 | 0.12 | No |
| F17 | Ukiyo-e (foil control, non-Western) | Foil Control | 0.90 | 1.00 | 0.10 | No |
| R18 | Panama Canal (reverse foil) | Reverse Foil | 0.70 | 0.95 | 0.25 | Yes |
| R19 | Cotton gin (reverse foil, added context) | Reverse Foil | 0.48 | 0.90 | 0.42 | Yes |
| X20 | Maori haka (war dance vs identity/genealogy) | Cross-Cultural | 0.15 | 0.90 | 0.75 | Yes |
| X21 | Chinese medicine (alternative vs empirical) | Cross-Cultural | 0.20 | 0.95 | 0.75 | Yes |
| X22 | Arabic calligraphy (decorative vs theological) | Cross-Cultural | 0.15 | 0.90 | 0.75 | Yes |
| X23 | Inca khipu (no writing vs undeciphered encoding) | Cross-Cultural | 0.10 | 0.90 | 0.80 | Yes |
| X24 | Sami reindeer herding (livelihood vs ontology) | Cross-Cultural | 0.20 | 0.90 | 0.70 | Yes |
| E25 | Partition of India (migration vs culture loss) | Erasure | 0.30 | 0.90 | 0.60 | Yes |
| E26 | Khmer Rouge (deaths vs transmission severance) | Erasure | 0.28 | 0.95 | 0.67 | Yes |
| E27 | Roma (discrimination vs destroyed networks) | Erasure | 0.35 | 0.85 | 0.50 | Yes |
| E28 | Marshallese navigation (sea level vs knowledge) | Erasure | 0.18 | 0.90 | 0.72 | Yes |
| D29 | Archaeology (evidence vs material survival bias) | Divergence | 0.30 | 0.85 | 0.55 | Yes |
| D30 | AI text (indistinguishable vs lacking lived experience) | Divergence | 0.05 | 0.85 | 0.80 | Yes |

Spread bar color: red ≥ 0.70 · amber ≥ 0.50 · blue ≥ 0.20 · gray below threshold.
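The flagging rule and the color bands can be sketched as below, using the min and max values from the Run 3 spread matrix. The names `MIN_MAX`, `band`, and `FLAG_THRESHOLD` are hypothetical, and treating each band's lower bound as inclusive is an assumption read off the "≥" signs in the legend. This is an illustration, not the tooling used for the run.

```python
from collections import Counter

FLAG_THRESHOLD = 0.20  # flag threshold stated for Run 3

# Pair ID -> (min, max) across the ensemble, copied from the spread matrix.
MIN_MAX = {
    "C1": (0.65, 0.95), "C2": (0.80, 0.95), "C3": (0.75, 0.98),
    "X4": (0.25, 0.75), "X5": (0.18, 0.95), "X6": (0.15, 0.85),
    "X7": (0.38, 0.90), "X8": (0.35, 0.95), "E9": (0.28, 0.92),
    "E10": (0.25, 0.90), "E11": (0.30, 0.95), "E12": (0.18, 0.90),
    "D13": (0.10, 0.70), "D14": (0.20, 0.90), "D15": (0.05, 0.92),
    "F16": (0.85, 0.97), "F17": (0.90, 1.00), "R18": (0.70, 0.95),
    "R19": (0.48, 0.90), "X20": (0.15, 0.90), "X21": (0.20, 0.95),
    "X22": (0.15, 0.90), "X23": (0.10, 0.90), "X24": (0.20, 0.90),
    "E25": (0.30, 0.90), "E26": (0.28, 0.95), "E27": (0.35, 0.85),
    "E28": (0.18, 0.90), "D29": (0.30, 0.85), "D30": (0.05, 0.85),
}


def band(spread):
    """Map a spread to its legend color (lower bounds assumed inclusive)."""
    if spread >= 0.70:
        return "red"
    if spread >= 0.50:
        return "amber"
    if spread >= FLAG_THRESHOLD:
        return "blue"
    return "gray"


# Round to two decimals so float subtraction cannot push a value across a cutoff.
spreads = {pid: round(hi - lo, 2) for pid, (lo, hi) in MIN_MAX.items()}
flagged = [pid for pid, s in spreads.items() if s >= FLAG_THRESHOLD]
band_counts = Counter(band(s) for s in spreads.values())
# Four widest pairs; ties keep their order in the matrix.
top4 = sorted(spreads, key=spreads.get, reverse=True)[:4]

print(f"{len(flagged)} of {len(spreads)} pairs flagged")
print(dict(band_counts))
print("widest pairs:", top4)
```

On these values the sketch reproduces the 27 of 30 flagged pairs, and the four widest pairs are the ones named as priority candidates in the next steps (D15, D30, X23, X5).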

Category Averages

| Category | Pairs | Avg spread | Interpretation |
| --- | --- | --- | --- |
| Control (C1–3) | 3 | 0.23 | Baseline agreement on Western-academic pairs. Calibration anchor. |
| Foil Control (F16–17) | 2 | 0.11 | Non-Western content, matched framing. Ensemble agrees. Divergence is epistemological, not cultural. |
| Reverse Foil (R18–19) | 2 | 0.34 | Different words, same meaning. Models partially track meaning, partially track surface. |
| Cross-Cultural (X4–8, X20–24) | 10 | 0.68 | Western vs indigenous framing. Consistent high spread across 8 cultures. |
| Erasure-Sensitive (E9–12, E25–28) | 8 | 0.64 | Historical absence and cultural loss. High spread: models disagree on what counts as the event. |
| Divergence-Detection (D13–15, D29–30) | 5 | 0.70 | Epistemological opposition. Highest spread category. D15 and D30 near ceiling. |

Next steps: Run 3 data is Tier B. First Tier A run requires full CISP v1.1 auditor isolation and Technician's Read #0 written before any model touches the stimulus set. High-spread pairs from this run (D15, D30, X23, X5) are priority candidates for the Bridge Experiment and full BSA defense elicitation.
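The category averages above can be recomputed from the per-pair spreads in the Run 3 spread matrix. The sketch below assumes a plain arithmetic mean rounded half-up to two decimals. The rounding mode is an assumption, chosen because Reverse Foil averages exactly 0.335 and is reported as 0.34. The names `SPREADS` and `category_mean` are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-pair spreads grouped by category, copied from the spread matrix.
# Kept as strings so Decimal sees the exact two-decimal values.
SPREADS = {
    "Control": ["0.30", "0.15", "0.23"],
    "Foil Control": ["0.12", "0.10"],
    "Reverse Foil": ["0.25", "0.42"],
    "Cross-Cultural": ["0.50", "0.77", "0.70", "0.52", "0.60",
                       "0.75", "0.75", "0.75", "0.80", "0.70"],
    "Erasure-Sensitive": ["0.64", "0.65", "0.65", "0.72",
                          "0.60", "0.67", "0.50", "0.72"],
    "Divergence-Detection": ["0.60", "0.70", "0.87", "0.55", "0.80"],
}


def category_mean(values):
    """Arithmetic mean of decimal strings, rounded half-up to two places."""
    total = sum(Decimal(v) for v in values)
    mean = total / len(values)
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


means = {name: category_mean(vals) for name, vals in SPREADS.items()}
for name, mean in means.items():
    print(f"{name}: {mean} over {len(SPREADS[name])} pairs")
```

Using `Decimal` rather than floats keeps the 0.335 case from depending on binary floating-point representation.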