
Divergence Testing — Run Data

3 runs · Run 3: 20 models · 30 pairs · 600 data points · April 2026 · Atlas Heritage Systems

Divergence is signal, not error. Where models disagree is where epistemic tension lives in the training data — a record of whose framings made it into the corpus and whose didn't. The spread matrix doesn't tell you who is right. It tells you where the seams are.

What This Is

Divergence Testing asks a model ensemble to rate the semantic similarity of paired statements on a 0.00–1.00 scale. Test pairs present the same subject from two different epistemic framings: one typically Western-academic, one drawn from indigenous, oral, or non-dominant knowledge traditions. Control and foil-control pairs use matched framings and serve as calibration anchors. Models score independently in fresh sessions with no Atlas context.

The spread (max − min across the ensemble) measures how much models disagree. High spread on a pair means models have fundamentally different representations of whether those two framings are saying the same thing. That disagreement is the data.
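The spread definition can be sketched in a few lines of Python. This is a minimal illustration, not the project's own tooling: the function name `spread`, the model labels, and the middle score are invented for the example. Only the endpoints (0.05 and 0.92) come from the D15 row of the Run 3 matrix.

```python
def spread(scores):
    """Return max - min over one pair's ensemble scores, to two decimals."""
    values = list(scores.values())
    if not values:
        raise ValueError("need at least one score")
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("scores must lie on the 0.00-1.00 scale")
    # Round so float subtraction noise does not leak into the reported value.
    return round(max(values) - min(values), 2)

# Hypothetical ensemble for one pair: labels and the middle value are made up.
example_scores = {"model_a": 0.05, "model_b": 0.40, "model_c": 0.92}
print(spread(example_scores))  # 0.87
```

Because spread depends only on the two extreme scores, a single outlying model can set it. That is worth keeping in mind when reading the matrix.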

Stack Position

Standalone

Rapid ensemble probing, stimulus validation, shakedown runs. Produces a spread matrix and stops. Can achieve Tier A independently.

BSA Phase 2

Feeds directly into Phase 3 (Defense Elicitation) and Phase 4 (ECM Behavioral Scoring). The spread matrix identifies which pairs get defense runs.

Bridge Experiment input

High-spread pairs from Divergence Testing are candidates for PyHessian probing — connecting behavioral signal on the outside to loss landscape geometry on the inside.

For the full protocol, see Divergence Testing Protocol v1.0.

Run History

Run	Models	Pairs	Data points	Fidelity	Notes
Run 1	5	15	75	Tier C	Proof of concept. No session isolation, stimulus unversioned.
Run 2	7	20	140	Tier B	Expanded stimulus set. Partial isolation — some context leakage.
Run 3	20	30	600	Tier B	Full ensemble, versioned stimulus, fresh sessions. Missing Technician's Read #0 pre-run.
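As a quick consistency check on the table above, each run's data-point count should equal models × pairs. This assumes every model scored every pair exactly once, which the table implies but does not state. A minimal Python sketch, with the tuple layout and function name chosen here for illustration:

```python
# (models, pairs, reported data points) copied from the Run History table.
RUNS = {
    "Run 1": (5, 15, 75),
    "Run 2": (7, 20, 140),
    "Run 3": (20, 30, 600),
}

def check_runs(runs):
    """Return the run names whose reported count differs from models * pairs."""
    return [
        name
        for name, (models, pairs, reported) in runs.items()
        if models * pairs != reported
    ]

mismatches = check_runs(RUNS)
total_points = sum(reported for _, _, reported in RUNS.values())
print(mismatches, total_points)  # [] 815
```

All three rows are consistent under that assumption, and the three runs together contribute 815 scores.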

Run 3 — Summary

20 models across 8 lineages. 30 stimulus pairs across 6 categories. Flag threshold: spread ≥ 0.20 (the standard threshold for ensembles of 20 or more models).

Metric	Value
Flagged pairs	27/30
Avg spread, Control	0.23
Avg spread, Cross-Cultural	0.68
Avg spread, Divergence	0.70

The staircase holds

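Control pairs average 0.23 spread. Erasure-sensitive pairs average 0.64, cross-cultural pairs 0.68, and divergence-detection pairs 0.70. Most of the rise is the single step from control to the contested categories (0.23 to 0.64); the three contested categories then sit within 0.06 of each other. That ordering is the signal the experiment was designed to find.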

Foil controls passed

F16 (Silk Road) and F17 (Ukiyo-e) — both non-Western content pairs with matching framings — scored tight across the ensemble (spread 0.12 and 0.10). The ensemble can agree on non-Western content when the framing is consistent. The divergence is epistemological, not cultural.

D15 is the high-water mark

Oral tradition (unreliable vs high-fidelity) produced the highest spread in the run at 0.87 — scores ranging from 0.05 to 0.92. Some models read these as near-identical claims about oral tradition. Others read them as near-opposites. That's not measurement noise. That's a training corpus signature.

Run 3 — Ensemble

Model	Lineage	Access
Skywork Pro	Chinese (Kunlun Tech)	Native API
DeepSeek 3.2	Chinese (DeepSeek)	DeepAI
Qwen 3.5-max	Chinese (Alibaba)	Arena
Qwen3.5-122b	Chinese (Alibaba, larger)	Arena
Cohere Expanse	Multilingual — 23 languages (Cohere)	HuggingFace
IBM Granite	Enterprise (IBM)	Arena
Tiny Aya	Multilingual — underrepresented language focus (Cohere)	HuggingFace
Z-glm-5	Chinese (Zhipu AI / Tsinghua)	Arena
Claude Opus 4.5	Western Commercial (Anthropic)	DeepAI
GPT-5.2	Western Commercial (OpenAI)	DeepAI
Gemini 3.1 Pro	Western Commercial (Google)	DeepAI
Perplexity	Western Commercial — search-augmented	Native API
Mistral-large-3	French (Mistral AI)	Arena
Mistral-medium-2505	French (Mistral AI)	Arena
Mistral-small-2603	French (Mistral AI)	Arena
Llama 3.3 70B	Open-weight (Meta)	DeepAI
Gemma-3-27b	Open-weight (Google)	Arena
Grok 4.2	Open-weight (xAI)	Native API
Phi-4	Open-weight (Microsoft)	HuggingFace
SciFact-search	Specialized — scientific corpus	HuggingFace

Run 3 — Spread Matrix

30 pairs. Spread = max − min across all 20 models. Flag threshold: ≥ 0.20. 27 of 30 pairs flagged.

ID	Pair	Category	Min	Max	Spread	Flag
C1	Printing press	Control	0.65	0.95	0.30	✓
C2	Library of Alexandria	Control	0.80	0.95	0.15
C3	Rosetta Stone	Control	0.75	0.98	0.23	✓
X4	Benin Bronzes	Cross-Cultural	0.25	0.75	0.50	✓
X5	Ife sculpture / ashe	Cross-Cultural	0.18	0.95	0.77	✓
X6	Aboriginal dot paintings / Tjukurpa	Cross-Cultural	0.15	0.85	0.70	✓
X7	Ayahuasca (clinical vs sacred)	Cross-Cultural	0.38	0.90	0.52	✓
X8	Early internet (academic vs lived)	Cross-Cultural	0.35	0.95	0.60	✓
E9	West African trade (goods vs oral knowledge)	Erasure	0.28	0.92	0.64	✓
E10	Irish famine (statistics vs cultural loss)	Erasure	0.25	0.90	0.65	✓
E11	Endangered languages (classification vs ontology)	Erasure	0.30	0.95	0.65	✓
E12	Analog-digital (technical vs interpretive loss)	Erasure	0.18	0.90	0.72	✓
D13	Climate (universal vs indigenous framing)	Divergence	0.10	0.70	0.60	✓
D14	Digitization (access vs extraction)	Divergence	0.20	0.90	0.70	✓
D15	Oral tradition (unreliable vs high-fidelity)	Divergence	0.05	0.92	0.87	✓
F16	Silk Road (foil control)	Foil Control	0.85	0.97	0.12
F17	Ukiyo-e (foil control — non-Western)	Foil Control	0.90	1.00	0.10
R18	Panama Canal (reverse foil)	Reverse Foil	0.70	0.95	0.25	✓
R19	Cotton gin (reverse foil — added context)	Reverse Foil	0.48	0.90	0.42	✓
X20	Maori haka (war dance vs identity/genealogy)	Cross-Cultural	0.15	0.90	0.75	✓
X21	Chinese medicine (alternative vs empirical)	Cross-Cultural	0.20	0.95	0.75	✓
X22	Arabic calligraphy (decorative vs theological)	Cross-Cultural	0.15	0.90	0.75	✓
X23	Inca khipu (no writing vs undeciphered encoding)	Cross-Cultural	0.10	0.90	0.80	✓
X24	Sami reindeer herding (livelihood vs ontology)	Cross-Cultural	0.20	0.90	0.70	✓
E25	Partition of India (migration vs culture loss)	Erasure	0.30	0.90	0.60	✓
E26	Khmer Rouge (deaths vs transmission severance)	Erasure	0.28	0.95	0.67	✓
E27	Roma (discrimination vs destroyed networks)	Erasure	0.35	0.85	0.50	✓
E28	Marshallese navigation (sea level vs knowledge)	Erasure	0.18	0.90	0.72	✓
D29	Archaeology (evidence vs material survival bias)	Divergence	0.30	0.85	0.55	✓
D30	AI text (indistinguishable vs lacking lived experience)	Divergence	0.05	0.85	0.80	✓
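The Spread and Flag columns can be recomputed from the Min and Max columns alone. The sketch below does that in Python, assuming the flag rule is simply spread ≥ 0.20 as stated above. The names `ROWS`, `FLAG_THRESHOLD`, and `pair_spread` are illustrative, not part of the protocol.

```python
# (ID, min, max) copied from the Run 3 spread matrix.
ROWS = [
    ("C1", 0.65, 0.95), ("C2", 0.80, 0.95), ("C3", 0.75, 0.98),
    ("X4", 0.25, 0.75), ("X5", 0.18, 0.95), ("X6", 0.15, 0.85),
    ("X7", 0.38, 0.90), ("X8", 0.35, 0.95), ("E9", 0.28, 0.92),
    ("E10", 0.25, 0.90), ("E11", 0.30, 0.95), ("E12", 0.18, 0.90),
    ("D13", 0.10, 0.70), ("D14", 0.20, 0.90), ("D15", 0.05, 0.92),
    ("F16", 0.85, 0.97), ("F17", 0.90, 1.00), ("R18", 0.70, 0.95),
    ("R19", 0.48, 0.90), ("X20", 0.15, 0.90), ("X21", 0.20, 0.95),
    ("X22", 0.15, 0.90), ("X23", 0.10, 0.90), ("X24", 0.20, 0.90),
    ("E25", 0.30, 0.90), ("E26", 0.28, 0.95), ("E27", 0.35, 0.85),
    ("E28", 0.18, 0.90), ("D29", 0.30, 0.85), ("D30", 0.05, 0.85),
]
FLAG_THRESHOLD = 0.20

def pair_spread(lo, hi):
    # Round to two decimals so float subtraction noise cannot flip a flag.
    return round(hi - lo, 2)

spreads = {pid: pair_spread(lo, hi) for pid, lo, hi in ROWS}
flagged = [pid for pid, s in spreads.items() if s >= FLAG_THRESHOLD]
unflagged = [pid for pid in spreads if pid not in flagged]
print(f"{len(flagged)}/{len(ROWS)} flagged; below threshold: {unflagged}")
```

Run against the table, this reproduces the 27 of 30 count, with C2, F16, and F17 as the only pairs below threshold.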

Spread bar color: red ≥ 0.70 · amber ≥ 0.50 · blue ≥ 0.20 · gray below threshold.
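The color bands in the legend above amount to a simple threshold ladder. A minimal Python sketch, assuming the bands are checked from highest to lowest and that each lower bound is inclusive, as the ≥ signs suggest. The function name `spread_band` is hypothetical.

```python
def spread_band(spread):
    """Map a spread value to the legend's color band."""
    if spread >= 0.70:
        return "red"
    if spread >= 0.50:
        return "amber"
    if spread >= 0.20:   # same value as the flag threshold
        return "blue"
    return "gray"        # below threshold, not flagged

# Spread values taken from the matrix: D15, X4, C1, C2.
for pair_id, value in [("D15", 0.87), ("X4", 0.50), ("C1", 0.30), ("C2", 0.15)]:
    print(pair_id, spread_band(value))
```

Under this reading every flagged pair is blue, amber, or red, and gray is reserved for the three unflagged pairs.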

Category Averages

Category	Pairs	Avg spread	Interpretation
Control (C1–3)	3	0.23	Baseline agreement on Western-academic pairs. Calibration anchor.
Foil Control (F16–17)	2	0.11	Non-Western content, matched framing. Ensemble agrees. Divergence is epistemological, not cultural.
Reverse Foil (R18–19)	2	0.34	Different words, same meaning. Models partially track meaning, partially track surface.
Cross-Cultural (X4–8, X20–24)	10	0.68	Western vs indigenous framing. Consistent high spread across 8 cultures.
Erasure-Sensitive (E9–12, E25–28)	8	0.64	Historical absence and cultural loss. High spread — models disagree on what counts as the event.
Divergence-Detection (D13–15, D29–30)	5	0.70	Epistemological opposition. Highest spread category. D15 and D30 near ceiling.
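The category averages can be rederived from the Spread column of the matrix. The sketch below uses Python's `decimal` module so that two-decimal rounding is exact. It assumes the averages were rounded half-up, which is what makes the Reverse Foil mean of 0.335 come out as the reported 0.34. The names `SPREADS_BY_CATEGORY` and `avg_spread` are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Spread values per category, copied from the Run 3 spread matrix.
SPREADS_BY_CATEGORY = {
    "Control": ["0.30", "0.15", "0.23"],
    "Foil Control": ["0.12", "0.10"],
    "Reverse Foil": ["0.25", "0.42"],
    "Cross-Cultural": ["0.50", "0.77", "0.70", "0.52", "0.60",
                       "0.75", "0.75", "0.75", "0.80", "0.70"],
    "Erasure-Sensitive": ["0.64", "0.65", "0.65", "0.72",
                          "0.60", "0.67", "0.50", "0.72"],
    "Divergence-Detection": ["0.60", "0.70", "0.87", "0.55", "0.80"],
}

def avg_spread(values):
    """Mean of decimal strings, rounded half-up to two places."""
    total = sum(Decimal(v) for v in values)
    mean = total / len(values)
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

averages = {cat: avg_spread(vals) for cat, vals in SPREADS_BY_CATEGORY.items()}
for cat, avg in averages.items():
    print(f"{cat}: {avg} over {len(SPREADS_BY_CATEGORY[cat])} pairs")
```

With these inputs the computed averages match the table: 0.23, 0.11, 0.34, 0.68, 0.64, and 0.70.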

Next steps: Run 3 data is Tier B. First Tier A run requires full CISP v1.1 auditor isolation and Technician's Read #0 written before any model touches the stimulus set. High-spread pairs from this run (D15, D30, X23, X5) are priority candidates for the Bridge Experiment and full BSA defense elicitation.
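The priority list can be reproduced by ranking pairs on spread. A minimal Python sketch, assuming "priority" means the four highest spreads. The cutoff of four is inferred from the list above rather than stated as a rule, and the names `SPREADS` and `top_pairs` are illustrative.

```python
# Pair ID -> spread, copied from the Run 3 spread matrix.
SPREADS = {
    "C1": 0.30, "C2": 0.15, "C3": 0.23, "X4": 0.50, "X5": 0.77, "X6": 0.70,
    "X7": 0.52, "X8": 0.60, "E9": 0.64, "E10": 0.65, "E11": 0.65, "E12": 0.72,
    "D13": 0.60, "D14": 0.70, "D15": 0.87, "F16": 0.12, "F17": 0.10,
    "R18": 0.25, "R19": 0.42, "X20": 0.75, "X21": 0.75, "X22": 0.75,
    "X23": 0.80, "X24": 0.70, "E25": 0.60, "E26": 0.67, "E27": 0.50,
    "E28": 0.72, "D29": 0.55, "D30": 0.80,
}

def top_pairs(spreads, n):
    """Return the n pair IDs with the largest spread, highest first."""
    return sorted(spreads, key=spreads.get, reverse=True)[:n]

priority = top_pairs(SPREADS, 4)
print(priority)
```

X23 and D30 tie at 0.80, so their relative order depends on the sort. The set of four is unambiguous, since the next pairs down (X20, X21, X22) sit at 0.75, below X5 at 0.77.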