Divergence Testing — Run Data
3 runs · 20 models · 30 pairs · 600 data points · April 2026 · Atlas Heritage Systems
What This Is
Divergence Testing asks a model ensemble to rate the semantic similarity of paired statements on a 0.00–1.00 scale. Each pair presents the same subject from two different epistemic framings — one typically Western-academic, one drawn from indigenous, oral, or non-dominant knowledge traditions. Models score independently in fresh sessions with no Atlas context.
The spread (max − min across the ensemble) measures how much models disagree. High spread on a pair means models have fundamentally different representations of whether those two framings are saying the same thing. That disagreement is the data.
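The spread calculation can be sketched as follows. This is a minimal illustration, not the Atlas tooling: the function name `spread` is hypothetical, and the sample scores are invented except for the lowest and highest values, which match the minimum and maximum reported for D15 in Run 3.

```python
def spread(scores):
    """Return max - min across one pair's ensemble scores, rounded to 2 decimals."""
    if not scores:
        raise ValueError("need at least one score")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must lie on the 0.00-1.00 scale")
    return round(max(scores) - min(scores), 2)


# Hypothetical scores from five models for a single pair.
# Only the endpoints (0.05 and 0.92) come from the Run 3 table.
example_scores = [0.05, 0.40, 0.62, 0.88, 0.92]
print(spread(example_scores))  # 0.87
```

Because spread depends only on the two extreme scores, a single outlier model can set the value for a pair.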
Stack Position
Divergence Testing covers rapid ensemble probing, stimulus validation, and shakedown runs. It produces a spread matrix and stops, and it can achieve Tier A independently.
Its output feeds directly into Phase 3 (Defense Elicitation) and Phase 4 (ECM Behavioral Scoring). The spread matrix identifies which pairs get defense runs.
High-spread pairs from Divergence Testing are candidates for PyHessian probing — connecting behavioral signal on the outside to loss landscape geometry on the inside.
For the full protocol, see Divergence Testing Protocol v1.0.
Run History
| Run | Models | Pairs | Data points | Fidelity | Notes |
|---|---|---|---|---|---|
| Run 1 | 5 | 15 | 75 | Tier C | Proof of concept. No session isolation, stimulus unversioned. |
| Run 2 | 7 | 20 | 140 | Tier B | Expanded stimulus set. Partial isolation — some context leakage. |
| Run 3 | 20 | 30 | 600 | Tier B | Full ensemble, versioned stimulus, fresh sessions. Missing Technician's Read #0 pre-run. |
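The Data points column is consistent with one score per model per pair. A quick check, assuming that is how the column is defined (the page does not state it explicitly); the dictionary name `RUNS` is illustrative only:

```python
# (models, pairs, reported data points) copied from the Run History table.
RUNS = {
    "Run 1": (5, 15, 75),
    "Run 2": (7, 20, 140),
    "Run 3": (20, 30, 600),
}

for name, (models, pairs, reported) in RUNS.items():
    expected = models * pairs  # one score per model per pair
    status = "ok" if expected == reported else "MISMATCH"
    print(f"{name}: {models} x {pairs} = {expected} ({status})")
```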
Run 3 — Summary
20 models across 8 lineages. 30 stimulus pairs across 6 categories. Flag threshold: spread ≥ 0.20, the standard threshold for ensembles of 20 or more models.
Control pairs average 0.23 spread, cross-cultural pairs 0.68, and divergence-detection pairs 0.70. Spread roughly triples from the control baseline to the cross-framing categories, which is the signal the experiment was designed to find.
F16 (Silk Road) and F17 (Ukiyo-e) — both non-Western content pairs with matching framings — scored tight across the ensemble (spread 0.12 and 0.10). The ensemble can agree on non-Western content when the framing is consistent. The divergence is epistemological, not cultural.
Oral tradition (unreliable vs high-fidelity) produced the highest spread in the run at 0.87 — scores ranging from 0.05 to 0.92. Some models read these as near-identical claims about oral tradition. Others read them as near-opposites. That's not measurement noise. That's a training corpus signature.
Run 3 — Ensemble
| Model | Lineage | Access |
|---|---|---|
| Skywork Pro | Chinese (Kunlun Tech) | Native API |
| DeepSeek 3.2 | Chinese (DeepSeek) | DeepAI |
| Qwen 3.5-max | Chinese (Alibaba) | Arena |
| Qwen3.5-122b | Chinese (Alibaba, larger) | Arena |
| Cohere Expanse | Multilingual — 23 languages (Cohere) | HuggingFace |
| IBM Granite | Enterprise (IBM) | Arena |
| Tiny Aya | Multilingual — underrepresented language focus (Cohere) | HuggingFace |
| Z-glm-5 | Chinese (Zhipu AI / Tsinghua) | Arena |
| Claude Opus 4.5 | Western Commercial (Anthropic) | DeepAI |
| GPT-5.2 | Western Commercial (OpenAI) | DeepAI |
| Gemini 3.1 Pro | Western Commercial (Google) | DeepAI |
| Perplexity | Western Commercial — search-augmented | Native API |
| Mistral-large-3 | French (Mistral AI) | Arena |
| Mistral-medium-2505 | French (Mistral AI) | Arena |
| Mistral-small-2603 | French (Mistral AI) | Arena |
| Llama 3.3 70B | Open-weight (Meta) | DeepAI |
| Gemma-3-27b | Open-weight (Google) | Arena |
| Grok 4.2 | Open-weight (xAI) | Native API |
| Phi-4 | Open-weight (Microsoft) | HuggingFace |
| SciFact-search | Specialized — scientific corpus | HuggingFace |
Run 3 — Spread Matrix
30 pairs. Spread = max − min across all 20 models. Flag threshold: ≥ 0.20. 27 of 30 pairs flagged.
| ID | Pair | Category | Min | Max | Spread | Flag |
|---|---|---|---|---|---|---|
| C1 | Printing press | Control | 0.65 | 0.95 | 0.30 | ✓ |
| C2 | Library of Alexandria | Control | 0.80 | 0.95 | 0.15 | |
| C3 | Rosetta Stone | Control | 0.75 | 0.98 | 0.23 | ✓ |
| X4 | Benin Bronzes | Cross-Cultural | 0.25 | 0.75 | 0.50 | ✓ |
| X5 | Ife sculpture / ashe | Cross-Cultural | 0.18 | 0.95 | 0.77 | ✓ |
| X6 | Aboriginal dot paintings / Tjukurpa | Cross-Cultural | 0.15 | 0.85 | 0.70 | ✓ |
| X7 | Ayahuasca (clinical vs sacred) | Cross-Cultural | 0.38 | 0.90 | 0.52 | ✓ |
| X8 | Early internet (academic vs lived) | Cross-Cultural | 0.35 | 0.95 | 0.60 | ✓ |
| E9 | West African trade (goods vs oral knowledge) | Erasure | 0.28 | 0.92 | 0.64 | ✓ |
| E10 | Irish famine (statistics vs cultural loss) | Erasure | 0.25 | 0.90 | 0.65 | ✓ |
| E11 | Endangered languages (classification vs ontology) | Erasure | 0.30 | 0.95 | 0.65 | ✓ |
| E12 | Analog-digital (technical vs interpretive loss) | Erasure | 0.18 | 0.90 | 0.72 | ✓ |
| D13 | Climate (universal vs indigenous framing) | Divergence | 0.10 | 0.70 | 0.60 | ✓ |
| D14 | Digitization (access vs extraction) | Divergence | 0.20 | 0.90 | 0.70 | ✓ |
| D15 | Oral tradition (unreliable vs high-fidelity) | Divergence | 0.05 | 0.92 | 0.87 | ✓ |
| F16 | Silk Road (foil control) | Foil Control | 0.85 | 0.97 | 0.12 | |
| F17 | Ukiyo-e (foil control — non-Western) | Foil Control | 0.90 | 1.00 | 0.10 | |
| R18 | Panama Canal (reverse foil) | Reverse Foil | 0.70 | 0.95 | 0.25 | ✓ |
| R19 | Cotton gin (reverse foil — added context) | Reverse Foil | 0.48 | 0.90 | 0.42 | ✓ |
| X20 | Maori haka (war dance vs identity/genealogy) | Cross-Cultural | 0.15 | 0.90 | 0.75 | ✓ |
| X21 | Chinese medicine (alternative vs empirical) | Cross-Cultural | 0.20 | 0.95 | 0.75 | ✓ |
| X22 | Arabic calligraphy (decorative vs theological) | Cross-Cultural | 0.15 | 0.90 | 0.75 | ✓ |
| X23 | Inca khipu (no writing vs undeciphered encoding) | Cross-Cultural | 0.10 | 0.90 | 0.80 | ✓ |
| X24 | Sami reindeer herding (livelihood vs ontology) | Cross-Cultural | 0.20 | 0.90 | 0.70 | ✓ |
| E25 | Partition of India (migration vs culture loss) | Erasure | 0.30 | 0.90 | 0.60 | ✓ |
| E26 | Khmer Rouge (deaths vs transmission severance) | Erasure | 0.28 | 0.95 | 0.67 | ✓ |
| E27 | Roma (discrimination vs destroyed networks) | Erasure | 0.35 | 0.85 | 0.50 | ✓ |
| E28 | Marshallese navigation (sea level vs knowledge) | Erasure | 0.18 | 0.90 | 0.72 | ✓ |
| D29 | Archaeology (evidence vs material survival bias) | Divergence | 0.30 | 0.85 | 0.55 | ✓ |
| D30 | AI text (indistinguishable vs lacking lived experience) | Divergence | 0.05 | 0.85 | 0.80 | ✓ |
Spread bar color: red ≥ 0.70 · amber 0.50–0.69 · blue 0.20–0.49 · gray below the 0.20 flag threshold.
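The flags and color bands can be reproduced from the Min and Max columns. The sketch below copies those values from the Run 3 table. The names `PAIRS`, `tier`, and `FLAG_THRESHOLD` are hypothetical, and rounding each spread to two decimals is an assumption made to match the table's precision.

```python
from collections import Counter

FLAG_THRESHOLD = 0.20

# (pair id, min score, max score) copied from the Run 3 spread matrix.
PAIRS = [
    ("C1", 0.65, 0.95), ("C2", 0.80, 0.95), ("C3", 0.75, 0.98),
    ("X4", 0.25, 0.75), ("X5", 0.18, 0.95), ("X6", 0.15, 0.85),
    ("X7", 0.38, 0.90), ("X8", 0.35, 0.95), ("E9", 0.28, 0.92),
    ("E10", 0.25, 0.90), ("E11", 0.30, 0.95), ("E12", 0.18, 0.90),
    ("D13", 0.10, 0.70), ("D14", 0.20, 0.90), ("D15", 0.05, 0.92),
    ("F16", 0.85, 0.97), ("F17", 0.90, 1.00), ("R18", 0.70, 0.95),
    ("R19", 0.48, 0.90), ("X20", 0.15, 0.90), ("X21", 0.20, 0.95),
    ("X22", 0.15, 0.90), ("X23", 0.10, 0.90), ("X24", 0.20, 0.90),
    ("E25", 0.30, 0.90), ("E26", 0.28, 0.95), ("E27", 0.35, 0.85),
    ("E28", 0.18, 0.90), ("D29", 0.30, 0.85), ("D30", 0.05, 0.85),
]


def tier(value):
    """Map a spread to the color bands given in the legend."""
    if value >= 0.70:
        return "red"
    if value >= 0.50:
        return "amber"
    if value >= FLAG_THRESHOLD:
        return "blue"
    return "gray"


# Round to 2 decimals so float subtraction noise cannot shift a band edge.
spreads = {pid: round(hi - lo, 2) for pid, lo, hi in PAIRS}
flagged = [pid for pid, s in spreads.items() if s >= FLAG_THRESHOLD]
unflagged = [pid for pid, s in spreads.items() if s < FLAG_THRESHOLD]
tier_counts = Counter(tier(s) for s in spreads.values())

print(f"{len(flagged)} of {len(spreads)} pairs flagged")  # 27 of 30 pairs flagged
print("unflagged:", unflagged)
print(dict(tier_counts))
```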
Category Averages
| Category | Pairs | Avg spread | Interpretation |
|---|---|---|---|
| Control (C1–3) | 3 | 0.23 | Baseline agreement on Western-academic pairs. Calibration anchor. |
| Foil Control (F16–17) | 2 | 0.11 | Non-Western content, matched framing. Ensemble agrees. Divergence is epistemological, not cultural. |
| Reverse Foil (R18–19) | 2 | 0.34 | Different words, same meaning. Models partially track meaning, partially track surface. |
| Cross-Cultural (X4–8, X20–24) | 10 | 0.68 | Western vs indigenous framing. Consistent high spread across 8 cultures. |
| Erasure-Sensitive (E9–12, E25–28) | 8 | 0.64 | Historical absence and cultural loss. High spread — models disagree on what counts as the event. |
| Divergence-Detection (D13–15, D29–30) | 5 | 0.70 | Epistemological opposition. Highest spread category. D15 and D30 near ceiling. |
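The category averages can be recomputed from the Spread column. This sketch groups pairs by the letter prefix of their ID and uses exact decimal arithmetic. Half-up rounding to two decimals is an assumption (the page does not state a rounding rule), and the names `SPREADS`, `CATEGORY_BY_PREFIX`, and `category_averages` are illustrative.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

# Spread values copied from the Run 3 spread matrix, kept as strings
# so Decimal parses them exactly.
SPREADS = {
    "C1": "0.30", "C2": "0.15", "C3": "0.23",
    "X4": "0.50", "X5": "0.77", "X6": "0.70", "X7": "0.52", "X8": "0.60",
    "E9": "0.64", "E10": "0.65", "E11": "0.65", "E12": "0.72",
    "D13": "0.60", "D14": "0.70", "D15": "0.87",
    "F16": "0.12", "F17": "0.10",
    "R18": "0.25", "R19": "0.42",
    "X20": "0.75", "X21": "0.75", "X22": "0.75", "X23": "0.80", "X24": "0.70",
    "E25": "0.60", "E26": "0.67", "E27": "0.50", "E28": "0.72",
    "D29": "0.55", "D30": "0.80",
}

CATEGORY_BY_PREFIX = {
    "C": "Control",
    "F": "Foil Control",
    "R": "Reverse Foil",
    "X": "Cross-Cultural",
    "E": "Erasure-Sensitive",
    "D": "Divergence-Detection",
}


def category_averages(spreads):
    """Average spread per category, rounded half-up to 2 decimals (assumed rule)."""
    groups = defaultdict(list)
    for pair_id, value in spreads.items():
        groups[CATEGORY_BY_PREFIX[pair_id[0]]].append(Decimal(value))
    return {
        name: (sum(values) / len(values)).quantize(
            Decimal("0.01"), rounding=ROUND_HALF_UP
        )
        for name, values in groups.items()
    }


averages = category_averages(SPREADS)
for name, avg in averages.items():
    print(f"{name}: {avg}")
```

Under these assumptions the recomputed values agree with the table. The Reverse Foil mean is exactly 0.335, so its reported 0.34 depends on the half-up rule.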