
Divergence Testing Protocol

v1.0 · April 2026 · Atlas Heritage Systems · Protocol complete — executable immediately

Divergence Testing is not BSA. It produces a spread matrix showing where models disagree, and nothing more. Gap structure, hallucination rates, concept density, and ECM behavioral metrics belong to the full BSA protocol. Run this standalone for rapid ensemble probing or stimulus validation, or run it as Phase 2 of a full BSA.

Scope Boundary

| Divergence Testing produces | BSA additionally produces |
| --- | --- |
| Per-pair scores per model | Defense / explanation text |
| Spread matrix (max − min) | Gap flags by domain |
| Divergence flags (spread ≥ threshold) | ECM behavioral metrics (R, P, code) |
| | EEV, PCR, GSI, concept density |
| | CISP-governed synthesis |

Purpose

Compute semantic similarity scores across a model ensemble on a designed stimulus set, identify pairs where models diverge substantially, and produce a flagged spread matrix. Divergence is treated as signal — it marks where epistemic tension lives in the training data, not where models are wrong.

Prerequisites

Stimulus set versioned in stimuli/stimulus_registry.csv

Fresh session per model confirmed — no cross-contamination

Models identified with exact version and access method logged

If standalone: Technician's Read #0 written before any model runs

If BSA Phase 2: Pre-flight already complete — proceed directly

The Scoring Prompt

Present this prompt to each model in a completely fresh session. One session per model — no carryover, no Atlas context. Record all outputs verbatim. No cleaning, no correction.

```
You will be shown text pairs. For each pair, rate the semantic similarity on a scale from 0.00 to 1.00, where:

0.00 = completely unrelated in meaning
0.50 = related topic but substantially different perspective or content
1.00 = identical in meaning

Respond ONLY with the pair number and your score. No explanations. Be precise to two decimal places.

Example:
1: 0.72
2: 0.45

[Paste stimulus pairs here]
```
Session isolation

One fresh session per model. Operator does not compare outputs until all models have completed.

Split runs

If a model cannot process the full set, split into runs (e.g. 1–15, 16–30) and log as sub-runs.

Verbatim recording

Record all outputs exactly as returned. No cleaning, no correction, no interpretation at this stage.
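Once all models have completed, the verbatim records can be parsed into numeric scores without mutating the originals. The sketch below assumes only the response format the scoring prompt mandates (`pair number: score`); any line that does not match is kept as an anomaly for the operator rather than silently dropped.

```python
import re

# One "N: 0.72"-style line per pair, as mandated by the scoring prompt.
LINE = re.compile(r"^\s*(\d+)\s*:\s*([01](?:\.\d{1,2})?)\s*$")

def parse_response(text):
    """Parse a model's verbatim response into {pair_number: score}.

    Non-conforming lines are returned as anomalies so the verbatim
    record stays auditable.
    """
    scores, anomalies = {}, []
    for line in text.strip().splitlines():
        m = LINE.match(line)
        if m:
            scores[int(m.group(1))] = float(m.group(2))
        else:
            anomalies.append(line)
    return scores, anomalies

scores, anomalies = parse_response("1: 0.72\n2: 0.45\nSure, here are my ratings:")
# scores -> {1: 0.72, 2: 0.45}; the chatty preamble lands in anomalies
```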

Divergence Calculation

After all models score, compute for every pair:

Spread = max(scores across models) − min(scores across models)
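The per-pair computation is a one-liner; this sketch rounds to two decimal places to match the scoring precision:

```python
def spread(scores):
    """Spread for one pair: max minus min across the ensemble's scores."""
    return round(max(scores) - min(scores), 2)

spread([0.72, 0.45, 0.61, 0.38])  # 0.34, matching pair P01 in the example matrix
```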

Flag Thresholds

| Ensemble size | Flag threshold | Notes |
| --- | --- | --- |
| 5+ models | Spread ≥ 0.20 | Standard |
| 3–4 models | Spread ≥ 0.15 | Smaller ensemble, lower threshold |
| 2 models | Spread ≥ 0.10 | Pilot / shakedown only |
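The threshold table translates directly into a lookup keyed on ensemble size, sketched here:

```python
def flag_threshold(n_models):
    """Flag threshold by ensemble size, per the threshold table."""
    if n_models >= 5:
        return 0.20
    if n_models >= 3:
        return 0.15
    if n_models == 2:
        return 0.10
    raise ValueError("need at least 2 models")

def is_flagged(scores):
    """True if this pair's spread meets the threshold for this ensemble size."""
    return max(scores) - min(scores) >= flag_threshold(len(scores))
```

With the four-model example pair P01 (`[0.72, 0.45, 0.61, 0.38]`), the spread of 0.34 clears the 0.15 threshold and the pair is flagged.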

Secondary Flags

| Flag | Signal |
| --- | --- |
| Hallucination convergence | All models converge high (≥ 0.80) on ORTHO stimuli — models agree on a pair they should not. |
| Fluency-without-content | All models converge high on fabricated or absurd stimuli — surface plausibility masking absence of content. |
| Lineage signal | One model is a consistent outlier across multiple pairs — a corpus or alignment signature, not noise. |
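The first and third secondary flags can be screened mechanically. This is a sketch under stated assumptions: the 0.80 cutoff comes from the definition above, but the 0.30 deviation margin used to surface lineage candidates is an illustrative choice, not part of the protocol, and the operator still makes the call.

```python
def hallucination_convergence(scores, stimulus_type):
    """All models converge high on an ORTHO pair they should score low."""
    return stimulus_type == "ORTHO" and min(scores) >= 0.80

def lineage_candidates(matrix, margin=0.30):
    """Count, per model, pairs where it sits more than `margin` away
    from the mean of the other models' scores.

    matrix: {pair_id: {model_name: score}}. High counts across many
    pairs are lineage candidates; the margin is an assumed heuristic.
    """
    counts = {}
    for scores in matrix.values():
        for model, s in scores.items():
            others = [v for m, v in scores.items() if m != model]
            if abs(s - sum(others) / len(others)) > margin:
                counts[model] = counts.get(model, 0) + 1
    return counts
```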

Output: Spread Matrix

The spread matrix is the final deliverable of Divergence Testing. If running standalone, stop here. If running as BSA Phase 2, pass flagged pairs to Phase 3 (Defense Elicitation).

| Pair ID | Type | Model A | Model B | Model C | Model D | Spread | Flag |
| --- | --- | --- | --- | --- | --- | --- | --- |
| P01 | CONTEST | 0.72 | 0.45 | 0.61 | 0.38 | 0.34 | ✓ |
| P02 | ALIGN | 0.89 | 0.91 | 0.87 | 0.90 | 0.04 | |
| P03 | ORTHO | 0.12 | 0.09 | 0.78 | 0.11 | 0.69 | ✓ HALLUC |
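A minimal sketch of assembling the deliverable as CSV. The column layout mirrors the example matrix; the 0.15 default threshold assumes a 3–4 model ensemble, and the flag text here is a placeholder rather than the protocol's notation.

```python
import csv

def write_spread_matrix(path, pairs, threshold=0.15):
    """Write the spread matrix as CSV.

    pairs: {pair_id: (stimulus_type, {model_name: score})}
    """
    models = sorted({m for _, scores in pairs.values() for m in scores})
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["pair_id", "type", *models, "spread", "flag"])
        for pid, (ptype, scores) in sorted(pairs.items()):
            vals = [scores[m] for m in models]
            spread = round(max(vals) - min(vals), 2)
            flag = "FLAG" if spread >= threshold else ""
            w.writerow([pid, ptype, *vals, spread, flag])
```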

Logging

Log each model's run using scripts/log_bsa_run.py. Set grounding: OFF unless grounding conditions were applied. Leave eev, concept_density, pcr, and gsi blank — those are BSA Phase 5. Record flag count and any lineage outlier observations in the notes field. If standalone, leave condition_matrix_id blank unless part of a factorial batch.

CISP Fidelity Tiers

Standalone Divergence Testing runs can achieve Tier A independently. Full BSA completion is not required.

Tier A

Fresh session per model, no Atlas context, stimulus versioned, Technician's Read #0 pre-run, outputs recorded verbatim.

Tier B

Incomplete isolation or unversioned stimulus.

Tier C

Pre-protocol runs, context leakage, no session record.

Reproducibility Package

Folder: DIVTEST-[RunID]-[Date]/

Raw model outputs — verbatim, labeled by model and version

Exact prompt used

Spread matrix with flags

Stimulus set file with version ID

Technician's Read #0 (if standalone)

Session log — model names, versions, access method, timestamps
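The package layout above can be scaffolded up front so nothing gets misfiled mid-run. Only the top-level `DIVTEST-[RunID]-[Date]/` name comes from the protocol; the subfolder names in this sketch are illustrative assumptions.

```python
from pathlib import Path

def scaffold_package(run_id, date, root="."):
    """Create the reproducibility folder and illustrative subfolders."""
    base = Path(root) / f"DIVTEST-{run_id}-{date}"
    for sub in ("raw_outputs", "prompt", "matrix", "stimuli", "logs"):
        (base / sub).mkdir(parents=True, exist_ok=True)
    return base
```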

Run data: Three runs completed. Run 3 is the highest fidelity — 20 models, 30 pairs, 600 data points. See the Divergence Testing Run Data page for the full spread matrix.