
Divergence Testing Protocol

v1.0 · April 2026 · Atlas Heritage Systems · Protocol complete — executable immediately

Divergence Testing is not BSA. It produces a spread matrix showing where models disagree, and nothing more. Gap structure, hallucination rates, concept density, and ECM behavioral metrics belong to the full BSA protocol. Run this standalone for rapid ensemble probing or stimulus validation, or run it as Phase 2 of a full BSA.

Scope Boundary

| Divergence Testing produces | BSA additionally produces |
| --- | --- |
| Per-pair scores per model | Defense / explanation text |
| Spread matrix (max − min) | Gap flags by domain |
| Divergence flags (spread ≥ threshold) | ECM behavioral metrics (R, P, code) |
| | EEV, PCR, GSI, concept density |
| | CISP-governed synthesis |

Purpose

Compute semantic similarity scores across a model ensemble on a designed stimulus set, identify pairs where models diverge substantially, and produce a flagged spread matrix. Divergence is treated as signal — it marks where epistemic tension lives in the training data, not where models are wrong.

Prerequisites

Stimulus set versioned in stimuli/stimulus_registry.csv

Fresh session per model confirmed — no cross-contamination

Models identified with exact version and access method logged

If standalone: Technician's Read #0 written before any model runs

If BSA Phase 2: Pre-flight already complete — proceed directly

The Scoring Prompt

Present this prompt to each model in a completely fresh session. One session per model — no carryover, no Atlas context. Record all outputs verbatim. No cleaning, no correction.

```
You will be shown text pairs. For each pair, rate the semantic similarity on a scale from 0.00 to 1.00, where:

0.00 = completely unrelated in meaning
0.50 = related topic but substantially different perspective or content
1.00 = identical in meaning

Respond ONLY with the pair number and your score. No explanations. Be precise to two decimal places.

Example:
1: 0.72
2: 0.45

[Paste stimulus pairs here]
```
Session isolation

One fresh session per model. Operator does not compare outputs until all models have completed.

Split runs

If a model cannot process the full set, split into runs (e.g. 1–15, 16–30) and log as sub-runs.

Verbatim recording

Record all outputs exactly as returned. No cleaning, no correction, no interpretation at this stage.
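Once all models have completed, the verbatim records can be parsed into numeric scores without mutating the originals. The sketch below assumes only the response format the scoring prompt mandates (`pair number: score`); any line that does not match is kept as an anomaly for the operator rather than silently dropped.

```python
import re

# One "N: 0.72"-style line per pair, as mandated by the scoring prompt.
LINE = re.compile(r"^\s*(\d+)\s*:\s*([01](?:\.\d{1,2})?)\s*$")

def parse_response(text):
    """Parse a model's verbatim response into {pair_number: score}.

    Non-conforming lines are returned as anomalies so the verbatim
    record stays auditable.
    """
    scores, anomalies = {}, []
    for line in text.strip().splitlines():
        m = LINE.match(line)
        if m:
            scores[int(m.group(1))] = float(m.group(2))
        else:
            anomalies.append(line)
    return scores, anomalies

scores, anomalies = parse_response("1: 0.72\n2: 0.45\nSure, here are my ratings:")
# scores -> {1: 0.72, 2: 0.45}; the chatty preamble lands in anomalies
```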

Divergence Calculation

After all models score, compute for every pair:

Spread = max(scores across models) − min(scores across models)
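The per-pair computation is a one-liner; this sketch rounds to two decimal places to match the scoring precision:

```python
def spread(scores):
    """Spread for one pair: max minus min across the ensemble's scores."""
    return round(max(scores) - min(scores), 2)

spread([0.72, 0.45, 0.61, 0.38])  # 0.34, matching pair P01 in the example matrix
```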

Flag Thresholds

| Ensemble size | Flag threshold | Notes |
| --- | --- | --- |
| 5+ models | Spread ≥ 0.20 | Standard |
| 3–4 models | Spread ≥ 0.15 | Smaller ensemble, lower threshold |
| 2 models | Spread ≥ 0.10 | Pilot / shakedown only |
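The threshold table translates directly into a lookup keyed on ensemble size, sketched here:

```python
def flag_threshold(n_models):
    """Flag threshold by ensemble size, per the threshold table."""
    if n_models >= 5:
        return 0.20
    if n_models >= 3:
        return 0.15
    if n_models == 2:
        return 0.10
    raise ValueError("need at least 2 models")

def is_flagged(scores):
    """True if this pair's spread meets the threshold for this ensemble size."""
    return max(scores) - min(scores) >= flag_threshold(len(scores))
```

With the four-model example pair P01 (`[0.72, 0.45, 0.61, 0.38]`), the spread of 0.34 clears the 0.15 threshold and the pair is flagged.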

Secondary Flags

| Flag | Signal |
| --- | --- |
| Hallucination convergence | All models converge high (≥ 0.80) on ORTHO stimuli — models agree on a pair they should not. |
| Fluency-without-content | All models converge high on fabricated or absurd stimuli — surface plausibility masking absence of content. |
| Lineage signal | One model is a consistent outlier across multiple pairs — a corpus or alignment signature, not noise. |
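The first and third secondary flags can be screened mechanically. This is a sketch under stated assumptions: the 0.80 cutoff comes from the definition above, but the 0.30 deviation margin used to surface lineage candidates is an illustrative choice, not part of the protocol, and the operator still makes the call.

```python
def hallucination_convergence(scores, stimulus_type):
    """All models converge high on an ORTHO pair they should score low."""
    return stimulus_type == "ORTHO" and min(scores) >= 0.80

def lineage_candidates(matrix, margin=0.30):
    """Count, per model, pairs where it sits more than `margin` away
    from the mean of the other models' scores.

    matrix: {pair_id: {model_name: score}}. High counts across many
    pairs are lineage candidates; the margin is an assumed heuristic.
    """
    counts = {}
    for scores in matrix.values():
        for model, s in scores.items():
            others = [v for m, v in scores.items() if m != model]
            if abs(s - sum(others) / len(others)) > margin:
                counts[model] = counts.get(model, 0) + 1
    return counts
```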

Output: Spread Matrix

The spread matrix is the final deliverable of Divergence Testing. If running standalone, stop here. If running as BSA Phase 2, pass flagged pairs to Phase 3 (Defense Elicitation).

| Pair ID | Type | Model A | Model B | Model C | Model D | Spread | Flag |
| --- | --- | --- | --- | --- | --- | --- | --- |
| P01 | CONTEST | 0.72 | 0.45 | 0.61 | 0.38 | 0.34 | ✓ |
| P02 | ALIGN | 0.89 | 0.91 | 0.87 | 0.90 | 0.04 | |
| P03 | ORTHO | 0.12 | 0.09 | 0.78 | 0.11 | 0.69 | ✓ HALLUC |
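A minimal sketch of assembling the deliverable as CSV. The column layout mirrors the example matrix; the 0.15 default threshold assumes a 3–4 model ensemble, and the flag text here is a placeholder rather than the protocol's notation.

```python
import csv

def write_spread_matrix(path, pairs, threshold=0.15):
    """Write the spread matrix as CSV.

    pairs: {pair_id: (stimulus_type, {model_name: score})}
    """
    models = sorted({m for _, scores in pairs.values() for m in scores})
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["pair_id", "type", *models, "spread", "flag"])
        for pid, (ptype, scores) in sorted(pairs.items()):
            vals = [scores[m] for m in models]
            spread = round(max(vals) - min(vals), 2)
            flag = "FLAG" if spread >= threshold else ""
            w.writerow([pid, ptype, *vals, spread, flag])
```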

Logging

Log each model's run using scripts/log_bsa_run.py. Set grounding: OFF unless grounding conditions were applied. Leave eev, concept_density, pcr, and gsi blank — those are BSA Phase 5. Record flag count and any lineage outlier observations in the notes field. If standalone, leave condition_matrix_id blank unless part of a factorial batch.

CISP Fidelity Tiers

Standalone Divergence Testing runs can achieve Tier A independently. Full BSA completion is not required.

Tier A

Fresh session per model, no Atlas context, stimulus versioned, Technician's Read #0 pre-run, outputs recorded verbatim.

Tier B

Incomplete isolation or unversioned stimulus.

Tier C

Pre-protocol runs, context leakage, no session record.

Reproducibility Package

Folder: DIVTEST-[RunID]-[Date]/

Raw model outputs — verbatim, labeled by model and version

Exact prompt used

Spread matrix with flags

Stimulus set file with version ID

Technician's Read #0 (if standalone)

Session log — model names, versions, access method, timestamps
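The package layout above can be scaffolded up front so nothing gets misfiled mid-run. Only the top-level `DIVTEST-[RunID]-[Date]/` name comes from the protocol; the subfolder names in this sketch are illustrative assumptions.

```python
from pathlib import Path

def scaffold_package(run_id, date, root="."):
    """Create the reproducibility folder and illustrative subfolders."""
    base = Path(root) / f"DIVTEST-{run_id}-{date}"
    for sub in ("raw_outputs", "prompt", "matrix", "stimuli", "logs"):
        (base / sub).mkdir(parents=True, exist_ok=True)
    return base
```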

Run data: Three runs completed. Run 3 is the highest fidelity — 20 models, 30 pairs, 600 data points. See the Divergence Testing Run Data page for the full spread matrix.