Divergence Testing Protocol
v1.0 · April 2026 · Atlas Heritage Systems · Protocol complete — executable immediately
Scope Boundary
| Divergence Testing produces | BSA additionally produces |
|---|---|
| Per-pair scores per model | Defense / explanation text |
| Spread matrix (max − min) | Gap flags by domain |
| Divergence flags (spread ≥ threshold) | ECM behavioral metrics (R, P, code) |
| | EEV, PCR, GSI, concept density |
| | CISP-governed synthesis |
Purpose
Compute semantic similarity scores across a model ensemble on a designed stimulus set, identify pairs where models diverge substantially, and produce a flagged spread matrix. Divergence is treated as signal — it marks where epistemic tension lives in the training data, not where models are wrong.
Prerequisites
Stimulus set versioned in stimuli/stimulus_registry.csv
Fresh session per model confirmed — no cross-contamination
Models identified with exact version and access method logged
If standalone: Technician's Read #0 written before any model runs
If BSA Phase 2: Pre-flight already complete — proceed directly
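The prerequisite checks above can be scripted. A minimal sketch, assuming the registry path from the protocol; the column name `version` is an assumption, not specified by the protocol:

```python
import csv
import os

def preflight_check(registry_path="stimuli/stimulus_registry.csv"):
    """Verify the stimulus registry exists and carries a version ID
    before any model runs. The 'version' column name is an assumption."""
    if not os.path.exists(registry_path):
        raise FileNotFoundError(f"Stimulus registry missing: {registry_path}")
    with open(registry_path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows or not any("version" in k.lower() for k in rows[0]):
        raise ValueError("Registry has no version column: stimulus must be versioned")
    return len(rows)
```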
The Scoring Prompt
Present this prompt to each model in a completely fresh session: one session per model, no carryover, no Atlas context.
The operator does not compare outputs until all models have completed.
If a model cannot process the full set, split it into sub-runs (e.g. 1–15, 16–30) and log each sub-run separately.
Record all outputs verbatim: no cleaning, no correction, no interpretation at this stage.
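The sub-run split (e.g. 1–15, 16–30) can be generated mechanically. A minimal sketch; the sub-run label format is an assumption:

```python
def split_into_subruns(pair_ids, max_per_run=15):
    """Chunk an ordered stimulus list into sub-runs of at most
    max_per_run items, labeled for the session log."""
    runs = []
    for i in range(0, len(pair_ids), max_per_run):
        chunk = pair_ids[i:i + max_per_run]
        runs.append((f"subrun-{i // max_per_run + 1}", chunk))
    return runs
```

For a 30-pair set with the default of 15 per run, this yields two sub-runs covering pairs 1–15 and 16–30.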
Divergence Calculation
After all models have scored, compute the spread for every pair: the maximum similarity score minus the minimum across the ensemble. The spread is the divergence measure evaluated against the flag thresholds below.
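The spread computation (max minus min per pair) is small enough to sketch directly; the score-dictionary shape is illustrative:

```python
def pair_spread(scores):
    """Spread for one stimulus pair: maximum minus minimum
    similarity score across all models in the ensemble."""
    return max(scores) - min(scores)

def spread_matrix(scores_by_pair):
    """scores_by_pair maps pair ID -> {model: score}.
    Returns pair ID -> spread, computed over every pair."""
    return {pid: pair_spread(list(by_model.values()))
            for pid, by_model in scores_by_pair.items()}
```

Applied to the example scores for P01 in the matrix later in this document (0.72, 0.45, 0.61, 0.38), this yields a spread of 0.34.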
Flag Thresholds
| Ensemble size | Flag threshold | Notes |
|---|---|---|
| 5+ models | Spread ≥ 0.20 | Standard |
| 3–4 models | Spread ≥ 0.15 | Smaller ensemble, lower threshold |
| 2 models | Spread ≥ 0.10 | Pilot / shakedown only |
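The threshold table maps directly to a lookup. A sketch using the values above:

```python
def flag_threshold(n_models):
    """Spread threshold by ensemble size, per the protocol table."""
    if n_models >= 5:
        return 0.20
    if n_models >= 3:
        return 0.15
    if n_models == 2:
        return 0.10   # pilot / shakedown only
    raise ValueError("Divergence Testing needs at least 2 models")

def is_flagged(scores):
    """True when the pair's spread meets the threshold for this ensemble."""
    return max(scores) - min(scores) >= flag_threshold(len(scores))
```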
Secondary Flags
In addition to the primary spread flag, mark single-model outliers. In the example matrix below, the HALLUC flag marks an ORTHO pair that one model scored far above the rest of the ensemble.
Output: Spread Matrix
The spread matrix is the final deliverable of Divergence Testing. If running standalone, stop here. If running as BSA Phase 2, pass flagged pairs to Phase 3 (Defense Elicitation).
| Pair ID | Type | Model A | Model B | Model C | Model D | Spread | Flag |
|---|---|---|---|---|---|---|---|
| P01 | CONTEST | 0.72 | 0.45 | 0.61 | 0.38 | 0.34 | ✓ |
| P02 | ALIGN | 0.89 | 0.91 | 0.87 | 0.90 | 0.04 | |
| P03 | ORTHO | 0.12 | 0.09 | 0.78 | 0.11 | 0.69 | ✓ HALLUC |
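Assembling a deliverable row can be sketched as follows. This covers the primary spread flag only; secondary flags such as HALLUC are applied separately:

```python
def matrix_row(pair_id, pair_type, scores, threshold):
    """One spread-matrix row: per-model scores, spread (max - min),
    and the primary divergence flag, rendered as a markdown table row."""
    spread = max(scores) - min(scores)
    flag = "✓" if spread >= threshold else ""
    cells = [pair_id, pair_type] + [f"{s:.2f}" for s in scores] \
            + [f"{spread:.2f}", flag]
    return "| " + " | ".join(cells) + " |"
```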
Logging
Log each model's run using scripts/log_bsa_run.py. Set grounding: OFF unless grounding conditions were applied. Leave eev, concept_density, pcr, and gsi blank — those are BSA Phase 5. Record flag count and any lineage outlier observations in the notes field. If standalone, leave condition_matrix_id blank unless part of a factorial batch.
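The run record described above can be sketched as a dict. The field names come from the protocol text, but the exact interface of scripts/log_bsa_run.py is not specified here, so treat this shape as an assumption:

```python
def divtest_log_record(model, version, flag_count, notes="", grounding=False):
    """Run-log record for one model's Divergence Testing session.
    EEV, concept density, PCR, and GSI stay blank: they belong to
    BSA Phase 5, not this protocol."""
    return {
        "model": model,
        "version": version,
        "grounding": "ON" if grounding else "OFF",
        "eev": None,
        "concept_density": None,
        "pcr": None,
        "gsi": None,
        "condition_matrix_id": None,   # blank unless part of a factorial batch
        "flag_count": flag_count,
        "notes": notes,                # lineage outlier observations go here
    }
```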
CISP Fidelity Tiers
Standalone Divergence Testing runs can achieve Tier A independently. Full BSA completion is not required.
Tier A: Fresh session per model, no Atlas context, stimulus versioned, Technician's Read #0 pre-run, outputs recorded verbatim.
Tier B: Incomplete isolation or unversioned stimulus.
Tier C: Pre-protocol runs, context leakage, no session record.
Reproducibility Package
Folder: DIVTEST-[RunID]-[Date]/
Raw model outputs — verbatim, labeled by model and version
Exact prompt used
Spread matrix with flags
Stimulus set file with version ID
Technician's Read #0 (if standalone)
Session log — model names, versions, access method, timestamps
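Creating the package skeleton can be scripted. The folder name follows the DIVTEST-[RunID]-[Date]/ convention above; the individual file names mirror the artifact list but are otherwise assumptions:

```python
import os

def make_divtest_package(run_id, date, root="."):
    """Create the DIVTEST-[RunID]-[Date]/ folder with a subfolder for
    raw model outputs and empty placeholders for each required artifact.
    Placeholder file names are illustrative, not mandated by the protocol."""
    folder = os.path.join(root, f"DIVTEST-{run_id}-{date}")
    os.makedirs(os.path.join(folder, "raw_outputs"), exist_ok=True)
    for fname in ("prompt.txt", "spread_matrix.csv", "stimulus_set.csv",
                  "technicians_read_0.md", "session_log.csv"):
        open(os.path.join(folder, fname), "a").close()
    return folder
```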