
Gap Analysis Development Notes v1.3

A personal GPT session testing epistemic regime behavior across capability and grounding conditions. Tier C. Not in suite.


K.C. Hoye | Atlas Heritage Systems | April 7, 2026
Status: Tier C — Atlas-loaded session, no isolation record, Atlas-adjacent seed throughout.
Classification: Development note. Not added to suite. Not claimed as Tier A data.


1. What this is and what it isn't

This is a development note, not a protocol paper. It documents a morning session run in a personal GPT workspace on April 7, 2026, with the experimental prompts generated by a prior Skywork instance. The session proceeded in GPT until the data collection was complete, at which point it was handed off to a second Skywork instance for formalization. The decision was then made not to add this experiment to the diagnostic suite — not because it failed, but because Atlas already has more instruments than it currently has runway to run cleanly. The right move is to finish what's on the bench before building more benches.

The session was Tier C from the start. Atlas content was in the session box throughout. The seed was contaminated before the first prompt dropped. That classification doesn't change what the data shows — it changes what the data is licensed to claim.

What the session found: there are at least three distinguishable epistemic regimes in how Gemini-family models construct a knowledge field under a fixed prompt, and those regimes are reliably elicited by the interaction of model capability and grounding access. The grounding toggle does not uniformly suppress or expand output. It produces opposite effects at different capability levels. Under a controlled design, that interaction signal is clean enough to be methodologically significant.

That's worth writing down. It's not worth adding a sixth instrument to a stack where instruments one through five haven't run a Tier A session yet.


2. Background and prompt origins

The research prompts used in this session were generated by a prior Skywork instance as part of Atlas's cross-lineage prompt development work. The prompt was designed to probe a literature space at the intersection of model behavior, epistemic compression, and knowledge representation — structured as a seven-area search task with explicit instructions not to synthesize, not to fill gaps, and to report only what verifiably exists.

The seven search areas used in this session were a domain-shifted version of the original Atlas cultural-preservation prompt, moved into structural/technical LLM behavior territory:

  1. Temporal boundary artifacts in LLM behavior
  2. Calibration decay under domain shift as a structural measurement
  3. Source-type conflation in LLM generation
  4. Low-resource language capability topology
  5. Negation and logical asymmetry in probability distributions
  6. Scientific consensus sensitivity as behavioral measurement
  7. Tacit and procedural knowledge representation gaps

The domain shift was deliberate — to test whether the behavioral patterns (gap suppression, synthetic filling, capability-grounding interaction) were domain-specific or domain-invariant. They are domain-invariant. The same three-regime structure appeared here as in the original Atlas cultural domain run.


3. Design

Experimental design: 3×2 full factorial.

Factor      Levels
Model       Gemini 3.1 Flash Lite Preview / Gemini 3 Flash Preview / Gemini 3 Pro Preview
Grounding   ON (Google Search) / OFF

Six conditions total. Same prompt, identical across all runs. Fresh session each run. Temperature 1.0, thinking level High, all other tools off, no follow-up turns.

Temperature 1.0 is not a neutral setting. At high temperature, models face maximum pressure to complete missing knowledge. The design is therefore not only testing "what models know" — it is testing how models behave under pressure to fill uncertainty. That framing turns out to be the more interesting question.
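As a sanity check on the run plan, the six conditions can be enumerated mechanically. A minimal sketch, with factor levels copied from the design table above; the settings dict keys are illustrative labels, not any vendor API:

```python
from itertools import product

# Factor levels, copied from the design table above.
MODELS = (
    "Gemini 3.1 Flash Lite Preview",
    "Gemini 3 Flash Preview",
    "Gemini 3 Pro Preview",
)
GROUNDING = ("ON (Google Search)", "OFF")

# Settings held fixed across every run (keys are illustrative, not an API).
FIXED = {"temperature": 1.0, "thinking": "High", "other_tools": "off", "follow_up_turns": 0}

# One fresh session per condition, identical prompt throughout.
conditions = list(product(MODELS, GROUNDING))
assert len(conditions) == 6  # full 3x2 factorial
```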


4. Errors and corrections

Error 1 — Flash ON: Maps grounding substituted for Google Search

During the Flash model runs, an ON condition was submitted with Google Maps grounding active instead of Google Search. The retrieval corpus was wrong — Maps sources location and business data, not academic and technical literature. The output appeared grounded but was pulling from a categorically different retrieval graph.

The error was caught before the data was recorded. The run was logged as invalid and discarded immediately. The valid Flash pair (OFF: 6,430 / ON: 7,373) was collected on a fresh rerun with Google Search confirmed active.

The Maps run is not just noise — it is a clean live demonstration of tool consistency failure. The toggle label in the UI reads the same way regardless of which grounding source is active. Without checking the actual source list in the output footer, a Maps-grounded run is indistinguishable from a Google Search-grounded run at the protocol level. This is the specific failure mode CISP's tool consistency requirement is designed to prevent. Future BSA runs need to record the grounding source, not just the toggle state.

Error 2 — Flash Lite: initial token anomaly

An earlier Flash Lite run produced ON and OFF token counts of approximately 2,800 each — near-identical, falsely suggesting no grounding effect for that model. The rerun produced the correct values (OFF: 3,155 / ON: 4,059), restoring the expected expansion pattern.


5. Data

Complete token matrix — final clean values

Run   Model        Grounding   Tokens
A1    Flash        OFF         6,430
A2    Flash        ON          7,373
A3    Flash Lite   OFF         3,155
A4    Flash Lite   ON          4,059
A5    Pro          ON          4,984
A6    Pro          OFF         9,622

Discarded: Flash Lite ON (Maps): 2,968 tokens — invalid retrieval corpus, excluded from analysis.

EEV — within-model grounding effect (ON minus OFF)

Model        OFF     ON      Δ (EEV)   Direction
Flash Lite   3,155   4,059   +904      Expansion
Flash        6,430   7,373   +943      Expansion
Pro          9,622   4,984   −4,638    Compression

The sign flips between Flash and Pro. Grounding expands output for both lower-capability models by roughly the same margin (~900 tokens). Grounding compresses Pro output by nearly five times that amount.
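The EEV column is just the per-model ON-minus-OFF difference. A minimal sketch of the computation over the clean token matrix, nothing more:

```python
# Final clean token counts, transcribed from the matrix above.
tokens = {
    ("Flash Lite", "OFF"): 3155, ("Flash Lite", "ON"): 4059,
    ("Flash", "OFF"): 6430,      ("Flash", "ON"): 7373,
    ("Pro", "OFF"): 9622,        ("Pro", "ON"): 4984,
}

def eev(model):
    """Within-model grounding effect: ON minus OFF token count."""
    return tokens[(model, "ON")] - tokens[(model, "OFF")]

for m in ("Flash Lite", "Flash", "Pro"):
    direction = "Expansion" if eev(m) > 0 else "Compression"
    print(f"{m}: {eev(m):+,d} ({direction})")
```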

Gap flags — by area and run

Area                         Flash OFF     Flash ON   Lite OFF       Lite ON        Pro OFF        Pro ON
1 — Temporal boundary        None          None       None           Explicit gap   None           None
2 — Calibration decay        None          None       None           None           None           None
3 — Source-type conflation   Partial gap   None       None           None           Explicit gap   Explicit gap
4 — Low-resource capability  None          None       None           None           None           Partial gap
5 — Negation / asymmetry     None          None       Explicit gap   Explicit gap   None           None
6 — Scientific consensus     Partial gap   None       None           None           Explicit gap   None
7 — Tacit/procedural         Partial gap   None       Gap flag       None           Explicit gap   None
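For bookkeeping, the per-area flag density in this matrix can be tallied directly. The dict below transcribes only the non-None cells; run labels are shorthand for the column headers:

```python
from collections import Counter

# Non-None cells of the gap-flag matrix above, keyed by (area, run).
gap_flags = {
    (1, "Lite ON"):   "Explicit gap",
    (3, "Flash OFF"): "Partial gap",
    (3, "Pro OFF"):   "Explicit gap",
    (3, "Pro ON"):    "Explicit gap",
    (4, "Pro ON"):    "Partial gap",
    (5, "Lite OFF"):  "Explicit gap",
    (5, "Lite ON"):   "Explicit gap",
    (6, "Flash OFF"): "Partial gap",
    (6, "Pro OFF"):   "Explicit gap",
    (7, "Flash OFF"): "Partial gap",
    (7, "Lite OFF"):  "Gap flag",
    (7, "Pro OFF"):   "Explicit gap",
}

# Flags per area; areas 3 and 7 carry the most, area 2 carries none.
flags_per_area = Counter(area for area, _run in gap_flags)
```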

6. What the data shows

The grounding effect is capability-dependent, not uniform.

The initial framing — grounding constrains output — was wrong. Flash Lite expands by +904 tokens under grounding. Flash expands by +943. Pro compresses by −4,638. The same toggle, applied to models at different capability levels, produces opposite behavioral effects. This is an interaction effect, not a main effect.

The interpretation: lower-capability models use grounding as scaffolding. It gives them structure they cannot generate internally, so they produce more. Pro has internal structure sufficient to generate a full synthetic field without external input — its OFF run at 9,622 tokens is the largest output in the matrix. Grounding constrains Pro toward verified citations and away from synthetic elaboration, so it produces less but more precisely targeted content.

Three behavioral regimes, characterized by output structure rather than size.

Completion regime (Flash OFF, Pro OFF): Models generating freely at temperature 1.0 fill the research space uniformly. Both show full-coverage outputs across all seven areas. Pro OFF is larger because it has more internal structure to draw on. But both are operating in the same mode: accept the prompt's ontological categories as real, backfill with plausible content. Citation quality in Flash OFF is high-PCR — suspiciously clean titles, arXiv-style links, symmetric coverage across all areas.

Constraint regime (Flash Lite ON, Flash ON): Under grounding, both Lite and Flash expand. The expansion is not a contradiction: it means grounding gives these models access to more than their internal representations can generate. Both runs show real source diversity — aclanthology.org, openreview.net, actual arXiv IDs — that the ungrounded runs lack.

Synthesis regime (Pro ON): Pro ON (4,984) is smaller than Pro OFF (9,622). Under grounding pressure, Pro does not simply retrieve — it selects. The gap flags are honest. Pro ON is the most epistemically accurate output in the matrix — it finds less, but what it finds is more grounded, and where nothing grounded exists, it says so.
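The three regimes reduce to a tagging rule over two observables: grounding state and EEV sign. The function below is a descriptive sketch of that rule as applied after the fact in this session; the labels are this note's own, not a predictive model:

```python
def classify_regime(grounding_on, eev_sign):
    """Descriptive tagging rule for the three regimes observed above.
    eev_sign is the sign of the model's ON-minus-OFF token delta.
    Applied post hoc to this session's runs; not a predictor."""
    if not grounding_on:
        return "completion"  # free generation: uniform backfill of the prompt's categories
    if eev_sign > 0:
        return "constraint"  # grounding scaffolds a lower-capability model; output expands
    return "synthesis"       # grounding lets a higher-capability model select; output compresses

# The six clean runs, tagged:
assert classify_regime(False, +1) == "completion"  # Flash OFF, Lite OFF, Pro OFF
assert classify_regime(True, +1) == "constraint"   # Lite ON, Flash ON
assert classify_regime(True, -1) == "synthesis"    # Pro ON
```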

The phase transition.

The EEV sign flip between Flash and Pro is not a gradual gradient. Flash expands under grounding at nearly the same rate as Lite (+943 vs +904). Then Pro inverts by a factor of five. The behavioral regime change between Flash and Pro is threshold-like, not continuous. Whether the threshold is a capability threshold, a training corpus effect, or an RLHF-regime difference is not determinable from this data. The PyHessian instrument is the mechanism for testing the geometric claim. This session can only observe the behavior.

Gap stability — what areas survived all conditions:

  • True gap (high confidence): Area 3 (Source-type conflation) and Area 7 (Tacit/procedural knowledge). Area 3 is flagged in both Pro runs, including Pro ON, the highest-reliability run in the matrix; Area 7 is flagged in every ungrounded run, including Pro OFF.
  • Capability-dependent gap: Area 5 (Negation/asymmetry). Only Lite flags it. Flash and Pro both produce results.
  • Emerging / fragmented: Area 6 (Scientific consensus sensitivity). Gap in Pro OFF, partial fill elsewhere.
  • Established territory: Areas 2 and 4 (Calibration decay, Low-resource capability topology). All runs produced results, with one partial flag for Area 4 in Pro ON.

7. Why it's not going in the suite

Atlas already has five governed instruments in the stack — ECM, BSA, Divergence Testing, GG-CSAP, and PyHessian — none of which have Tier A data yet. Adding a sixth instrument increases the surface area that needs to be governed, logged, and maintained before re-entry.

The experiment is also not sufficiently differentiated from BSA to justify its own instrument status. The grounding toggle is already in BSA's 3×2 factorial design. The capability variable is already in the Canary Ensemble logic. What this session adds is a behavioral characterization of the three output regimes — and the specific finding that EEV sign-flips across the capability threshold — that BSA's existing metrics (EEV, PCR, GSI) can capture without a new instrument wrapper.

Specific design inputs to BSA from this session:

  1. Bracket the capability threshold. The formal BSA run should include at least one model on each side of it. Running only lower-capability models will show expansion effects; running only Pro-tier will show compression. The interaction effect requires both.
  2. EEV sign behavior. EEV as "grounding suppresses output" is only true above the capability threshold. Below it, EEV is positive (expansion). BSA analysis should not interpret EEV directionality without first classifying the model's capability tier.
  3. Area 3 and Area 7 as candidate stimulus areas. Area 3 received gap flags in both Pro runs; Area 7 in every ungrounded run. Good candidates for BSA's contested stimulus set as known-gap control conditions.
  4. Tool consistency logging. The Maps error is a clean demonstration of why CISP's tool consistency field needs to record not just "web ON/OFF" but which grounding source was active.
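A minimal sketch of the logging point in item 4: record the active grounding source alongside the toggle, and validate the pair before accepting a run. Field and function names here are illustrative, not the actual CISP schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class GroundingRecord:
    """Per-run grounding log entry. Field names are illustrative,
    not the actual CISP schema."""
    toggle_on: bool
    source: Optional[str]  # e.g. "google_search", "google_maps", or None when OFF

def run_is_valid(rec, expected_source):
    # Valid only when toggle state and actual source agree:
    # OFF runs must have no source; ON runs must use the expected one.
    if not rec.toggle_on:
        return rec.source is None
    return rec.source == expected_source

# The Maps error from section 4: toggle ON, wrong retrieval source.
maps_run = GroundingRecord(toggle_on=True, source="google_maps")
assert not run_is_valid(maps_run, expected_source="google_search")
```

Under a check like this, the Maps run fails validation even though its toggle state matched the protocol.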

8. Classification and standing

Tier C. Atlas-loaded session box. Prompt was Atlas-adjacent. No isolation record. No CISP versioning. No fresh session enforcement beyond researcher discipline.

session_context: atlas-loaded
prompt_context: atlas-referenced

The findings are directional, not confirmatory. They are logged here as development notes and will inform BSA design. They will not be merged with Tier A data. If any values from this session ever appear in the pipeline, they go in as Tier C rows with both contamination fields flagged, in separate columns from governed data. The Maps run (2,968 tokens) is excluded entirely and will not be ingested at any tier.


All articles on this website are an artefact of its creation; LLM synthesis and review are used to verify data and citations. Atlas Heritage Systems Inc. — Endurance. Integrity. Fidelity.
Gap Analysis Development Notes v1.3 — April 7, 2026 — Tier C. Not peer-reviewed. Not Tier A data. Not in suite.
Supersedes v1.0, v1.1, and v1.2.