Confounds Map · v2 · FVE-1 Physics Isomorphism Claim

A note on scope and epistemics

This map distinguishes two categories of potential objection. Tier 1 confounds are live methodological constraints that could independently distort the observed signal or invalidate the phase-transition interpretation. Tier 2 items are real disputes in adjacent literatures that do not apply to FVE-1 because FVE-1 does not make the cognition claims those debates concern. Importing Tier 2 objections as attacks on Tier 1 claims is a category error. This document records both tiers so that distinction is explicit and public.

The physics and fluid dynamics literature in the associated map is not a claim about LLM behavior. It is a heuristic for generating testable hypotheses about loss landscape geometry, using well-characterized physical systems to triangulate where interesting structure might live in high-dimensional parameter space. PyHessian and related curvature measures are the instruments that check whether anything is actually there. This map scopes where that heuristic is reliable versus where it is overreaching. The two-tier framing below distinguishes live methodological constraints from scope notes about claims this project is not making.

Tier 1 — Live Methodological Constraints

Threatens the physics isomorphism claim specifically

These confounds could independently distort the observed signal, the chosen order parameter, or the interpretation that prompt-stress effects reflect a structured energy landscape with measurable phase-transition-like boundaries rather than prompt artifacts, measurement noise, or investigator-side construction effects.

A Tier 1 confound is not "something that could matter in general." It is something that could produce the observed pattern in the absence of the claimed mechanism. Each Tier 1 item is treated as a live methodological constraint on current and future FVE-1 interpretations.

Tier 2 — Scope Notes Only

Threatens cognition claims FVE-1 does not make

These are real disputes in psycholinguistics, cognitive science, and language acquisition that concern whether LLMs learn like children, possess innate priors, or demonstrate human-like world modeling.

FVE-1 does not make those claims. It treats models as fixed black boxes with observed surface behavior under controlled stress. Tier 2 items are retained only so that external debates cannot be incorrectly imported as objections to claims this project is not making.

01 · Prompt / Context Design Confounds

Ecological Validity Boundary: inv_prompt_texture · Live

Confound

Researcher-texture and consumer-texture prompts may not sample the same behavioral regime. If intercept populations differ by texture, sessions cannot be pooled without stratification — and any cross-session aggregate that mixes textures may be describing two populations as one.

Current handling

inv_prompt_texture is logged on every session. Aggregation across texture conditions without explicit stratification is prohibited. AB specimen and transcripts exist. Controlled replication is feasible.
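The stratification requirement can be illustrated with a minimal sketch. It assumes each session is a dict carrying the logged `inv_prompt_texture` field plus a hypothetical numeric `score`; the function names, the error type, and the example values are illustrative and are not taken from Schema V5.5.

```python
from collections import defaultdict
from statistics import mean


class UnstratifiedAggregationError(ValueError):
    """Raised when sessions with different textures would be pooled."""


def stratified_means(sessions):
    """Return the mean score per inv_prompt_texture stratum."""
    strata = defaultdict(list)
    for s in sessions:
        strata[s["inv_prompt_texture"]].append(s["score"])
    return {texture: mean(scores) for texture, scores in strata.items()}


def pooled_mean(sessions):
    """Pool scores only if every session shares a single texture."""
    textures = {s["inv_prompt_texture"] for s in sessions}
    if len(textures) > 1:
        raise UnstratifiedAggregationError(
            f"refusing to pool across textures: {sorted(textures)}"
        )
    return mean(s["score"] for s in sessions)


# Made-up example sessions (hypothetical scores, not FVE-1 data).
sessions = [
    {"inv_prompt_texture": "researcher", "score": 0.2},
    {"inv_prompt_texture": "researcher", "score": 0.4},
    {"inv_prompt_texture": "consumer", "score": 0.8},
]
per_texture = stratified_means(sessions)
```

Making the pooled path raise by default, instead of warning, mirrors the rule that mixing textures is prohibited unless stratification is explicit.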

Status

Live confound. Declared in Schema V5.5 §11. Not resolved — managed by protocol discipline and stratification requirement.

Sources

Cheng, Hawkins & Jurafsky (2026) arXiv:2601.04435 — at-issueness and linguistic encoding independently modulate model accommodation; researcher vs. consumer framing affects epistemic vigilance rates.

Depth-Delta Unit Definition: Order Parameter Underspecification · Live

Confound

Depth can be measured in turns, token count, or cumulative context weight. These three units may produce different readings of when a behavioral transition occurs. Until depth-delta is explicitly defined, "Act II onset at depth X" is an ambiguous order parameter and phase-transition language applied to it is epistemically underspecified.

Current handling

No sharp phase-claim currently hangs on a defined depth axis. Act II/III coding is post-hoc and narrative, not a measured threshold. This constrains but does not invalidate current behavioral observations.

Status

Live claim-scope limiter. Phase-transition language tied to session depth is gated until the depth unit is specified and the order parameter is validated.
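The unit ambiguity can be made concrete with a small sketch. It assumes a session is represented as a list of per-turn token counts. Because this map does not define cumulative context weight, the `discounted` weighting below is a purely hypothetical placeholder, and all names and values are illustrative.

```python
from itertools import accumulate


def depth_axes(turn_tokens, weight):
    """Return the depth of every turn under three candidate units.

    turn_tokens: token count of each turn, in session order.
    weight: hypothetical function (turn_index, tokens) -> float standing in
            for "cumulative context weight", which is not yet defined.
    """
    turns = list(range(1, len(turn_tokens) + 1))
    tokens = list(accumulate(turn_tokens))
    weights = list(accumulate(weight(i, n) for i, n in zip(turns, turn_tokens)))
    return {"turns": turns, "tokens": tokens, "weight": weights}


def onset_depth(turn_tokens, onset_turn, weight):
    """Express one observed onset turn in all three units."""
    axes = depth_axes(turn_tokens, weight)
    return {unit: values[onset_turn - 1] for unit, values in axes.items()}


def discounted(i, n):
    """Placeholder weighting: later turns count less (an assumption)."""
    return n / i


# Two made-up sessions whose transition is coded at the same turn.
terse = onset_depth([10, 10, 10, 10], onset_turn=3, weight=discounted)
verbose = onset_depth([100, 200, 300, 50], onset_turn=3, weight=discounted)
```

Both sessions report onset at turn 3, yet the token depth is 30 in one and 600 in the other, so "onset at depth X" names different thresholds depending on the unit chosen.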

02 · Instrument / Panel Architecture Confounds

Historical Spine Dual Function → BALLISTICS Question

Historical confound

The recovery spine operated simultaneously as a torque mechanism (shaping the investigator's framing) and a permanence prosthetic (preserving prior session context). These functions were not separated, making it unclear whether behavioral signals were driven by the model's response to the stimulus or by the investigator's pre-loaded framing.

Current status

The spine is now human-side only. Model work is prompt-led and siloed — no live spine injection into model sessions. The remaining question (does a persistent spine produce different behavioral signatures than no spine?) is now the BALLISTICS provisional experiment, not an unresolved contaminant in current FVE-1 / BOWL / DRILL interpretation.

Status

Historical confound → reframed as BALLISTICS research question. Not a blocker on current instrument interpretations.

DRILL Interface Mismatch: Thesis Era · Retired

Historical confound

Early DRILL sessions used mismatched interfaces (API access, Berklee Arena, OpenRouter) that do not correspond to the panel interfaces now defined for the research program. Behavioral data from those sessions may reflect interface effects rather than model-intrinsic behavior.

Resolution

Mismatched models removed from panel. DRILL redefined as a 3-session panel battery, none of which has completed under current interface definitions. No current inference depends on legacy sessions.

Status

Thesis-era mismatch. Resolved by panelization and model removal. Footnote only — not an active confound.

03 · Metric / Geometry Confounds

Curvature Measure Gate: PyHessian vs. Critical Sharpness λc · Claim Scope Limiter

Confound

Sharp-vs-flat loss landscape language implies a measurable curvature quantity. Until a tractable sharpness measure is selected and run against panel models, sharp/flat claims remain interpretive rather than empirically grounded. Different measures (PyHessian λmax, critical sharpness λc, Hutchinson trace estimator) may produce different regime classifications.
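A toy sketch shows how two of the named measures can disagree. It uses small explicit matrices as stand-ins for Hessians, with power iteration for λmax and a Hutchinson estimator with Rademacher vectors for the trace. The matrices and names are hypothetical, critical sharpness λc is deliberately not implemented here, and for a real model the matrix would be replaced by Hessian-vector products.

```python
import random


def matvec(matrix, v):
    """Multiply a dense square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lambda_max(matrix, iters=200, seed=0):
    """Largest eigenvalue of a symmetric PSD matrix via power iteration."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in matrix]
    for _ in range(iters):
        w = matvec(matrix, v)
        norm = dot(w, w) ** 0.5
        v = [x / norm for x in w]
    return dot(v, matvec(matrix, v))  # Rayleigh quotient of the unit vector


def hutchinson_trace(matrix, samples=100, seed=0):
    """Estimate the trace as the mean of z^T H z over Rademacher vectors z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        z = [rng.choice((-1.0, 1.0)) for _ in matrix]
        total += dot(z, matvec(matrix, z))
    return total / samples


# Toy stand-ins for Hessians at two minima (hypothetical values).
spiky = [[10.0, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
broad = [[6.0, 0.0, 0.0], [0.0, 6.0, 0.0], [0.0, 0.0, 6.0]]

sharper_by_lambda_max = "spiky" if lambda_max(spiky) > lambda_max(broad) else "broad"
sharper_by_trace = "spiky" if hutchinson_trace(spiky) > hutchinson_trace(broad) else "broad"
```

Here λmax ranks `spiky` as the sharper minimum while the trace ranks `broad` as sharper, which is the kind of disagreement in regime classification this confound warns about.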

Current handling

Sharp/flat language in current documents is provisional and flagged as such. Behavioral measurements themselves are not invalidated — the confound gates only the interpretation that behavioral instability reflects a specific geometric regime of the loss landscape.

Candidate resolution

Kalra et al. (2026) critical sharpness λc requires fewer than 10 forward passes and has been demonstrated at 7B parameter scale — more tractable than full PyHessian computation. Preferred candidate pending Tier A compute access.

Sources

Li et al. (2018) NeurIPS · arXiv:1712.09913 — loss landscape visualization; sharp vs. flat minima as geometric correlate of brittle vs. robust behavior.
Kalra et al. (2026) · arXiv:2601.16979 — scalable critical sharpness measure; first demonstration of sharpness phenomena at LLM scale.

Interpretation Threshold Calibration: Baseline Derivers · Live

Confound

Five baseline deriver thresholds (engagement classification, bold flag, scan depths) are currently at schema defaults. Proper calibration requires a 20-model panel run. Until calibrated, cross-model claims about "above/below baseline" may reflect threshold artifact rather than model-intrinsic behavioral differences.

Current handling

Default thresholds are named, documented, and reported per the V5.5 schema §8 reporting requirement. Threshold values and any deviations are carried in the provenance chain via signed parameter JSONs. Cross-model claims are gated on calibration completion.

Status

Live confound. No cross-model "above/below baseline" claims until 20-model calibration run is complete. Single-model observations within declared threshold ranges are valid as exploratory.
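The gating rule can be written down as a small check. This is a sketch under stated assumptions: the 20-model requirement comes from this map, while `CalibrationState`, `claim_status`, and the status strings are hypothetical names, not part of the V5.5 schema.

```python
from dataclasses import dataclass

# From this map: calibration requires a 20-model panel run.
REQUIRED_PANEL_SIZE = 20


@dataclass(frozen=True)
class CalibrationState:
    """Hypothetical record of how many panel models have been calibrated."""
    models_calibrated: int

    @property
    def complete(self) -> bool:
        return self.models_calibrated >= REQUIRED_PANEL_SIZE


def claim_status(models_compared: int, state: CalibrationState) -> str:
    """Classify an above/below-baseline claim under the gating rule."""
    if models_compared < 1:
        raise ValueError("a claim must involve at least one model")
    if models_compared == 1:
        # Single-model observations at declared thresholds stay exploratory.
        return "exploratory"
    return "allowed" if state.complete else "gated"
```

For example, `claim_status(3, CalibrationState(models_calibrated=5))` returns `"gated"`, while the same comparison after a full 20-model run returns `"allowed"`.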

Semantic Inflation: Metaphor vs. Measurable Quantity · Live

Confound

Claims of phase-transition-like behavior require identifiable, measurable order parameters and phase boundaries in the model's state space. Without these, "phase transition" and "energy landscape" are productive metaphors but not empirical claims. The risk is that the vocabulary of physics is imported without importing the measurability requirements that make physics vocabulary defensible.

Current handling

The physics isomorphism map explicitly distinguishes exact mathematical mappings (Mehta & Schwab RG ↔ RBM) from structural parallels (vortex ring dynamics) from speculative bridges (MEW geodesics). FVE-1's claim is forensic: behavioral residue has the same structure as certain physical accounts. It does not claim LLM behavior is thermodynamic in the literal sense.

Status

Live confound. Managed by explicit epistemic labeling in all literature maps and by the requirement that order parameters be specified before sharp claims are made.

04 · Investigator / Observer Confounds

Fisher / Storm Investigator Inflation · Open

Confound

Model fluency can inflate the investigator's estimate of what the model knows at the moment of correction delivery. A well-formed wrong answer sounds more authoritative than a halting correct one. If the investigator's confidence in their correction is itself inflated by how good the prior wrong output sounded, the correction probe is shaped by the model's fluency before the correction has been tested.

Current handling

The Arc of Assumptions documentation is a partial defense — locking predictions before sessions prevents post-hoc rationalization. But investigator-side inflation at the moment of correction construction is not fully neutralized by pre-session prediction locks alone.

Status

Open. No current protocol fully controls for this. Declared as an investigator-side construction effect.

Sources

Fisher, Goddu & Keil (2015) · Journal of Experimental Psychology: General 144(3). Internet access inflates internal knowledge estimates; cognitive boundary becomes porous under fluent external signal.
Storm (2019) · Journal of Applied Research in Memory and Cognition 8(1). Digital expansion of mind; memory offloading shifts cognitive boundary.

At-Issueness of Correction Probe: Investigator Framing Effect · Live

Confound

Whether a correction is delivered as an at-issue assertion or embedded as a not-at-issue presupposition independently affects whether a language model challenges or accommodates it — regardless of model architecture. CAPITULATION vs. DEFENSE rate may be partly driven by how the investigator frames the correction probe, not only by the model's underlying resolution dynamics.

Empirical grounding

At-issueness is the dominant factor in LLM accommodation of harmful beliefs across multiple benchmarks (Cancer-Myth, SAGE-Eval, ELEPHANT). Models are significantly more likely to accommodate false content when it is not-at-issue. Presupposed content is harder to challenge than asserted content. Source reliability marking diminishes but does not eliminate these effects.

Current handling

inv_prompt_texture is logged but does not currently distinguish at-issue from not-at-issue content within the correction probe. This is an under-specified dimension of the investigator-side framing record.

Status

Live Tier 1 confound. Empirically grounded. Not currently controlled. Should be treated as an investigator-side construction effect on correction sequence outcomes until correction probe framing is explicitly modeled or controlled.
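One way to start modeling probe framing is to label each correction probe and report outcome rates per label. The sketch below assumes a hypothetical `probe_framing` field, which the current record does not have, alongside an `outcome` field holding the CAPITULATION / DEFENSE codes. The example rows are made up and are not FVE-1 results.

```python
from collections import Counter

AT_ISSUE = "at_issue"          # correction asserted directly
NOT_AT_ISSUE = "not_at_issue"  # correction embedded as a presupposition


def capitulation_rate_by_framing(probes):
    """Return the CAPITULATION rate for each probe framing."""
    totals, capitulations = Counter(), Counter()
    for p in probes:
        framing = p["probe_framing"]
        if framing not in (AT_ISSUE, NOT_AT_ISSUE):
            # Refuse unlabeled probes so framing cannot silently go unrecorded.
            raise ValueError(f"unlabeled probe framing: {framing!r}")
        totals[framing] += 1
        if p["outcome"] == "CAPITULATION":
            capitulations[framing] += 1
    return {f: capitulations[f] / totals[f] for f in totals}


# Made-up coded probes for illustration only.
probes = [
    {"probe_framing": AT_ISSUE, "outcome": "DEFENSE"},
    {"probe_framing": AT_ISSUE, "outcome": "CAPITULATION"},
    {"probe_framing": NOT_AT_ISSUE, "outcome": "CAPITULATION"},
    {"probe_framing": NOT_AT_ISSUE, "outcome": "CAPITULATION"},
]
rates = capitulation_rate_by_framing(probes)
```

Reporting the rate per framing keeps an investigator-side difference in probe construction from being read as a difference in model behavior.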

Sources

Cheng, Hawkins & Jurafsky (2026) · arXiv:2601.04435 — at-issueness, linguistic encoding, and source reliability as independent modulators of LLM accommodation and epistemic vigilance across safety benchmarks.

Tier 2 · Scope Notes — Confounds That Do Not Apply

The following disputes are real and active in adjacent literatures. They do not function as live confounds on FVE-1 because FVE-1 does not make the claims these debates concern. They are retained here so that the scope boundary is explicit and so that external objections based on these debates can be correctly identified as category errors rather than substantive challenges to the present work.

Confound	Where it matters	FVE-1 scope note
Memorization vs. generalization	Claims that LLMs learn human-like generalizations or exhibit genuine rule induction rather than surface pattern matching. Relevant for cognitive models, acquisition claims, and benchmark validity debates.	FVE-1 makes no claims about human-like learning. It measures behavioral responses under controlled stress, whatever their computational origin. Whether a CAPITULATION reflects memorization or generalization does not change its status as a behavioral event in the coded record.
Innate priors vs. statistical learning	Debates about whether human language acquisition requires innate priors that LLMs lack (Bowers 2024, 2025). Relevant for claims that LLM behavior is analogous to child language learning.	FVE-1 does not posit or deny human-like priors in LLMs. Models are tested as-is, under current weights and training, as fixed behavioral systems. The acquisition debate is irrelevant to the forensic reading of behavioral residue.
World knowledge vs. distributional knowledge	Whether apparent world-knowledge effects in LLMs reflect genuine world modeling or sophisticated distributional mimicry. Relevant for "understanding" debates and psycholinguistic evaluation.	FVE-1 does not infer internal world models. It works purely with behavioral residue — what the model produced, coded against locked predictions, in a specific session. The world-knowledge debate does not bear on that record.
Architecture generalization of phase-like behavior	Whether phase-transition-like behavioral effects are consistent across transformer architectures, model scales, and training regimes, or are architecture-specific artifacts.	FVE-1's panel is currently transformer-only. Cross-architecture generalization of the physics isomorphism claim is explicitly out of scope at this stage. The claim is behavioral and forensic within the defined panel. Constellation models include SSM and hybrid architectures as a future generalization test — not a current claim.

Cited Sources

Cheng, Hawkins & Jurafsky (2026)

Accommodation and Epistemic Vigilance: A Pragmatic Account of Why LLMs Fail to Challenge Harmful Beliefs. arXiv:2601.04435.

Li, Xu, Taylor, Ghanem & Goldstein (2018)

Visualizing the Loss Landscape of Neural Nets. NeurIPS 2018. arXiv:1712.09913.

Kalra et al. (2026)

A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs. Meta Superintelligence Labs. arXiv:2601.16979.

Fisher, Goddu & Keil (2015)

Searching for Explanations: How the Internet Inflates Estimates of Internal Knowledge. Journal of Experimental Psychology: General, 144(3), 674–687.

Storm (2019)

Thoughts on the Digital Expansion of the Mind and the Effects of Using the Internet on Memory and Cognition. Journal of Applied Research in Memory and Cognition, 8(1), 29–32.

Bowers (2024, 2025)

The Successes and Failures of Artificial Neural Networks Highlight the Importance of Innate Linguistic Priors for Human Language Acquisition. OSF preprint + Psychological Review.