How I Failed My Own Test
I built a protocol stack for treating LLMs like instruments instead of oracles — and then totally ignored it. A reflexive lab note on human failure modes, Gemini weirdness, and getting my methods handed back to me by DeepSeek.
tl;dr I painstakingly built a protocol stack for treating LLMs like instruments instead of oracles — and then totally ignored it. A "simple" transcript‑parsing task on Gemini turned into a full reflexive analysis of why lab protocols exist. This is a lab note about human failure modes and a reminder that the easiest way to break your own science is to treat your best protocols as optional when the task feels small.
1. Scene‑setting: "I thought this was just Gemini being Gemini"
This is a lab note about how I broke my own rules, blamed the model, and then had to ask another model to explain my mistake back to me.
I was running what I thought was a simple extraction task on Gemini: same prompt, big transcript files, pull out the divergence‑test runs into a tidy little table. The outputs looked… weird. Tone weird. Inconsistent data weird. Same prompt, same general task, but the shape and tone of the answers kept drifting between instances: summary length changing, different details surviving, different quotes getting pulled, numbers rounded differently, sometimes "near zero," sometimes "+0.001." It all lived in the uncanny valley between "basically fine" and "I don't trust this."
I knew something was off, but my caffeine‑deprived brain couldn't put two and two together.
Here's the thing — I already had a full methodology stack built: CISP, ECM, Context Integrity, Prompt Best Practices, Technician's Guides, a Failure‑Mode Taxonomy, the whole thing. In other words, I had designed the good thing: a way to catch sloppy LLM work, including my own. I just didn't use it.
So I decided to let DeepSeek be the adult for a change. I fed the transcripts from this morning's session and my methods stack to DeepSeek and asked it to walk through what happened in those "weird" Gemini runs. DeepSeek politely — and very systematically — explained that the "bug" was not Gemini. It was me. I had failed my own protocol in about four different ways at once, and my fancy failure‑mode taxonomy could prove it.
This note is the story of that failure. I want you to see how I failed, because it's easy to do. Reflexive methodology is a cornerstone of good science. Let's take a look at how I ignored my own advice.
2. The method stack I already had and didn't bother to use
By the time I ran those "it's just some transcript parsing" Gemini jobs, I wasn't flying blind. I had a whole stack of protocols written down. Not ideas, not vibes — actual documents with names. I treated research‑grade protocols like optional guidelines because the task felt casual. Problematic.
CISP (Clean Isolation Session Protocol) — CISP was already telling me: fresh session, one job per instance, DECLARE FIRST, no long wandering chats, no bundling unrelated tasks together. On paper, I knew not to mix multiple files in one long, dirty chat. In practice, I shrugged at my own rules, stuffed two overlapping transcripts into a single instance "to save context window," and then acted surprised when the outputs drifted. The protocol that was supposed to protect me was sitting right there; I just treated it like advice instead of a constraint.
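To make that concrete, here's a minimal sketch of what enforcing CISP mechanically could look like. The `SessionDeclaration` class and its field names are my illustration for this note, not something lifted from the published protocol:

```python
from dataclasses import dataclass, field

# Hypothetical sketch -- the class and field names are illustrative,
# not part of the CISP document itself.
@dataclass
class SessionDeclaration:
    task: str                          # one job, declared up front
    source_files: list[str] = field(default_factory=list)
    fresh_session: bool = True         # never reuse a dirty context

    def validate(self) -> None:
        """Refuse to run a session that bundles work or reuses context."""
        if not self.fresh_session:
            raise ValueError("CISP: start a fresh session, not a wandering chat.")
        if len(self.source_files) != 1:
            raise ValueError(
                f"CISP: one document per instance, got {len(self.source_files)}."
            )
        if not self.task.strip():
            raise ValueError("CISP: DECLARE FIRST -- state the task before executing.")

# The run that burned me: two overlapping transcripts in one instance.
decl = SessionDeclaration(
    task="Extract divergence-test runs into a table",
    source_files=["transcript_day1.md", "transcript_day2_overlap.md"],
)
try:
    decl.validate()
except ValueError as err:
    print(err)  # CISP: one document per instance, got 2.
```

The point isn't the code; it's that the rule is checkable before the run, not after.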
ECM (Epistemic Canary Matrix) — ECM was my way of putting numbers to vibes: preamble length, token ratios, behavioral quadrants, resolution codes. I could have taken those Gemini runs, measured how much preamble fluff they used, how verbose they got, how their "posture" shifted between instances, and had hard evidence that the instrument wasn't stable. Instead, I did what ECM was invented to prevent: I relied on a gut sense that "this feels inconsistent" and never actually ran the metrics.
Context Integrity — Context Integrity is the part of the stack that says "don't paste overlapping drafts into the same context and then pretend repetition doesn't matter." I wrote down failure modes like Frequency‑Weighted Distortion specifically to describe the way models overweight repeated claims and underweight unique details. Then I turned around and fed Gemini multiple, overlapping transcripts from the same multi‑day conversation in the same session, and treated the outputs as if they were clean extractions. If I had applied my own Context Integrity rules, that entire setup would have been disqualified before I ever hit run.
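A Context Integrity gate like that is easy to mechanize. The sketch below uses word n‑gram overlap to disqualify a bundle before it ever reaches a session; the `0.2` threshold and the function names are my assumptions here, not numbers from the protocol:

```python
def shingles(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Word n-grams ("shingles") as a crude fingerprint of a transcript."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(a: str, b: str) -> float:
    """Jaccard similarity between two transcripts' shingle sets."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def check_context_integrity(docs: dict[str, str], threshold: float = 0.2) -> None:
    """Disqualify any bundle whose files overlap enough to invite
    Frequency-Weighted Distortion. The threshold is illustrative."""
    names = list(docs)
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            r = overlap_ratio(docs[x], docs[y])
            if r > threshold:
                raise ValueError(
                    f"Context Integrity: {x} and {y} share {r:.0%} of their "
                    "content; repeated claims would be frequency-weighted."
                )
```

Run against the two transcripts I actually bundled, a check like this would have thrown before Gemini ever saw them.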
Prompt Best Practices — I had already documented how to front‑load the task, use DECLARE FIRST, close in the vocabulary of the operation, and explicitly deny the model its favorite exits into "helpfulness." Instead, my Gemini prompt left plenty of wiggle room: it didn't nail down quote selection, didn't forbid inference, didn't clearly spell out what to do when a field was missing. The model did exactly what my "don'ts" predict — smoothed, inferred, and elaborated — and I treated that as a surprise rather than a direct consequence of ignoring my own playbook.
Technician's Read and Adversarial Review — On top of all that, I already had a procedure that says: read the raw data first, write down what you expect, treat the model's answer as something to falsify, and log named failure modes when you see them. I had built a ritual for catching Fever Dreams and Compass Needles before they baked into my conclusions. During the Gemini run, I did none of that. No pre‑commitment, no adversarial framing, no explicit "this is a Tier‑B mess" stamp. I only pulled those tools out later — when I handed everything to DeepSeek and, effectively, asked it to use my own methods to explain back to me how I'd managed to ignore them.
So when I say "I wrote the protocols and then ignored them," that's not a metaphor. I literally had the guardrails in place and decided they didn't apply because this felt like a "simple" task.
Methods: the full analysis loop
I ran this write-up as a deliberately multi‑agent, human‑in‑the‑loop pipeline. I was in every step.
- Me (needy operator) → define the question, collect transcripts, assemble the method stack
- Gemini (task) → perform the original "simple" transcript‑extraction task that failed my own protocols
- DeepSeek (reflexive analysis) → read CISP/ECM/Context Integrity/etc. and do the reflexive audit: map my behavior to named failure modes and protocol violations
- Claude (framing) → help draft the initial lab‑note narrative and framing
- Perplexity (synthesis) → help with synthesis and structural edits so the narrative is coherent end‑to‑end
- DeepSeek (consistency/fact check) → targeted fact‑check within the initial reflexive‑analysis instance; check for consistency, reasoning issues, and breaks in the narrative
- Skywork (sanity check) → final sanity check on logic, internal consistency, and whether the conclusions actually follow from the logged evidence
- Perplexity (continuity review) → ensure the whole write‑up matches prior Atlas lab‑note tone and doesn't contradict earlier methods work
- Me (final check) → accept or reject changes, decide what to keep, and take responsibility for the version that ships
3. Asking an adult: DeepSeek as the meta‑analyst
Here is what I actually did: I handed my broken session, my full methods stack, and my vague sense that something was wrong to DeepSeek and asked it to explain my own methods back to me.
This is worth flagging as a deliberate methodological choice, not just a cry for help. The Atlas framework uses models as instruments — not oracles, not collaborators, not editors. An instrument does a specific job under controlled conditions and produces a readable signal. The question I brought to DeepSeek wasn't "what went wrong with Gemini?" It was: "Here is my protocol stack. Here are the outputs. Map one to the other." I was pointing the instrument at my own methodology instead of at the research object. That's a different use case than most people reach for, and it's the one that actually worked.
DeepSeek did not treat this as a vibes problem. It read CISP, ECM, Context Integrity, Prompt Best Practices, the Technician's Guide, and the Adversarial Review taxonomy. Then it calmly walked me through how my Gemini run violated each one in painful, specific order. Not "the outputs feel inconsistent" — session contamination at the CISP layer, Frequency‑Weighted Distortion from overlapping transcripts, Helpfulness‑as‑Elaboration from an underspecified prompt, no parameter logging, no pre‑commit, no explicit response plan for the failure modes I had already named and defined myself. It used my taxonomy as a decision procedure. Every symptom traced back to a named failure mode; every failure mode traced back to the protocol layer that should have caught it.
The thing that stung wasn't the diagnosis. It was the source. DeepSeek wasn't importing a foreign standard — it was holding my own checklist and reading it back to me. That's the use case I want to document here: models as instruments for reflexive meta-analysis, pointed not at the research object but at the researcher. It works. It's uncomfortable. It's exactly what the stack is supposed to do when someone — including the person who built it — cuts corners because the task felt small.
4. How I actually failed my own test (DeepSeek edition)
From the outside, the Gemini run looks like "same prompt, different files, inconsistent outputs." From the inside of my own taxonomy, it's a neat little stack of failure modes.
Session governance (CISP) — contamination by convenience — CISP says: fresh session, one document, DECLARE FIRST, then execute. I did the opposite: two large, overlapping transcripts per instance, no real declaration block, and the same prompt reused without front‑loading. That's a crystal-clear CISP violation: I treated a Tier‑A‑style extraction like a casual chat, and the instance‑to‑instance drift I saw is exactly what CISP is designed to prevent.
Context Integrity — Frequency‑Weighted Distortion in the wild — Context Integrity names Frequency‑Weighted Distortion as a failure mode: if you load overlapping drafts, repeated claims get overweighted, rare details get washed out. I did exactly that — multiple partial versions of the same multi‑day conversation in the same context — and then asked the model to treat them as if they were clean, independent inputs. The shifting summaries and quote choices I thought were a failure mode in Gemini were precisely what my own definition of Frequency‑Weighted Distortion predicts.
Prompt framing — Helpfulness‑as‑Elaboration and Compass Needle exits — My prompt asked for structured extraction but left all the doors open: it didn't pin down what counted as a "distinct mention," didn't specify quote selection rules, didn't forbid inference or cross‑file stitching, didn't enforce writing "null" for missing fields. That's textbook Helpfulness‑as‑Elaboration: the model tries to be useful by smoothing gaps and connecting dots I never told it to connect. On top of that, without a tight DECLARE FIRST scaffold, each fresh instance could reach for a different Compass Needle — a slightly different high‑probability pattern for "how one summarizes a transcript" — which is exactly what I observed.
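For contrast, here is roughly what a closed-door version of that prompt looks like. The wording below is my reconstruction of the constraints that were missing, not the prompt I actually ran:

```python
# Illustrative reconstruction -- not the original prompt.
EXTRACTION_PROMPT = """\
DECLARE FIRST: your only task is field extraction from ONE transcript.

Rules:
1. A "distinct mention" is a divergence-test run with its own run ID;
   never merge runs across sections.
2. Quote selection: copy the first verbatim sentence containing the
   reported value. Do not paraphrase.
3. No inference, no cross-file stitching, no summarizing around gaps.
4. If a field is missing, write "null". Do not estimate, round, or
   substitute "near zero" for a number.
5. Output only the table. No preamble, no commentary, no offers to help.
"""
```

Every numbered rule closes one of the exits listed above; rule 5 is the explicit denial of the "helpfulness" escape hatch.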
Measurement and posture (ECM) — flying without instruments — ECM exists so "this feels wrong" turns into numbers: preamble length (P), token ratio (R), quadrant, resolution code. On these runs, I collected none of that. I had a clear subjective sense that the outputs were drifting, but I never measured whether P was ballooning, whether R shifted from "surgical" to "verbose," or whether the model's epistemic posture actually migrated across instances. In ECM terms, I ran a Tier‑A‑ish task with Tier‑C instrumentation: no metrics, no signatures, just vibes.
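For the record, "running the metrics" is not heavy machinery. A crude version fits in a dozen lines; the thresholds and quadrant labels below are placeholders I'm inventing for illustration, not ECM's published boundaries:

```python
def ecm_snapshot(response: str, payload_marker: str = "|") -> dict:
    """Crude ECM-style posture reading for one model response.

    P = tokens of preamble before the first table row;
    R = total tokens divided by payload tokens.
    Thresholds and labels are illustrative placeholders.
    """
    preamble_lines = []
    for line in response.splitlines():
        if line.lstrip().startswith(payload_marker):
            break  # first table-looking line ends the preamble
        preamble_lines.append(line)
    p = len(" ".join(preamble_lines).split())
    total = len(response.split())
    payload = max(total - p, 1)
    r = total / payload
    quadrant = "surgical" if (p < 30 and r < 1.5) else "verbose"
    return {"P": p, "R": round(r, 2), "quadrant": quadrant}

# Snapshot every instance and diff the dicts: "this feels inconsistent"
# becomes a measured claim about instrument drift.
```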
Technician's Read and Adversarial Review — no pre‑commit, no adversary — The Technician's Read is supposed to pin me down before I see outputs: what I expect, what would count as failure, which comparisons matter. Adversarial Review is supposed to name the failure modes instead of just feeling annoyed. In the Gemini run, I did neither. I didn't write, "I expect identical field structure across instances; deviation means the instrument isn't stable." I didn't label anything as Compass Needle, Fever Dream, or Helpfulness‑as‑Elaboration. I just registered "weird" and moved on. By my own stack, that means I ran a test without an anchor and without an adversary — there was literally no way, at the time, for that session to "fail my test," because I hadn't actually written the test down.
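The pre-commit itself is the cheapest artifact in the whole stack. A minimal version, with a record schema I'm making up here for illustration rather than quoting from the Technician's Guide:

```python
import json
from datetime import datetime, timezone

# Hypothetical pre-commit record for a Technician's Read; the schema is
# my illustration, not the published template.
precommit = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "expectation": "Identical field structure across all instances.",
    "failure_condition": (
        "Any deviation in fields, quote selection, or rounding means "
        "the instrument is not stable."
    ),
    "comparisons": ["instance_1 vs instance_2", "instance_1 vs instance_3"],
    "failure_modes_to_watch": [
        "Compass Needle",
        "Fever Dream",
        "Helpfulness-as-Elaboration",
    ],
}

# Written BEFORE looking at any output, so the session has a test to fail.
with open("precommit_log.json", "w") as fh:
    json.dump(precommit, fh, indent=2)
```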
Put all of that together and the punchline is simple: the Gemini session didn't break my protocol; it revealed that I wasn't using it. When I finally turned the full stack — plus DeepSeek — back onto that run, what looked like model inconsistency resolved cleanly into human‑and‑method failure modes I had already named. I just hadn't taken them seriously when it felt like "just casual transcript parsing."
"Without CISP, ECM, and a Technician's Read, a model's output is not evidence about the model. It's evidence about the absence of a protocol. I confused those two."
5. What DeepSeek caught that I didn't
One way to read this whole episode is: I had a lot of opinions; DeepSeek had sanity checks.
Where I saw "Gemini is being a little chaotic," DeepSeek treated the situation like a proper audit. It pointed out missing pieces that I had mentally hand‑waved: no explicit logging of temperature or generation parameters, no written rule for what to do when a specific failure mode shows up (Compass Needle, Fever Dream, Helpfulness‑as‑Elaboration, etc.), and no hard boundary between "this is Tier A, trustable" and "this is Tier B, interesting but not for claims." In other words, I was running on intuition; DeepSeek was quietly asking, "Where are your checks?"
It also did something I hadn't done for myself: it used my taxonomy as an actual decision procedure. Instead of "that output feels off," it traced each symptom back to a named failure mode and then to the protocol layer that should have caught it — CISP for session contamination, Context Integrity for overlapping transcripts, Prompt Framing for underspecified extraction, ECM for missing metrics, Technician's Read for lack of pre‑commit. That turns a vague annoyance into a concrete diagnosis: here is exactly where the pipeline failed.
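Rendered as data, the traceback it performed looks something like this. The mapping is straight from the paragraph above; only the table form is mine:

```python
# Symptom -> (named failure mode, protocol layer that should have caught it)
DIAGNOSIS = {
    "instance-to-instance drift":    ("session contamination",         "CISP"),
    "shifting summaries and quotes": ("Frequency-Weighted Distortion", "Context Integrity"),
    "smoothed gaps, invented glue":  ("Helpfulness-as-Elaboration",    "Prompt Framing"),
    "no numbers behind 'feels off'": ("missing P/R metrics",           "ECM"),
    "no test the run could fail":    ("missing pre-commit",            "Technician's Read"),
}

for symptom, (mode, layer) in DIAGNOSIS.items():
    print(f"{symptom:32} -> {mode} (owned by {layer})")
```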
The bigger point for the lab notebook is that sanity checks aren't optional add‑ons; they're part of the instrument. DeepSeek just happened to be the entity running them this time. When I say it handed my failures back to me, what I really mean is: I finally saw my own method stack behaving the way I'd always claimed it should — treating a run as untrustworthy not because the model felt weird, but because the sanity checks weren't in place.
This is an important distinction: DeepSeek wasn't helping me design a protocol from scratch; it was performing a reflexive analysis on me against the one I had already written. That's the part that stung: by my own published standards, this run never should have made it out of the scratchpad. The logs aren't just data points for study — they're methodological guardrails that make sure the meat‑machine can give the bit‑machines consistent inputs. When I skip the logs because the task feels small, I'm not saving time. I'm just moving the failure downstream where it's harder to catch.
6. Takeaways for "future me" (and anyone else)
If it feels "casual," that's when the protocol is most important. The next time I catch myself saying "it's just some transcript parsing," that's my cue to slow down and actually run the full stack instead of free‑handing it. The boring, casual tasks are the ones that build the foundation; if the minutiae are handled poorly, the bigger ideas are built on shifting sands.
One file, one fresh session, DECLARE FIRST. No overlapping transcripts, no multi‑day chat histories jammed together, no "I'll just squeeze in one more file." If I can't describe the document as a single, clean unit, it doesn't belong in the same context window.
Sanity checks are part of the experiment, not a post‑hoc luxury. If I don't have ECM metrics, parameter notes, and a Technician's Read, then by definition the run is exploratory at best — not something to lean on for strong claims. Which means I need to re‑run my transcript files using my own methods. Not a total loss; the irony of being handed my own checklist by DeepSeek and then writing it up for the lab notebook is not lost on me.
Use models as meta‑analysts on purpose, not just in emergencies. DeepSeek shouldn't only show up when I feel stuck. It's genuinely useful as a standing methods reviewer that gets the same constraints I'd give a human auditor. Meta-analysis across the methodology is what makes the whole system work. I am fallible. The instruments I use are fallible. It's only when I get a bird's-eye view of what I'm analyzing — and in what context — that I can find where my methods and my practice diverge. That divergence is bias. It's a signal to be documented and examined, not papered over.
Treat "I got my own protocol handed back to me" as a success condition, not an embarrassment. If the stack can flag my mistakes before they turn into published conclusions, it's doing its job. The real failure would be running this kind of Gemini session again, keeping the data, calling it science, and filing it as "just casual parsing." The protocol isn't there for the easy days. It's there for exactly this.
Atlas Heritage Systems Inc. — Endurance. Integrity. Fidelity.
Lab Note — April 9, 2026 — Working document, not peer-reviewed.