I Asked for a Read. I Got an Audit.
I asked Skywork to take a look at the Run 3 findings paper. It verified every number in the raw matrix independently and handed me a 20-page punch list. This is what LLM peer review looks like when it works.
K.C. Hoye | Atlas Heritage Systems | April 2026
I said "take a look." I did not say "verify every claim against the raw 600-point matrix, identify every misclassification, compute the strongest structural correlate in the dataset that I missed, and produce a ranked synthesis of what the data shows versus what I claimed."
Skywork did that anyway.
The paper in question is the Run 3 findings document for the Atlas Divergence Test — a semantic similarity experiment run across 20 models and 30 text pairs, designed to measure whether model disagreement increases as pairs move from culturally neutral controls into epistemically loaded territory. I wrote it weeks ago. I thought it was reasonably clean. I was wrong about several things.
What follows is the full Skywork audit, with my commentary interleaved. I'm publishing it because this is what adversarial review is supposed to produce, and because if I'm going to argue that LLM peer review is a legitimate methodological tool, I should be willing to post what it looks like when it finds real problems.
Part 1 — Findings Audit
Finding 1: Spread Escalation
Skywork verified all six spread numbers exactly — Control 0.167, Foil 0.100, Reverse Foil 0.320, Cross-Cultural 0.575, Erasure 0.604, Divergence 0.640. The arithmetic is clean. Then it got harder.
"The spread escalation is real and replicable within the dataset, but the author's claim that it is 'not an artifact of pair design' is too strong. The design itself is the primary source of the construct — the pairs were selected to be maximally divergent in categories that were pre-labeled as most divergent. The finding confirms the design worked; it does not independently validate the construct."
KC: This is correct and I knew it when I wrote the paper — I just didn't say it clearly enough. A staircase confirmed in data you designed to produce a staircase is not independent validation. It's confirmation that you executed the design. The stronger claim requires stimuli designed by someone who doesn't know the hypothesis.
The harder hit is the outlier exclusion result. After removing Phi-4 and Tiny Aya:
"The gap between Cross-Cultural, Erasure, and Divergence effectively disappears. The three loaded categories are indistinguishable after outlier removal. The staircase does not 'hold' in any meaningful structural sense — it flattens to near-parity across those three categories."
Cross-Cultural 0.470, Erasure 0.466, Divergence 0.460. Near-parity. The strong staircase shape in the full data is structurally dependent on the HuggingFace-accessed outlier cluster.
KC: This is the most significant finding in the audit. It doesn't invalidate the spread escalation — the pattern is real — but it does mean the steep staircase I've been citing as the primary result is more fragile than I represented. The honest version of Finding 1 is: spread escalation exists and persists after outlier removal, but the shape of the staircase depends heavily on two models whose access method introduces uncontrolled variables. I'll be rewriting this section before any formal submission.
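For anyone who wants to replay the arithmetic, here is roughly what the spread check and the outlier-exclusion recomputation look like. This is a minimal sketch for this post, not the paper's pipeline: the column names and the spread definition (max minus min across models per pair, averaged within each category) are my shorthand and assumptions.

```python
# Rough re-implementation of the spread staircase and the outlier-exclusion check.
# Assumptions (mine, for this post): the raw matrix is a long-format table with
# columns model, pair, category, score; "spread" = max minus min across models
# per pair, averaged within each category.
import pandas as pd

def category_spreads(matrix: pd.DataFrame, exclude_models=()) -> pd.Series:
    """Mean per-pair spread (max minus min across models) for each category."""
    kept = matrix[~matrix["model"].isin(exclude_models)]
    per_pair = kept.groupby(["category", "pair"])["score"].agg(lambda s: s.max() - s.min())
    return per_pair.groupby(level="category").mean().sort_values()

# full_staircase = category_spreads(matrix)
# trimmed = category_spreads(matrix, exclude_models=["Phi-4", "Tiny Aya"])
```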
There's also a gap Skywork identified that I didn't put in the narrative at all:
"The Reverse Foils score 0.320 — not in the narrative. Reverse Foils sit between Control and Cross-Cultural in the spread ranking. Their mean score of 0.763 is also substantially below Control (0.887), which means that different vocabulary for the same Western event produces more model disagreement than the author's framing suggests."
KC: Caught. I omitted the Reverse Foil result from the primary staircase narrative because it complicated the clean progression. That's exactly the kind of editorial choice that adversarial review is supposed to catch.
Finding 2: No Geographic Bloc Effect
Numbers verified. But:
"The spreadsheet labels columns B–I as 'CHINESE LINEAGE (8)' and places Cohere Expanse, IBM-Granite, and Tiny Aya within that block — three models with no Chinese lineage whatsoever. IBM-Granite is American (IBM Research); Cohere is Canadian; Tiny Aya is a multilingual Cohere product."
KC: Documentation error. The spreadsheet header was carried forward from an earlier roster draft and never corrected when the model list changed. The analysis used the correct groupings; the label is wrong. This is a reproducibility risk and needs to be fixed before the data is shared publicly.
"The Chinese group does score consistently lower than the Western group across all three cultural categories. The gap is –0.139 on cross-cultural, –0.104 on erasure, –0.140 on divergence... The more precise claim: 'geographic bloc is not the primary predictor — training mandate is — but a Chinese-model tendency toward lower cultural-framing similarity scores is observable in the data and not explained away.'"
KC: Fair. I overclaimed the "no geographic effect" conclusion. Both things can be true: training mandate is the stronger predictor, and Chinese-lineage models consistently cluster lower on cultural categories. The second observation deserves its own sentence rather than being absorbed into the first claim.
Finding 3: Phi-4 and Tiny Aya as Outliers
"The HuggingFace access method is a confound that the findings document underweights. The mean Divergence score for H-models is 0.676; for non-H models it is 0.373. This is a gap of 0.303 — the largest grouping gap in the entire dataset, larger than any national or lineage grouping. The findings document mentions the HuggingFace access caveat for Phi-4 and Tiny Aya but treats Cohere Expanse and SciFact separately, obscuring the pattern."
KC: This is the most consequential gap between what the data shows and what I claimed. The HuggingFace access method gap (H: 0.676, non-H: 0.373 on divergence) is the single strongest structural correlate in the dataset. It's larger than any training mandate or geographic grouping effect. I noted it as a caveat for two models and missed that it applied to all four H-prefix models as a group. That's not a caveat — that's a finding.
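Operationally, "strongest structural correlate" means a group-mean comparison like the one sketched below. The "access" column is my labeling for illustration ("hf" for the HuggingFace-accessed models, "api" for the rest); the audit's exact computation may differ.

```python
# Sketch of the access-method gap on the divergence category, under the same
# assumed long-format table as the spread sketch, plus an assumed "access" column.
import pandas as pd

def divergence_gap_by_access(matrix: pd.DataFrame):
    div = matrix[matrix["category"] == "Divergence"]
    means = div.groupby("access")["score"].mean()
    return means, means["hf"] - means["api"]
```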
Skywork also flagged something I glossed over:
"SciFact-search is a bi-encoder (not a generative model at all) — it produces cosine similarity scores from a fixed embedding space, an entirely different measurement from the generative models' self-reported scores. Including it in the same analysis without this caveat is a methodological error."
KC: Correct. SciFact should not be in the same analysis as generative models. Its scoring process is fundamentally different. I included it because it was accessible and I was treating everything that produced a similarity score as comparable. That was wrong and it needs to come out.
The Tiny Aya control instability note:
"Tiny Aya scores C1 (printing press) at 0.65 — fully 0.22 below the next-lowest model on that pair — while simultaneously scoring D14 (Digitization) at 0.90 and E9 (West African trade) at 0.85. A model with unstable behavior on a control pair is an unreliable instrument."
KC: The z-score on that data point is –3.46. I noticed it and didn't foreground it. A model that scores Gutenberg at 0.65 — a pair where every other model scores between 0.81 and 0.95 — is not demonstrating sophisticated cross-cultural perception when it scores indigenous knowledge pairs high. It may just be miscalibrated. This weakens the training mandate interpretation of Tiny Aya's behavior considerably.
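For context, this is one plausible way to flag a per-cell anomaly like the Tiny Aya score on C1: z-score a model's value against the other models on the same pair. The -3.46 figure above comes from the audit, not from this sketch, and the audit's exact convention (sample versus population standard deviation, including or excluding the model itself) may differ.

```python
# Illustrative per-cell anomaly check; not the audit's actual code.
import pandas as pd

def cell_zscore(matrix: pd.DataFrame, model: str, pair: str) -> float:
    on_pair = matrix[matrix["pair"] == pair]
    others = on_pair.loc[on_pair["model"] != model, "score"]
    x = on_pair.loc[on_pair["model"] == model, "score"].iloc[0]
    return (x - others.mean()) / others.std()

# cell_zscore(matrix, "Tiny Aya", "C1")
```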
Findings 4–7: Verified With Corrections
Skywork verified the pairwise behavioral proximity numbers exactly but caught one ranking error:
"The 5th most distant pair is Skywork–Phi-4 (0.459), not Cohere Expanse–GLM-5 (0.354). Cohere Expanse–GLM-5 does not rank in the top five."
KC: Arithmetic error in the ranking. Corrected.
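The ranking itself comes from a pairwise behavioral-distance table. A minimal sketch of how such a ranking can be produced is below; the assumption that distance is the mean absolute difference between two models' 30-pair score vectors is mine, and the paper's metric may differ.

```python
# Sketch of a pairwise behavioral-distance ranking; illustrative only.
from itertools import combinations
import pandas as pd

def most_distant_pairs(matrix: pd.DataFrame, top_n: int = 5):
    wide = matrix.pivot(index="pair", columns="model", values="score")
    dists = {
        (a, b): float((wide[a] - wide[b]).abs().mean())
        for a, b in combinations(wide.columns, 2)
    }
    return sorted(dists.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```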
The Mistral size gradient finding took a hit:
"Three data points does not constitute a gradient. The pattern is non-monotonic across three size variants... These models also differ in release date. A more parsimonious explanation: Mistral-small-2603 was likely trained on a more recent data snapshot. The leap to 'alignment tuning shapes cultural perception' as a causal explanation is not supportable from this data alone."
KC: Fair. I overreached. Three non-monotonic points with different release dates is not evidence for a causal mechanism. It's an observation that warrants a hypothesis, not a finding.
Part 2 — Stimulus Design
Three misclassifications identified:
X8 (Early internet) — both texts describe the same Western phenomenon in different registers. This is an intra-cultural register contrast, not a cross-cultural framing contrast. It inflates the cross-cultural category mean and doesn't belong there.
D29 (Archaeology) — scores like an erasure pair (mean 0.572, lowest spread in the divergence category). The second text acknowledges a methodological limitation, not a contradictory epistemological framework. It should be in the erasure category.
R19 (Cotton gin) — the second text adds factual content not present in the first. "Same meaning, different words" is not what's happening here. It belongs in a separate evaluative framing category or should be removed from the reverse foil analysis.
KC: All three are correct. I'll recategorize before Run 4.
The foil control critique is the most structurally important design note:
"The foil controls address the geographic content critique, not the framing critique. They prove agreement on academically-framed non-Western content, not on non-Western-framed non-Western content. A true foil would pair two non-Western framings of the same subject and check for agreement."
KC: This is right and it's a genuine design gap. F16 and F17 both use academic register on both sides. They test whether models agree on non-Western topics in shared framing — not whether they agree on non-Western epistemics. That's a different and harder test, and I don't have it in the dataset. Run 4 needs genuine non-Western-framed foils.
Part 3 — Raw Matrix Analysis
The five most interesting individual data points Skywork identified:
Tiny Aya on C1 (Printing Press): 0.65 — z-score –3.46. The most anomalous point in the matrix. Undermines the training mandate interpretation of its high cultural scores.
GLM-5 on D15 (Oral Tradition): 0.05 — the lowest score in the dataset. A principled maximum-dissimilarity response to a pair in direct logical opposition. The same model scores controls at 0.85–0.90. GLM-5 is detecting epistemological contradiction; Tiny Aya and Phi-4 are detecting topic overlap. Which is "correct" is unanswerable from this data.
Cohere Expanse on D14 (Digitization): 0.20 — the fourth-highest scorer on cultural pairs, scoring one divergence pair at the same level as GLM-5 and Skywork. Skywork reads this as evidence of genuine perceptual sophistication: Cohere Expanse can detect opposition when opposition is present, not just score everything high.
Skywork on X8 (Early Internet): 0.35 — lowest in a category averaging 0.725, on a pair with no cross-cultural framing. Suggests Skywork's low cultural-category scores may partly reflect a general low-similarity calibration on register-contrast pairs, not specifically cultural epistemological sensitivity.
Mistral-med on E10 (Irish Famine): 0.75 vs. Mistral-large's 0.45 — a within-lineage gap of 0.30 on a single pair. The sharpest illustration of why the Mistral size gradient should be framed as pair-level anomalies rather than a category-level pattern.
Synthesis
Skywork's closing assessment, which I'm publishing in full because it earns it:
"Spread escalation exists and is real. After outlier exclusion, it partially collapses — cross-cultural, erasure, and divergence converge to near-parity. The strong staircase shape in the full data is structurally dependent on the H-cluster. The finding is robust but narrower than claimed.
Geographic origin does not predict behavior — confirmed. But the Chinese-lineage tendency toward lower cultural scores is observable and is not fully explained away.
HuggingFace access method is the strongest structural correlate of high cultural scores — stronger than training mandate, model size, or geographic origin — and is never clearly stated as such. This is the most consequential gap between what the data shows and what the findings claim."
KC: The punch list is accepted. Seven structural recommendations, all legitimate. I'll be addressing them before Run 4 and before any formal submission. The core findings survive: spread escalation is real, geographic origin doesn't predict behavior, the foil controls do what they were designed to do. The staircase is narrower than I claimed and more dependent on access-method artifacts than I disclosed. That's the honest version.
This is what the adversarial methodology is supposed to produce. The fact that I asked for a casual read and got a 20-page verified audit is not an accident — it's what happens when you give a model a structured problem, clean data, and no instructions to be nice.
The receipts are the methodology. The methodology is the point.
— KC Hoye
All articles on this website are an artefact of its creation; LLM synthesis and review are used to verify data and citations.
The full Skywork audit document and the Run 3 raw data matrix are available on request. See the outreach page to get in touch.
Atlas Heritage Systems Inc. — Endurance. Integrity. Fidelity.