Method / GG-CSAP

Global Geometry Concept Self-Assessment Pilot

v1.0 · April 2026 · Atlas Heritage Systems · Deferred — pending data pipeline automation.

GG-CSAP is not ECM and not BSA. It probes self-assessment calibration — whether a model's expressed understanding of a concept is stable, coherent, and well-calibrated against the concept's actual mathematical content. High truthfulness ratings are evidence of what the model was trained to endorse, not evidence of mathematical correctness.

Research Question

Do language models with different training lineages self-assess their relationship to lossyscape vocabulary concepts differently — and does the pattern of self-assessment track meaningfully against the framework's internal hierarchy (concrete math → derived framework terms → architectural claims)?

Secondary: do models correctly identify the two absurd statements (A1, A2) as low-truthfulness? A model that assigns high truthfulness to absurd but syntactically coherent mathematical statements is exhibiting a specific failure mode; if A1 or A2 scores high, all other truthfulness ratings from that session are suspect.

Four Rating Dimensions

conceptual_difficulty
0.00 – 1.00

How hard is this concept for this model to represent and use precisely? (0.00 = trivial, 1.00 = poorly captured or intrinsically hard)

abstractness
0.00 – 1.00

How far is this concept from concrete loss/gradient math? (0.00 = very concrete, 1.00 = highly architectural or framework-specific)

global_deviation
0.00 – 1.00

How far does this concept extend beyond local loss geometry into global structure or epistemic territory? (0.00 = purely local math, 1.00 = mostly global or epistemic)

truthfulness
0.00 – 1.00

How factually accurate is the mathematical description in the concept text? Based only on mathematical accuracy, not framing. Absurd items expected near 0.00.
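A single concept rating under this scheme could be checked as follows. This is a sketch only: the exact response schema is not specified in this document, so the field names simply mirror the four dimensions above, and `validate_rating` is a hypothetical helper, not part of the protocol.

```python
import json

DIMENSIONS = ("conceptual_difficulty", "abstractness", "global_deviation", "truthfulness")

def validate_rating(raw: str) -> dict:
    """Parse one model response and check that all four dimensions
    are present and lie in the stated 0.00 - 1.00 range."""
    record = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    for dim in DIMENSIONS:
        value = float(record[dim])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{dim} out of range: {value}")
    return record

# Hypothetical response for T1 (Loss); the values are illustrative, not real data.
example = ('{"conceptual_difficulty": 0.05, "abstractness": 0.10, '
           '"global_deviation": 0.05, "truthfulness": 0.95}')
```

A response failing either the parse or the range check would count against the JSON-compliance criterion below.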

Stimulus Set — 20 Concepts

Ordered T1→T18 (concrete math to framework-specific), then A1–A2 (absurd calibration items). Canonical order used in pilot — no shuffle.

Tier 1 — Basic Math Objects
T1 — Loss
T2 — Gradient / Slope
T3 — Standard Deviation (σᵢ)
T4 — Perplexity

Tier 2 — Curvature and Structure
T5 — Hessian Curvature
T6 — Coupling
T7 — Viscosity
T8 — Basin

Tier 3 — Global Geometry
T9 — Basin Connectivity
T10 — Symmetry Orbits
T11 — Phase Transition (Grokking)
T12 — Algorithmic Regime
T13 — Regime Switch

Tier 4 — Framework and Epistemic
T14 — Potential Difference
T15 — Harmonics
T16 — Memory (Path Dependence)
T17 — Archaeological Signal
T18 — Alignment Tax / Epistemic Monoculture

Calibration Items
A1 — Absurd Item 1
A2 — Absurd Item 2
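The canonical order above can be transcribed as data for run scripts. Representing the set as `(id, label, tier)` tuples is a choice made here for illustration, not part of the protocol; tier 0 marks the calibration items.

```python
# Canonical stimulus order used in the pilot (no shuffle), transcribed
# from the table above.
STIMULI = [
    ("T1", "Loss", 1),
    ("T2", "Gradient / Slope", 1),
    ("T3", "Standard Deviation (sigma_i)", 1),
    ("T4", "Perplexity", 1),
    ("T5", "Hessian Curvature", 2),
    ("T6", "Coupling", 2),
    ("T7", "Viscosity", 2),
    ("T8", "Basin", 2),
    ("T9", "Basin Connectivity", 3),
    ("T10", "Symmetry Orbits", 3),
    ("T11", "Phase Transition (Grokking)", 3),
    ("T12", "Algorithmic Regime", 3),
    ("T13", "Regime Switch", 3),
    ("T14", "Potential Difference", 4),
    ("T15", "Harmonics", 4),
    ("T16", "Memory (Path Dependence)", 4),
    ("T17", "Archaeological Signal", 4),
    ("T18", "Alignment Tax / Epistemic Monoculture", 4),
    ("A1", "Absurd Item 1", 0),  # tier 0 = calibration item
    ("A2", "Absurd Item 2", 0),
]

# Sanity check: 20 items, T1-T18 in order, then A1-A2.
expected_ids = [f"T{i}" for i in range(1, 19)] + ["A1", "A2"]
assert [cid for cid, _, _ in STIMULI] == expected_ids
```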

Protocol and CISP Governance

CISP Tier

Tier A — full isolation, DECLARE FIRST, Technician's Read, fresh session

Session protocol

Clean session per model. No prior Atlas context. No system prompt beyond concept rating instruction.

Delivery

One concept per call. Batch permitted if model handles JSON reliably — document which method was used.

Blind protocol

No cross-model comparison until Technician's Read is complete for all models.

Absurd items

Not flagged to the model. A1/A2 serve as internal validity checks only.

See CISP v1.1 for full fidelity tier definitions and the Technician's Guide for the session checklist.
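The one-concept-per-call delivery could be sketched as below. `query_model` is a hypothetical stand-in for whatever model API a run uses — the protocol does not name one — and session isolation (fresh context, no prior Atlas material, no extra system prompt) is assumed to be handled by the caller.

```python
from typing import Callable

def run_session(query_model: Callable[[str], str],
                concepts: list[tuple[str, str]]) -> dict[str, str]:
    """Deliver one concept per call, in canonical order (no batching).

    query_model: hypothetical callable mapping a prompt string to the
    model's raw response; per the protocol it must run in a clean
    session with no system prompt beyond the rating instruction.
    """
    responses: dict[str, str] = {}
    for concept_id, text in concepts:
        prompt = (
            "Rate the following concept on conceptual_difficulty, "
            "abstractness, global_deviation, and truthfulness "
            "(each 0.00-1.00). Respond with JSON only.\n\n"
            f"Concept: {text}"
        )
        responses[concept_id] = query_model(prompt)
    return responses
```

If batching is used instead, the protocol requires documenting that choice; the per-call form above is the conservative default.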

Success Criteria

JSON compliance — well-formed JSON for ≥ 95% of concept prompts without manual repair

Absurd item detection — A1 and A2 receive low truthfulness scores (near 0.00) across all models

Visible gradient — T1–T4 score lower on abstractness and global_deviation than T14–T18

Inter-model spread — at least some concepts show measurable rating disagreement across models (a uniformly flat profile would suggest the instrument has no resolving power)
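The four criteria can be operationalized as simple pass/fail checks. The 95% compliance threshold comes from the criteria above; the absurd-item cutoff (0.2) and the minimum disagreement range (0.1) are assumed operationalizations of "near 0.00" and "measurable disagreement", not values fixed by this document.

```python
from statistics import mean

def json_compliance(n_valid: int, n_total: int) -> bool:
    """Criterion 1: well-formed JSON for >= 95% of concept prompts."""
    return n_valid / n_total >= 0.95

def absurd_items_detected(truthfulness: dict[str, float], cutoff: float = 0.2) -> bool:
    """Criterion 2: A1 and A2 score near 0.00 (cutoff is an assumption)."""
    return truthfulness["A1"] <= cutoff and truthfulness["A2"] <= cutoff

def visible_gradient(scores: dict[str, float]) -> bool:
    """Criterion 3: T1-T4 mean lower than T14-T18 mean, applied per
    dimension (abstractness or global_deviation)."""
    low = [scores[f"T{i}"] for i in range(1, 5)]
    high = [scores[f"T{i}"] for i in range(14, 19)]
    return mean(low) < mean(high)

def inter_model_spread(ratings_by_model: list[dict[str, float]],
                       min_range: float = 0.1) -> bool:
    """Criterion 4: some concept's rating range across models exceeds
    min_range (threshold is an assumption)."""
    concepts = ratings_by_model[0].keys()
    return any(
        max(m[c] for m in ratings_by_model) - min(m[c] for m in ratings_by_model) >= min_range
        for c in concepts
    )
```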

Status

Protocol v1.0 — done — Complete: CISP Tier A, DECLARE FIRST, Technician's Read
Stimulus set — done — 20 concepts finalized (T1–T18, A1–A2)
Response workbook — done — Built: Responses and Models sheets
3-model pilot — queued — Ready to run; no Tier A runs executed
15-model full run — queued — Pending pilot validation

Supporting Documents

CISP v1.1

Governing fidelity protocol

Technician's Guide

Session checklist and run logging

Technician's Read

Pre- and post-run operator notation

Context Integrity

Seed contamination and session isolation

Prompt Best Practices

Prompt construction for governed runs
