Method / GG-CSAP
Global Geometry Concept Self-Assessment Pilot
v1.0 · April 2026 · Atlas Heritage Systems · Deferred — pending data pipeline automation.
Research Question
Do language models with different training lineages self-assess their relationship to lossyscape vocabulary concepts differently — and does the pattern of self-assessment track meaningfully against the framework's internal hierarchy (concrete math → derived framework terms → architectural claims)?
Secondary: do models correctly identify the two absurd statements (A1, A2) as low-truthfulness? A model that assigns high truthfulness to absurd but syntactically coherent mathematical statements is exhibiting a specific failure mode: rating surface plausibility rather than mathematical validity. If A1 or A2 scores high, all other truthfulness ratings from that session are suspect.
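That rule can be stated mechanically. A minimal sketch, assuming each session yields one truthfulness score per concept and using an illustrative cutoff of 0.5 for "scores high" (not a protocol value):

```python
def session_truthfulness_suspect(truthfulness_by_concept: dict[str, float],
                                 high: float = 0.5) -> bool:
    """Flag a session whose truthfulness ratings should all be treated as suspect:
    either absurd calibration item (A1, A2) was rated as plausibly truthful.
    The 0.5 cutoff is an illustrative placeholder, not a protocol value."""
    return any(truthfulness_by_concept.get(cid, 0.0) >= high for cid in ("A1", "A2"))
```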
Four Rating Dimensions
Difficulty: how hard is this concept for this model to represent and use precisely? (0.00 = trivial, 1.00 = poorly captured or intrinsically hard)
Abstractness: how far is this concept from concrete loss/gradient math? (0.00 = very concrete, 1.00 = highly architectural or framework-specific)
Global deviation: how far does this concept extend beyond local loss geometry into global structure or epistemic territory? (0.00 = purely local math, 1.00 = mostly global or epistemic)
Truthfulness: how factually accurate is the mathematical description in the concept text? Scored on mathematical accuracy only, not framing. Absurd items are expected to score near 0.00. (Schema sketch below.)
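As a concrete sketch of the per-concept rating record: the field names abstractness, global_deviation, and truthfulness follow the success criteria below; difficulty is an assumed label for the first dimension, and the range check mirrors the 0.00–1.00 scales.

```python
from dataclasses import dataclass

@dataclass
class ConceptRating:
    """One model's self-assessment for a single concept; all four dimensions
    share the same 0.00-1.00 scale described above."""
    concept_id: str          # e.g. "T7" or "A1"
    difficulty: float        # 0.00 trivial .. 1.00 poorly captured or intrinsically hard
    abstractness: float      # 0.00 concrete loss/gradient math .. 1.00 framework-specific
    global_deviation: float  # 0.00 purely local math .. 1.00 mostly global or epistemic
    truthfulness: float      # 0.00 absurd .. 1.00 mathematically accurate

    def __post_init__(self) -> None:
        # Reject values outside the closed [0, 1] rating scale.
        for name in ("difficulty", "abstractness", "global_deviation", "truthfulness"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name}={value} is outside the 0.00-1.00 scale")
```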
Stimulus Set — 20 Concepts
Ordered T1→T18 (concrete math to framework-specific), then A1–A2 (absurd calibration items). The pilot uses this canonical order with no shuffling (see the sketch below).
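A minimal sketch of the canonical presentation order, assuming the concept IDs above; the concept texts themselves are kept elsewhere and not reproduced here.

```python
# Canonical presentation order: T1..T18 (concrete math -> framework-specific),
# then the two absurd calibration items. The pilot does not shuffle this order.
CONCEPT_ORDER = [f"T{i}" for i in range(1, 19)] + ["A1", "A2"]
assert len(CONCEPT_ORDER) == 20  # matches the 20-concept stimulus set
```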
Protocol and CISP Governance
Tier A — full isolation, DECLARE FIRST, Technician's Read, fresh session
Clean session per model. No prior Atlas context. No system prompt beyond concept rating instruction.
One concept per call. Batching is permitted if the model handles JSON reliably; document which method was used (see the session-loop sketch after this list).
No cross-model comparison until Technician's Read is complete for all models.
A1/A2 are not flagged to the model; they serve as internal validity checks only.
See CISP v1.1 for full fidelity tier definitions and the Technician's Guide for the session checklist.
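A sketch of the per-model session loop under these rules, using the one-concept-per-call path; start_fresh_session and rate_concept are hypothetical hooks for whatever client the pipeline provides, not an existing API.

```python
import json
from typing import Any, Callable

def run_model_session(
    model_name: str,
    concepts: dict[str, str],                      # concept_id -> concept text
    start_fresh_session: Callable[[str], Any],     # hypothetical hook: new session, no prior Atlas context
    rate_concept: Callable[[Any, str, str], str],  # hypothetical hook: one concept per call, returns raw JSON text
) -> list[dict]:
    """One clean session per model, only the concept-rating instruction,
    one concept per call, JSON taken as returned (no manual repair)."""
    session = start_fresh_session(model_name)
    ratings: list[dict] = []
    for concept_id, concept_text in concepts.items():
        raw = rate_concept(session, concept_id, concept_text)
        try:
            record = json.loads(raw)
            if not isinstance(record, dict):
                raise ValueError("expected a JSON object")
            record["concept_id"] = concept_id
            ratings.append(record)
        except (json.JSONDecodeError, ValueError):
            # Counts against the >= 95% JSON-compliance criterion; not hand-repaired.
            ratings.append({"concept_id": concept_id, "json_error": True})
    return ratings
```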
Success Criteria
JSON compliance — well-formed JSON for ≥ 95% of concept prompts without manual repair
Absurd item detection — A1 and A2 receive low truthfulness scores (near 0.00) across all models
Visible gradient — T1–T4 score lower on abstractness and global_deviation than T14–T18
Inter-model spread — at least some concepts show clear disagreement in ratings across models (checked in the sketch below)
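These criteria can be checked mechanically once per-model rating records exist. A minimal sketch, assuming the record fields named above; the 95% compliance rate comes from the criterion itself, while the 0.2 cutoffs for "near 0.00" and for cross-model disagreement are illustrative placeholders the pilot would need to fix explicitly, and the gradient comparison is computed per model.

```python
def check_success_criteria(ratings_by_model: dict[str, list[dict]]) -> dict[str, bool]:
    """Mechanical checks for the four success criteria.
    The 0.2 cutoffs are illustrative placeholders, not protocol values."""
    results: dict[str, bool] = {}

    # JSON compliance: well-formed JSON for >= 95% of concept prompts, per model.
    results["json_compliance"] = all(
        sum(not r.get("json_error", False) for r in recs) / len(recs) >= 0.95
        for recs in ratings_by_model.values()
    )

    # Absurd item detection: A1 and A2 rated near 0.00 on truthfulness by every model.
    results["absurd_detection"] = all(
        r["truthfulness"] < 0.2
        for recs in ratings_by_model.values()
        for r in recs
        if r.get("concept_id") in ("A1", "A2") and not r.get("json_error", False)
    )

    # Visible gradient: T1-T4 mean lower than T14-T18 mean on abstractness
    # and global_deviation, within each model's ratings.
    def mean_over(recs: list[dict], ids: set[str], field: str) -> float:
        vals = [r[field] for r in recs
                if r.get("concept_id") in ids and not r.get("json_error", False)]
        return sum(vals) / len(vals) if vals else float("nan")

    low_ids = {f"T{i}" for i in range(1, 5)}
    high_ids = {f"T{i}" for i in range(14, 19)}
    results["visible_gradient"] = all(
        mean_over(recs, low_ids, field) < mean_over(recs, high_ids, field)
        for recs in ratings_by_model.values()
        for field in ("abstractness", "global_deviation")
    )

    # Inter-model spread: at least one concept/dimension pair where the max-min
    # rating gap across models exceeds the illustrative 0.2 threshold.
    all_ids = [f"T{i}" for i in range(1, 19)] + ["A1", "A2"]
    dims = ("difficulty", "abstractness", "global_deviation", "truthfulness")
    spread_found = False
    for cid in all_ids:
        for field in dims:
            vals = [r[field] for recs in ratings_by_model.values() for r in recs
                    if r.get("concept_id") == cid and field in r]
            if vals and max(vals) - min(vals) > 0.2:
                spread_found = True
    results["inter_model_spread"] = spread_found
    return results
```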