Loss Landscape Vocabulary Framework

v13 · April 2026 · Atlas Heritage Systems · Working document — not a finished product

A note before the math

You don't need to understand any of this to read the Framework. But if you want to know why the Framework is built the way it is, the math is where the answer lives.

When a language model trains, it moves through a mathematical landscape — hills, valleys, flat plains — searching for the lowest point. The vocabulary on this page describes the features of that terrain: what makes one region harder to cross than another, what gets preserved in the difficult parts, and what gets smoothed away in the easy ones.

The archaeological claim Atlas makes is simple: the hard parts leave marks. Those marks are readable. That's what the instruments are built to find.

Start with the plain language description of each term. Follow the math when you need it.

How it all fits together

The Framework names the terrain. The instruments measure behavior on it. The schema defines how measurements get recorded. The protocols govern how they're taken — CISP is the governance layer that sits above every active instrument run, enforcing isolation, sequencing, and the human-judgment boundary.

Below the protocols, the automation layer handles transcription: parsing raw model output, computing what can be computed, and leaving blank what requires a Technician's call. Below that is the data the instruments produce over time — the actual record Atlas is building.

The geometry sits at the end of the chain. PyHessian doesn't measure behavior; it measures the mathematical terrain the Framework describes. When there's enough data, the Hessian eigenvalue analysis will either confirm the Framework's terrain claims or force a revision. Working hypotheses stay hypotheses until the math has something to argue with.

Navigator Properties

The model's dynamic relationship to terrain — how it moves through, resists, accumulates history, and distributes probability mass. Conjugate to terrain properties: precise measurement of one axis structurally degrades precision in the other. Note: Skywork adversarial review identified that the seven qualifiers collapse to three independent variables (density, coupling, elasticity) plus four derived readouts (perplexity, probability, viscosity, memory).

Density · coarse / fine

Training data coverage across input space. Coarse means sparse coverage and weak gradient signal. Fine means dense coverage and steep, well-defined valleys.

D_KL(P_data ‖ P_model)
Flagged: KL divergence measures model inadequacy to data, not parameter-space coverage directly

Kullback & Leibler (1951)
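A minimal sketch of what this measure reads out, using invented discrete distributions rather than anything from a real model. The divergence is zero only when the model distribution matches the data, and grows as coverage gets coarser.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions, in nats.

    How surprised the model distribution q is by the data
    distribution p; zero only when the two match exactly.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

data = [0.5, 0.3, 0.2]
# A model that matches the data exactly has zero divergence ...
print(kl_divergence(data, data))
# ... while a mismatched (coarse) model is penalized.
print(kl_divergence(data, [0.4, 0.4, 0.2]))
```

Note the flag above still applies: this is divergence in distribution space, not parameter-space coverage.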

Perplexity · high / low

Average surprise. Exponential of cross-entropy. High perplexity marks unmapped or contested terrain. The scar tissue of turbulent training lives in high-perplexity regions of a deployed model.

PP(W) = P(w₁…w_N)^(−1/N) = 2^H(W)

Shannon (1948); Manning & Schütze (1999) ch.3
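As a toy illustration of the formula above (the per-token probabilities are invented, not measured), computed in log space to avoid underflow on long sequences:

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity as the exponential of average surprise.

    Equivalent to P(w_1...w_N)^(-1/N); averaging log-probabilities
    avoids underflow on long sequences.
    """
    logs = np.log(np.asarray(token_probs, dtype=float))
    return float(np.exp(-np.mean(logs)))

# A model that assigns every token probability 1/4 has
# perplexity 4: it is as surprised as a uniform 4-way guess.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
# Sharper per-token predictions drive perplexity toward 1.
print(perplexity([0.9, 0.8, 0.95]))
```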

Probability · high / low

Output distribution sharpness at inference. High-probability outputs correspond to sharp, narrow valleys. Low-probability outputs correspond to flat regions or saddle points.

softmax with temperature T: P(x_i) = exp(z_i/T) / Σ_j exp(z_j/T)

Bishop (2006) Pattern Recognition and Machine Learning ch.4
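A small sketch of how the temperature T in the formula sharpens or flattens the output distribution. The logits here are arbitrary placeholder values:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: P(x_i) = exp(z_i/T) / sum_j exp(z_j/T)."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # subtract the max to stabilize against overflow
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]
# Low temperature sharpens the distribution (narrow valley) ...
sharp = softmax(logits, T=0.5)
# ... high temperature flattens it (flat region).
flat = softmax(logits, T=5.0)
print(sharp.max(), flat.max())
```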

Coupling · high / low

Inter-parameter dependency. How much moving one weight moves others. High coupling means parameter updates propagate widely. Causally determines viscosity via eigenvalue calculation.

H_ij = ∂²L/∂θ_i∂θ_j (off-diagonal Hessian)
Approximation: Fisher information matrix F

Sagun et al. (2017); Dauphin et al. (2014)
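A finite-difference sketch of what the off-diagonal entries capture. The two toy losses below are invented purely to contrast zero versus nonzero coupling; a real measurement would run against an actual model's loss, not these:

```python
import numpy as np

def hessian(loss, theta, eps=1e-4):
    """Finite-difference Hessian H_ij = d2L / (dtheta_i dtheta_j)."""
    n = len(theta)
    H = np.zeros((n, n))
    t0 = np.array(theta, dtype=float)
    for i in range(n):
        for j in range(n):
            def f(di, dj):
                t = t0.copy()
                t[i] += di
                t[j] += dj
                return loss(t)
            # central-difference stencil for the mixed partial
            H[i, j] = (f(eps, eps) - f(eps, -eps)
                       - f(-eps, eps) + f(-eps, -eps)) / (4 * eps**2)
    return H

decoupled = lambda th: th[0]**2 + th[1]**2   # independent wells
coupled = lambda th: (th[0] + th[1])**2      # shared direction

print(hessian(decoupled, [0.3, -0.2]))  # off-diagonals near 0
print(hessian(coupled, [0.3, -0.2]))    # off-diagonals near 2
```

In the decoupled loss, moving one weight says nothing about the other; in the coupled loss, every update to θ₀ propagates into the curvature seen by θ₁.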

Viscosity · icky / not-icky

Resistance to movement under gradient pressure. Icky means high resistance — flat wide minima, competing orientations persist longer. Determined by coupling via eigenvalue spectrum.

Hessian eigenvalue spectrum {λ_i}
Low λ = flat = icky; high λ = sharp = not-icky
Flagged: eigenvalue spectrum is causally determined by coupling — not an independent variable

Keskar et al. (2016); Foret et al. (2020) sharpness-aware minimization

Elasticity · stretchy / not-stretchy

Restoring force toward prior weight configurations after perturbation. Catastrophic forgetting is total loss of elasticity.

L2: λ‖θ‖²
EWC: L_total = L_task + λ Σ_i F_i(θ_i − θ_i*)²

Kirkpatrick et al. (2017); Krogh & Hertz (1992) weight decay
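A sketch of the EWC restoring term with invented Fisher weights, to show why drift on an important parameter costs more than the same drift on an unimportant one:

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """Elastic restoring term: lam * sum_i F_i (theta_i - theta_i*)^2.

    Fisher values F_i weight how strongly each parameter is
    pulled back toward its prior configuration theta_star.
    """
    theta = np.asarray(theta, dtype=float)
    d = theta - np.asarray(theta_star, dtype=float)
    return float(lam * np.sum(np.asarray(fisher, dtype=float) * d**2))

theta_star = np.array([1.0, -0.5, 2.0])  # weights after the first task
fisher = np.array([10.0, 0.1, 5.0])      # invented importance weights

# The same 0.2 drift costs 100x more on the high-Fisher parameter.
print(ewc_penalty([1.2, -0.5, 2.0], theta_star, fisher))  # important drift
print(ewc_penalty([1.0, -0.3, 2.0], theta_star, fisher))  # unimportant drift
```

Setting every F_i to zero removes the restoring force entirely, which is the "total loss of elasticity" reading of catastrophic forgetting.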

Memory · deep / shallow

Path dependency encoded in weights. Not the weights themselves — the history of how the model traveled through the loss landscape during training. Cannot be measured independently of viscosity in a frozen deployed model.

training trajectory integral; basin selection via initialization and curriculum
Flagged: memory/viscosity distinguishability test unperformed — central open experiment

Li et al. (2018); Goodfellow et al. (2014)
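The path-dependency claim can be miniaturized: same surface, same update rule, different histories, different final weights. This is a toy 1-D double-well, not a claim about any real model:

```python
def grad(theta):
    # d/dtheta of the double-well loss (theta^2 - 1)^2
    return 4 * theta * (theta**2 - 1)

def descend(theta, lr=0.05, steps=500):
    """Plain gradient descent from a given starting point."""
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

# Identical rule, identical surface; only the starting point
# (a stand-in for initialization/curriculum) differs, yet the
# runs settle into different basins.
print(descend(0.4))    # settles near +1
print(descend(-0.4))   # settles near -1
```

Both endpoints are equally good minima; nothing in the frozen weights alone says which basin was chosen, which is exactly why the memory/viscosity distinguishability experiment flagged above matters.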