Loss Landscape Vocabulary Framework

v13 · April 2026 · Atlas Heritage Systems · Working document — not a finished product

A note before the math

You don't need to understand any of this to read the Framework. But if you want to know why the Framework is built the way it is, the math is where the answer lives.

When a language model trains, it moves through a mathematical landscape — hills, valleys, flat plains — searching for the lowest point. The vocabulary on this page describes the features of that terrain: what makes one region harder to cross than another, what gets preserved in the difficult parts, and what gets smoothed away in the easy ones.

The archaeological claim Atlas makes is simple: the hard parts leave marks. Those marks are readable. That's what the instruments are built to find.

Start with the plain language description of each term. Follow the math when you need it.

How it all fits together

The Framework names the terrain. The instruments measure behavior on it. The schema defines how measurements get recorded. The protocols govern how they're taken — CISP is the governance layer that sits above every active instrument run, enforcing isolation, sequencing, and the human-judgment boundary.

Below the protocols, the automation layer handles transcription: parsing raw model output, computing what can be computed, and leaving blank what requires a Technician's call. Below that is the data the instruments produce over time — the actual record Atlas is building.

The geometry sits at the end of the chain. PyHessian doesn't measure behavior; it measures the mathematical terrain the Framework describes. When there's enough data, the Hessian eigenvalue analysis will either confirm the Framework's terrain claims or force a revision. Working hypotheses stay hypotheses until the math has something to argue with.
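To make concrete what the eigenvalue analysis computes, here is a minimal numpy sketch: power iteration recovering the top Hessian eigenvalue of a toy quadratic loss. PyHessian does the same job matrix-free on real networks, using Hessian-vector products instead of an explicit matrix. The function name and the toy Hessian below are illustrative, not part of the Atlas toolchain.

```python
import numpy as np

def top_hessian_eigenvalue(hessian, iters=200, seed=0):
    """Power iteration: dominant eigenvalue of a symmetric Hessian.
    PyHessian computes the same quantity matrix-free at scale."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=hessian.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hessian @ v
        v = hv / np.linalg.norm(hv)
    return float(v @ hessian @ v)  # Rayleigh quotient at convergence

# Toy quadratic loss L(θ) = ½ θᵀHθ, whose Hessian is H itself.
H = np.array([[4.0, 1.0],
              [1.0, 2.0]])
print(top_hessian_eigenvalue(H))  # ≈ 4.4142, i.e. 3 + √2, the top eigenvalue
```

A large top eigenvalue marks sharp curvature at the current point; a spectrum of small eigenvalues marks the flat plains the Framework's plain-language terms describe.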

Global Geometry

Properties that only appear when you look at the loss landscape as a whole: how minima connect, when behavior changes regime, and which weight configurations are secretly the same solution. These are not local qualifiers — they live between basins and across training trajectories. No single-point terrain or navigator measurement can see them. Promoted from coverage gaps (v11) to first-class terms (v12) following Nemotron-3-Super-120B adversarial review, April 2026.

Basin Connectivity · connected / isolated

Whether two minima are joined by a low-loss path in parameter space. Connected basins admit smooth interpolation between solutions; isolated basins require climbing a high-loss barrier to move between them. Connectivity is a global property no local qualifier can see — whether an apparently separate minimum is part of a larger connected valley is invisible from inside either minimum. A key open question for the Atlas framework: are the archaeological sinks isolated basins or branches of a connected low-loss manifold?

mode connectivity: a path γ(t) with γ(0) = θ_A, γ(1) = θ_B and L(γ(t)) ≤ L_max for all 0 ≤ t ≤ 1

Garipov et al. (2018), Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs; Draxler et al. (2018), Essentially No Barriers in Neural Network Energy Landscape
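The simplest connectivity probe is the straight line between two minima: evaluate the loss along the interpolation and measure the worst excess over the endpoints. A sketch, with a hypothetical `loss_barrier` helper and a one-dimensional double-well loss standing in for a real model; note that a curved low-loss path (Garipov et al.) can exist even when the straight line has a barrier, so this only upper-bounds the true barrier.

```python
import numpy as np

def loss_barrier(loss_fn, theta_a, theta_b, steps=51):
    """Loss along the straight line between two parameter vectors.
    The barrier is the worst excess loss over the endpoint maximum:
    near zero suggests a connected basin, large suggests isolation."""
    ts = np.linspace(0.0, 1.0, steps)
    losses = np.array([loss_fn((1 - t) * theta_a + t * theta_b) for t in ts])
    return losses.max() - max(losses[0], losses[-1])

# Toy double-well loss: minima at θ = ±1, a bump of height 1 at θ = 0.
loss = lambda th: float((th[0] ** 2 - 1.0) ** 2)
print(loss_barrier(loss, np.array([-1.0]), np.array([1.0])))  # 1.0
```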

Symmetry Orbits · equivalent / distinct

Families of weight configurations that implement the same function because of architectural symmetries. Permuting neurons, rescaling layers with inverse compensation, or flipping sign conventions can move the model far in parameter space while leaving behavior unchanged. Symmetry orbits explain why many distinct-looking minima are functionally the same basin. Distance in parameter space is not distance in function space without accounting for symmetry orbits first.

orbit(W) = { σ(W) : f_σ(W)(x) = f_W(x) ∀x }, σ ∈ symmetry group of the network

Entezari et al. (2022), The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks; Ainsworth et al. (2022), Git Re-Basin: Merging Models Modulo Permutation Symmetries
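The permutation case is easy to demonstrate directly: shuffle the hidden units of a tiny MLP, apply the same shuffle to the outgoing weights, and the function does not change even though the weights have moved. A self-contained sketch; the network and names are illustrative.

```python
import numpy as np

def mlp(x, W1, b1, w2):
    """Two-layer MLP, scalar output: f(x) = w2 · relu(W1 x + b1)."""
    return float(w2 @ np.maximum(W1 @ x + b1, 0.0))

rng = np.random.default_rng(0)
d, h = 3, 5
W1, b1, w2 = rng.normal(size=(h, d)), rng.normal(size=h), rng.normal(size=h)

# Permute the hidden units: a genuinely different point in parameter space...
perm = rng.permutation(h)
W1p, b1p, w2p = W1[perm], b1[perm], w2[perm]

# ...but exactly the same function: same orbit, same behavior.
x = rng.normal(size=d)
assert np.isclose(mlp(x, W1, b1, w2), mlp(x, W1p, b1p, w2p))

# Yet the parameter-space distance between the two "solutions" is large:
print(np.linalg.norm(W1 - W1p))  # nonzero unless perm happens to be the identity
```

This is why naive weight-space distance between checkpoints says little until orbits are factored out (the Git Re-Basin line of work).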

Phase Transitions · pre- / post-

Regime shifts in model behavior that emerge as a training control parameter crosses a threshold, often with smooth loss but discontinuous internal structure. Grokking is the canonical example — performance snaps from memorization-like to generalization-like after extended training at nearly constant loss. Phase transitions are global training phenomena, not single points in the landscape. The loss surface does not signal them. Behavioral metrics (accuracy, representation similarity) do. The transition happens in algorithm space while the terrain barely moves.

behavioral order parameter Φ(θ), e.g., the generalization gap; phase transition at λ* where ∂Φ/∂λ jumps while L(θ, λ) stays smooth

Power et al. (2022), Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets; Nakkiran et al. (2019), Deep Double Descent: Where Bigger Models and More Data Hurt
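The definition above suggests a simple detector: locate the transition as the largest jump in the order parameter per unit of the control parameter. A sketch on a synthetic training record (toy numbers, not real grokking data; the `transition_point` helper is hypothetical):

```python
import numpy as np

# Synthetic record: loss falls smoothly the whole way, but the behavioral
# order parameter (validation accuracy) snaps from a memorization-like
# plateau to a generalization-like one at step 600.
steps = np.arange(0, 1000, 10)
loss = np.exp(-steps / 400.0)                # smooth throughout
acc = np.where(steps < 600, 0.12, 0.98)      # discontinuous regime shift

def transition_point(control, order_param):
    """Estimate λ* as the location of the largest |∂Φ/∂λ| spike."""
    dphi = np.abs(np.diff(order_param)) / np.diff(control)
    return control[np.argmax(dphi)]

print(transition_point(steps, acc))   # 590: the last step before the snap
print(np.abs(np.diff(loss)).max())    # small everywhere: the loss never signals it
```

The point of the toy: the terrain metric (loss) is uninformative about the transition; only the behavioral metric sees it.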

Algorithmic Branches · shared / disjoint

Distinct functional algorithms that live on the same broad basin. Two checkpoint paths can converge to similar loss and perplexity while encoding different internal computations — different circuits, different attention patterns, different representation geometry. Branch structure captures the fact that a single connected region of low loss can contain multiple algorithmic solutions. Where regime switch describes when a model changes algorithm during training, algorithmic branches describe co-existing algorithms within the same low-loss region after training.

branch i: { θ : L(θ) ≈ L* and A(θ) = A_i }, where A(θ) is an algorithmic descriptor (e.g., CKA cluster, circuit assignment)

Zhang et al. (2021), Can You Learn an Algorithm?; related work on functional clustering of representations and mechanistic interpretability circuit analysis
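One concrete choice of descriptor A(θ) is linear Centered Kernel Alignment (Kornblith-style CKA) between representation matrices from two checkpoints: high CKA says the two encode similar geometry, low CKA at matched loss is evidence for disjoint branches. A minimal sketch on synthetic activations; the data and names are illustrative.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices (examples × features).
    Invariant to orthogonal transforms of either representation."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
reps_a = rng.normal(size=(100, 16))   # checkpoint A's hidden activations
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
reps_b = reps_a @ Q                   # A rotated: same geometry, new coordinates
reps_c = rng.normal(size=(100, 16))   # unrelated representation

print(linear_cka(reps_a, reps_b))  # ≈ 1.0: same branch, different coordinates
print(linear_cka(reps_a, reps_c))  # well below 1: plausibly a different branch
```

Clustering checkpoints by pairwise CKA is one way to assign the branch labels A_i in the definition above.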