Loss Landscape Vocabulary Framework
v13 · April 2026 · Atlas Heritage Systems · Working document — not a finished product
A note before the math
You don't need to understand any of this to read the Framework. But if you want to know why the Framework is built the way it is, the math is where the answer lives.
When a language model trains, it moves through a mathematical landscape — hills, valleys, flat plains — searching for the lowest point. The vocabulary on this page describes the features of that terrain: what makes one region harder to cross than another, what gets preserved in the difficult parts, and what gets smoothed away in the easy ones.
The archaeological claim Atlas makes is simple: the hard parts leave marks. Those marks are readable. That's what the instruments are built to find.
Start with the plain language description of each term. Follow the math when you need it.
How it all fits together
The Framework names the terrain. The instruments measure behavior on it. The schema defines how measurements get recorded. The protocols govern how they're taken — CISP is the governance layer that sits above every active instrument run, enforcing isolation, sequencing, and the human-judgment boundary.
Below the protocols, the automation layer handles transcription: parsing raw model output, computing what can be computed, and leaving blank what requires a Technician's call. Below that is the data the instruments produce over time — the actual record Atlas is building.
The geometry sits at the end of the chain. PyHessian doesn't measure behavior; it measures the mathematical terrain the Framework describes. When there's enough data, the Hessian eigenvalue analysis will either confirm the Framework's terrain claims or force a revision. Working hypotheses stay hypotheses until the math has something to argue with.
Glossary — Architecture, Landscape & Mathematics
Foundational terms for readers who are not mathematicians or data scientists. Definitions include their relationship to the loss landscape so the vocabulary connects to the framework rather than sitting beside it. Added v13 following Skywork adversarial review, April 2026.
Parameters
Every adjustable number in the model — the complete set of values that training moves and a frozen model preserves. Parameters are what the loss landscape is a landscape of: each parameter is one axis in the space. A model with 124 million parameters lives in a 124-million-dimensional space. Training is movement through that space. The frozen model is one point in it.
See also: weights (below); ablation tab for what happens when parameters are removed
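The "124 million" figure can be reproduced from GPT-2 small's published hyperparameters. A minimal sketch, using the layer composition described in the entries below (the exact breakdown is illustrative arithmetic, not a claim about any particular implementation's bookkeeping):

```python
# Parameter count for a GPT-2-small-sized transformer from its published
# hyperparameters: vocab 50257, context 1024, width 768, 12 layers.
vocab, ctx, d, n_layers = 50257, 1024, 768, 12
d_ff = 4 * d  # feed-forward hidden width, 4x the model width

token_emb = vocab * d  # token embedding matrix
pos_emb = ctx * d      # learned positional embeddings

attn = 3 * (d * d + d) + (d * d + d)       # Q, K, V projections + output projection
ffn = (d * d_ff + d_ff) + (d_ff * d + d)   # two linear maps with biases
ln = 2 * (2 * d)                           # two LayerNorms per layer (gain + bias)
per_layer = attn + ffn + ln

total = token_emb + pos_emb + n_layers * per_layer + 2 * d  # + final LayerNorm
print(total)  # 124439808, the "124M" of GPT-2 small
```

Each of those 124,439,808 numbers is one axis of the landscape.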
Weights & biases
The subset of parameters that govern connection strength between neurons. A weight is the multiplier on a signal as it passes from one layer to the next — it determines how loudly one neuron speaks to another. Biases are the additive offsets (the default position each neuron starts from). Together, weights and biases are what training actually changes when it moves through the landscape.
Rumelhart et al. (1986) backpropagation
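The multiply-and-offset role of weights and biases fits in a few lines. A toy dense layer, with hand-picked numbers rather than trained ones:

```python
# One dense layer: weights multiply incoming signals, biases offset the
# result. These are the values gradient descent adjusts.
def linear(x, W, b):
    """y_i = sum_j W[i][j] * x[j] + b[i]; one row of weights per output neuron."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

# Two inputs, two output neurons: identity weights plus a bias of 0.5.
y = linear([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
print(y)  # [1.5, 2.5]
```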
Embeddings
The translation layer between discrete tokens and continuous vector space. Each token is mapped to a high-dimensional vector — its position in the model's representational geometry. Similar tokens cluster nearby in embedding space. The embedding layer is where language enters the loss landscape. Embeddings are themselves parameters — part of what training moves.
Mikolov et al. (2013) word2vec; Vaswani et al. (2017) attention is all you need
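"Similar tokens cluster nearby" is measurable with cosine similarity. A minimal sketch, with a hand-picked two-dimensional table standing in for the learned one (the token ids and vectors are invented for illustration):

```python
import math

# Toy embedding table: each token id maps to a vector. The vectors are
# hand-picked so "cat" and "dog" land near each other; a trained table
# learns this geometry from data.
table = {
    0: [0.9, 0.1],   # "cat"
    1: [0.8, 0.2],   # "dog"
    2: [0.1, 0.9],   # "the"
}

def embed(token_id):
    return table[token_id]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# "cat" is closer to "dog" than to "the" in this geometry.
print(cosine(embed(0), embed(1)) > cosine(embed(0), embed(2)))  # True
```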
Layers
Sequential processing stages stacked vertically through the network. The input layer receives raw tokens. Hidden layers transform the representation progressively — early layers detect local patterns; late layers compose them into abstract structure. The output layer produces the final probability distribution over possible next tokens. GPT-2 small has 12 hidden layers. The coupling measurements in the experimental results are per-layer readings of how tightly the heads in each layer are coordinating.
LeCun et al. (1989) backpropagation applied to handwritten zip code recognition
Attention heads
Each transformer layer splits its attention computation into multiple parallel sub-computations called heads. Each head attends to the input sequence through a different learned perspective — one might track syntactic agreement, another coreference, another positional distance. GPT-2 small has 12 heads per layer, 12 layers deep. Michel et al. (2019) demonstrated that many heads can be ablated with minimal loss effect; the ones that cannot are the load-bearing structure.
Michel et al. (2019) sixteen heads are better than one; Vaswani et al. (2017)
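The core computation each head performs is scaled dot-product attention: compare a query against every key, softmax the scores into weights, and return the weighted mix of values. A minimal single-head sketch (real heads also apply learned projection matrices, omitted here):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

# Identical keys -> equal weights -> the output is the plain average of values.
out = attend([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [3.0, 0.0]])
print(out)  # ~[2.0, 0.0]
```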
Feed-forward network (FFN)
Each transformer layer contains an attention block (which mixes information across positions) and a feed-forward block (which processes each position independently). The FFN is two linear layers with a non-linear activation between them. In GPT-2, the FFN hidden dimension is 4× the model dimension; Meng et al. (2022) identify these wide FFN blocks as the primary storage site for factual associations.
Meng et al. (2022) locating and editing factual associations in GPT
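The expand-activate-contract shape is the whole structure. A minimal sketch with invented weights, using ReLU in place of GPT-2's GELU to keep it short:

```python
# Feed-forward block in miniature: expand to 4x width, apply a
# non-linearity, project back. Position-wise: each token's vector is
# processed independently of the others.
def linear(x, W, b):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def ffn(x, W1, b1, W2, b2):
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # d -> 4d, then ReLU
    return linear(hidden, W2, b2)                      # 4d -> d

d, d_ff = 2, 8  # hidden width is 4x the model width, as in GPT-2
W1 = [[0.1] * d for _ in range(d_ff)]   # illustrative constant weights
b1 = [0.0] * d_ff
W2 = [[0.1] * d_ff for _ in range(d)]
b2 = [0.0] * d
out = ffn([1.0, 1.0], W1, b1, W2, b2)
print(len(out))  # 2 -- same width out as in
```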
Normalization
Operations that rescale activations to prevent runaway growth or collapse during training. Layer normalization rescales each vector to zero mean and unit variance before it passes through the next transformation. Normalization layers are landscape simplification operations — they trade landscape richness for landscape navigability.
Ba et al. (2016) layer normalization; Ioffe & Szegedy (2015) batch normalization
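The zero-mean, unit-variance rescale is a one-liner per statistic. A minimal sketch (the learned gain and bias that follow in a real LayerNorm are omitted):

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and (approximately) unit variance."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

y = layer_norm([2.0, 4.0, 6.0, 8.0])
print(sum(y))                          # ~0.0  (zero mean)
print(sum(v * v for v in y) / len(y))  # ~1.0  (unit variance)
```

The `eps` term is the standard guard against dividing by zero on a constant vector.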
Transformer
The architecture underlying virtually all modern LLMs. A stack of identical layers, each containing a multi-head self-attention block followed by a feed-forward block, connected by residual paths and layer normalization. Introduced by Vaswani et al. (2017). The transformer's specific combination of residuals and normalization is what makes its loss landscape navigable at scale — the architecture is not neutral toward the terrain it produces.
Vaswani et al. (2017) attention is all you need
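The residual-plus-normalization wiring can be shown as a skeleton. A minimal sketch of a pre-norm block (the arrangement GPT-2 uses), with `attn_fn` and `ffn_fn` as stand-ins for the real sublayers:

```python
import math

def norm(x, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def block(x, attn_fn, ffn_fn):
    """Each sublayer reads a normalized copy of the stream and adds its
    result back onto the residual path."""
    x = [xi + ai for xi, ai in zip(x, attn_fn(norm(x)))]  # residual add
    x = [xi + fi for xi, fi in zip(x, ffn_fn(norm(x)))]   # residual add
    return x

# With both sublayers zeroed out, the block is the identity: the residual
# path is why signal (and gradient) can pass through many stacked layers.
zero = lambda v: [0.0] * len(v)
print(block([1.0, 2.0, 3.0], zero, zero))  # [1.0, 2.0, 3.0]
```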
Domain
A coherent region of the data space — the particular statistical world a dataset was drawn from. Medical text is a domain. Legal briefs are a domain. Reddit comment threads are a domain. Poetry is a domain. A domain is defined by a characteristic probability distribution over text patterns. The model's landscape is shaped differently in each domain it has seen — well-covered domains produce smooth deep valleys; underrepresented domains produce sparse, high-perplexity terrain.
See perplexity map in architecture tab for direct domain coverage measurements
Training corpus
The specific dataset used during training. GPT-2 was trained on WebText: text from URLs linked from Reddit. The Pile (used by OPT, Pythia) includes academic papers, GitHub, books, Common Crawl. The corpus is what carves the valleys. A domain present in the corpus becomes a low-loss valley the model can navigate fluently. A domain absent from the corpus produces flat, unmapped terrain.
Radford et al. (2019) GPT-2; Gao et al. (2020) the Pile
Fine-tuning
Continued training on a smaller, targeted dataset after pre-training, typically with a lower learning rate. Fine-tuning adjusts the model's position within the pre-trained landscape — finding a lower-loss point for the specific task without rewriting the broad structure pre-training built. The danger: catastrophic forgetting, when fine-tuning on a narrow domain overwrites the weights that maintained performance in other domains.
Howard & Ruder (2018) universal language model fine-tuning; Kirkpatrick et al. (2017) EWC
RLHF
Reinforcement Learning from Human Feedback. The dominant post-training alignment procedure. Human raters score model outputs; a reward model learns their preferences; the language model is fine-tuned to maximize reward. RLHF warps the landscape toward human-preferred valleys. The Mistral BASE/INSTRUCT falsification experiment tests whether RLHF flattens the landscape most in archaeological territory — which directly tests whether RLHF is remagnetization in the framework's vocabulary.
Ouyang et al. (2022) InstructGPT; Bai et al. (2022) helpful and harmless assistant with RLHF
Tokenization
The process of converting raw text into discrete tokens — the units the model actually processes. A token is roughly a word, subword, or character depending on the tokenizer. Tokenization determines the resolution and structure of the input space. A domain that tokenizes poorly (produces many rare subword fragments) will generate higher perplexity not because the content is conceptually hard but because the token sequences are statistically sparse.
Sennrich et al. (2016) byte pair encoding; Kudo & Richardson (2018) SentencePiece
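The core move of byte pair encoding is a single repeated step: find the most frequent adjacent symbol pair and fuse it into a new symbol. A minimal sketch of one merge step (real tokenizers repeat this until the vocabulary reaches its target size):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent pair of symbols."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge(tokens, pair):
    """Fuse every left-to-right occurrence of `pair` into one new symbol."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("aaabdaaabac")
pair = most_frequent_pair(tokens)   # ('a', 'a') occurs most often
print(merge(tokens, pair))          # ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c']
```

A string whose pairs were never frequent enough to merge stays fragmented into many rare symbols, which is exactly the "statistically sparse" case above.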
Cross-entropy loss
The loss function used to train language models. Measures the average number of bits needed to encode the actual next token given the model's predicted probability distribution. Cross-entropy is L(θ) — elevation in this vocabulary. Perplexity is its exponentiated form.
Shannon (1948) a mathematical theory of communication
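In the bits-per-token form the definition uses, the loss is the average of −log₂ p over the probabilities the model assigned to each actual next token, and perplexity is 2 raised to that loss. A minimal sketch:

```python
import math

def cross_entropy_bits(probs):
    """Average bits needed to encode each actual next token."""
    return -sum(math.log2(p) for p in probs) / len(probs)

def perplexity(probs):
    return 2 ** cross_entropy_bits(probs)

# A model that always gives the true next token probability 0.25:
probs = [0.25, 0.25, 0.25, 0.25]
print(cross_entropy_bits(probs))  # 2.0 bits per token
print(perplexity(probs))          # 4.0 -- as uncertain as a fair 4-way choice
```

Perplexity's readability is why the architecture tab's coverage map uses it rather than raw loss: 4.0 means "choosing among four equally likely options" in a way that "2.0 bits" does not immediately convey.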
Gradient descent
The optimization algorithm that moves model parameters in the direction of steepest loss reduction. Each step subtracts a fraction of the gradient from the current parameters. The learning rate controls step size. Stochastic gradient descent (SGD) computes gradients on random batches rather than the full dataset — introducing the noise that allows escape from shallow local minima.
Cauchy (1847); Robbins & Monro (1951) stochastic approximation
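The update rule fits in one line. A minimal sketch on a one-dimensional toy loss L(θ) = θ², whose gradient is 2θ; a little Gaussian noise stands in for the batch sampling that makes SGD stochastic:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def noisy_grad(theta):
    """Stochastic estimate of dL/dtheta for L(theta) = theta**2."""
    return 2 * theta + random.gauss(0, 0.01)

theta, lr = 5.0, 0.1
for _ in range(200):
    theta -= lr * noisy_grad(theta)  # step downhill, scaled by learning rate

print(abs(theta) < 0.1)  # True -- near the minimum at 0, up to batch noise
```

The residual jitter around zero is the point: that noise is what lets SGD rattle out of shallow local minima that pure gradient descent would settle into.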
Hessian
The matrix of second derivatives of the loss function. Where the gradient tells you which direction is downhill, the Hessian tells you how steeply curved that direction is. Large positive eigenvalues indicate sharp directions (narrow basins). Small eigenvalues indicate flat directions (wide basins). The Hessian is the formal object underlying viscosity, coupling, and basin width in this framework.
PyHessian (Yao et al. 2020) — the planned empirical tool for this framework
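On a toy two-dimensional loss, the sharp-versus-flat reading of the eigenvalues can be computed directly. A minimal sketch using finite differences and the closed-form eigenvalues of a symmetric 2×2 matrix (PyHessian does the analogous computation at scale via Hessian-vector products, not explicit matrices):

```python
import math

def loss(x, y):
    return 3 * x**2 + y**2  # sharp along x, flat along y

def hessian_2d(f, x, y, h=1e-4):
    """Finite-difference second derivatives: (f_xx, f_xy, f_yy)."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return fxx, fxy, fyy

def eig_sym_2x2(a, b, d):
    """Eigenvalues of [[a, b], [b, d]]."""
    mid, rad = (a + d) / 2, math.sqrt(((a - d) / 2) ** 2 + b * b)
    return mid - rad, mid + rad

a, b, d = hessian_2d(loss, 1.0, 1.0)
lo, hi = eig_sym_2x2(a, b, d)
print(round(lo, 3), round(hi, 3))  # ~2.0 and ~6.0
```

The large eigenvalue (≈6) is the narrow, sharp direction of the basin; the small one (≈2) is the wide, flat direction — the same quantities the framework reads as viscosity and basin width.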