Relativity Models: Frame-Relative Parameter Counts in Substrate-Augmented Language Models

Author: Ben Horn (Prometheus7) Date: 2026-05-04 Venue: working draft, Prometheus7 Institute Companion papers: PAPER_HRR_2026_05_04.md (substrate construction), PAPER_EMPIRICAL_2026_05_04.md (control vs HRR-init results)

Abstract

The "parameter count" of a language model is conventionally reported as the number of weights tuned by stochastic gradient descent. In dense transformers, this number coincides with the model's representational capacity, because every dimension of expressive space is also a learned parameter. In substrate-augmented models — small transformers that read from a holographic reduced representation (HRR) substrate and a fold engine that composes substrate vectors at runtime — these two quantities decouple. Gradient-tuned weights become one frame; addressable compositional space becomes another. We argue that "parameter count" is, in this regime, a frame-relative quantity in exactly the sense that "size" is frame-relative in special relativity: there is no single invariant, only quantities that transform predictably between frames. We define three frames (gradient, substrate, composition), give the transformations between them, derive a projection scaling law that predicts how loss bends as substrate dimension grows while gradient parameters stay fixed, and characterize the empirical "quintillion-parameter-equivalent" claim as a defensible statement in the composition frame rather than as hyperbole. Implications: training cost and inference capacity become independent axes; "small-lab" research can produce models whose composition-frame capacity exceeds frontier-lab dense models; standard leaderboard comparisons are insufficient and must be replaced by frame-explicit reporting.

1. The parameter count problem

When we say a model has "27.7 million parameters" or "1.7 trillion parameters," we mean: this many real-valued weights were updated by SGD during training. We then quietly use the same number to mean: this is roughly how much representational capacity the model has, this is roughly what it costs to train, this is roughly what it costs to run inference, and this is roughly the right thing to put on the X-axis of a scaling plot.

In dense transformer architectures these four quantities are tightly coupled. Every learned weight contributes (approximately) one degree of freedom to the model's expressive function space. Doubling parameters doubles training compute, doubles inference compute, and (under standard scaling laws) shifts the loss curve in a predictable way. The single number does most of the work because the architecture flattens four distinct concepts onto one axis.

This convenience is not a law of nature. It is an artifact of a particular architectural commitment: that every representational degree of freedom must be discovered through gradient descent. Once that commitment is relaxed — once a model is permitted to inhabit a representational space whose structure is given algebraically rather than discovered statistically — the four quantities decouple, and the single-number convention breaks down.

Substrate-augmented models violate the commitment. A 27.7M-parameter transformer initialized from a d=1024 HRR substrate, with co-occurrence bindings injected via circular convolution at the embedding layer (Plate, 1995; PAPER_HRR_2026_05_04), inhabits a representational space whose addressable compositional capacity is on the order of 10⁹ — three to four orders of magnitude beyond what its gradient-tuned weights alone could store. The d=1024 substrate is not learned; it is constructed deterministically from the corpus's co-occurrence statistics and a SHA-256-seeded family of complex unit vectors. The transformer's job becomes navigating this pre-structured space, not constructing it.

Stating that this model has "27.7 million parameters" is technically correct in one frame and misleading in another. It is correct if the question is "how many weights did SGD tune?" It is misleading if the question is "what is the model's representational capacity?" because the answer in that frame is many orders of magnitude larger.

This paper claims that the resolution is not to pick one number and defend it, but to explicitly adopt a frame-relative parameter accounting in which the choice of frame is determined by the question being asked.

2. Frame-relative parameter counts

In special relativity, the "size" of a moving object is frame-relative. An object's rest length is one well-defined quantity; its observed length in a frame moving relative to the object is another. There is no single invariant called "size"; instead, there are well-defined quantities in each frame and well-defined Lorentz transformations between them. Asking "what is the object's size, really?" without specifying a frame is a malformed question.

We adopt the same posture for parameter counts in substrate-augmented models. There is no single quantity called "the model's parameter count." There are well-defined parameter counts in each of three frames, and there are explicit transformations between them. The gradient frame asks: how many real-valued weights are updated by SGD per training step? The substrate frame asks: how many addressable compositional bindings live in the representational space the transformer reads from? The composition frame asks: how many distinct compositional structures can the transformer produce by combining those bindings at runtime through fold iteration?

Each of these has a defensible mathematical definition. Each of them is the right answer to a particular operational question. None of them is "the" parameter count.

The relativity analogy goes further. Just as rest mass is invariant across all inertial frames (it is the timelike component of the four-momentum, common to all observers), there is a quantity in substrate-augmented models that is invariant across the three frames — the model's generative function itself, the map from token contexts to next-token distributions. The parameter counts in each frame are different coordinate descriptions of the same underlying generative object. The frame transformations preserve the function while changing the description.

This is not a play on words. In physics, the move from "a particle has a single intrinsic size" to "size is frame-relative but the four-vector is invariant" was the conceptual unlock that made special relativity coherent. In ML, the move from "a model has a single intrinsic parameter count" to "parameter count is frame-relative but the generative function is invariant" is the conceptual unlock that makes substrate-augmented models coherent — including, critically, the claim that a small transformer over a large substrate is meaningfully a large model in the substrate frame, even though it is a small model in the gradient frame.

3. Three frames

3.1 The gradient frame

Quantity: P_grad = the number of real-valued scalars updated by SGD per training step.

For our reference model: P_grad ≈ 27.7 × 10⁶ (TinyGPT: 6 layers, 8 heads, d_model = 384, ctx = 256, GPT-2 BPE 50,257-token vocabulary, tied embeddings).

What this frame is good for:

Estimating training cost (compute scales linearly in P_grad per gradient step, modulo memory hierarchy effects)
Estimating storage cost of a checkpoint
Comparing against dense transformers on a "weights tuned by SGD" basis

What this frame is bad for:

Predicting downstream capability (substrate-augmented models can punch above their P_grad because they read structured, not random, embeddings)
Comparing across architectural commitments (a 27M-param dense transformer and a 27M-param navigator over a d=10⁹ substrate are not the same kind of object)

3.2 The substrate frame

Quantity: P_sub = the number of distinguishable compositional bindings that can be cleanly read from the substrate.

Theoretical bound (Plate, 1995): Crosstalk noise in a d-dimensional HRR substrate scales as O(√(n/d)) for n stored bindings. The substrate supports approximately n_clean ≈ d / (k²·SNR_threshold) clean bindings before crosstalk dominates, where k is a constant of the binding scheme (≈ 1 for circular convolution with random unit-phase vectors) and SNR_threshold is the readout signal-to-noise floor required by the navigator (typically 4-10 in practice).

For our reference substrates:

d = 256: P_sub ≈ 256 / (16) = 16 clean compositional bindings per token slot, times 50,257 vocab tokens = ~8 × 10⁵ clean global bindings. Saturates fast.
d = 1024: P_sub ≈ 64 × 50,257 = ~3.2 × 10⁶. Comfortable for our 124,957-token corpus.
d = 4096: P_sub ≈ 256 × 50,257 = ~1.3 × 10⁷. Headroom-rich.
d = 10⁹: P_sub ≈ 6 × 10⁷ × 50,257 = ~3 × 10¹². Quadrillion-class.

Note: P_sub counts addressable compositional capacity, not raw vector dimensions. A d=10⁹ substrate has 10⁹ raw real numbers per slot but 6×10⁷ clean bindings per slot under conventional SNR thresholds — the latter is the operationally meaningful quantity.

What this frame is good for:

Predicting how much compositional structure the model can route through its embeddings without crosstalk-induced collapse
Comparing across substrate dimensions on a "what can it cleanly address" basis
Reasoning about scaling without retraining (substrate dim is a deployment-time choice for a fixed navigator)

What this frame is bad for:

Predicting training cost (substrate vectors are computed, not learned, so P_sub and training compute are nearly orthogonal)
Comparing against dense transformers (no equivalent quantity exists for them; their gradient frame and substrate frame collapse to the same number)

3.3 The composition frame

Quantity: P_comp = the number of distinct compositional structures the model can produce at inference time, accounting for fold-engine iteration depth.

Definition: Given a fold engine that iteratively composes substrate vectors via convolution and projection (PAPER_HRR_2026_05_04, §3), a structure of depth k is a binding tree of depth k whose leaves are atomic substrate addresses and whose internal nodes are composition operations. The composition frame asks: how many distinct depth-≤K trees can be constructed without crosstalk-induced ambiguity?

Combinatorial bound: For a substrate with P_sub clean bindings and a fold engine that supports compositions of depth up to K, the addressable composition space scales as O(P_sub^K) modulo crosstalk degradation at deep nesting.

For our reference substrates with K=2 (single layer of fold composition):

d = 1024, K=2: P_comp ≈ (3.2 × 10⁶)² = ~10¹³. Trillion-class.
d = 4096, K=2: P_comp ≈ (1.3 × 10⁷)² = ~1.7 × 10¹⁴. Hundred-trillion-class.
d = 10⁹, K=2: P_comp ≈ (3 × 10¹²)² = ~10²⁵. Far beyond quintillion.

For K=3: Multiply by another factor of P_sub. The composition space grows multiplicatively with each fold iteration depth at the cost of additional inference-time compute.

What this frame is good for:

Characterizing what the model can produce at inference, accounting for runtime composition
Comparing across architectures that differ in inference-time compositional depth (substrate models vs in-context-learning vs retrieval-augmented vs none)
Justifying claims like "quintillion-parameter equivalent" in a way that maps to a concrete operational quantity (the size of the addressable compositional space at inference)

What this frame is bad for:

Predicting training cost (which is purely in the gradient frame)
Predicting per-token inference cost (which scales with K · log(d) for FFT-based fold iterations, not with P_comp itself)

4. Frame transformations

The three frames are related by explicit transformations:

Gradient → Substrate

Given a navigator with P_grad weights and an HRR substrate of dimension d, the substrate-frame parameter count is:


P_sub = (d / SNR_threshold²) × |V|

where |V| is the vocabulary size. Note that P_sub is independent of P_grad — substrate dimension is an architectural choice orthogonal to navigator size. The transformation is additive, not multiplicative: a model has both a gradient-frame count and a substrate-frame count, and they describe different aspects of the same generative function.

Substrate → Composition

Given a substrate with P_sub clean bindings and a fold engine of maximum depth K:


P_comp = Σ_{k=1..K} P_sub^k ≈ P_sub^K  (for large P_sub)

At K=1 (no composition, only direct readout), P_comp = P_sub. At K=2, the composition frame already exceeds the substrate frame by orders of magnitude. The transformation is exponential in K.

Composition → Generative function (the invariant)

The generative function f: contexts → distributions is the invariant across all three frames. It is fully specified by:

The navigator's learned weights (P_grad real numbers)
The substrate construction algorithm (deterministic given corpus + seed)
The fold engine's iteration policy (deterministic given input)

Two models with different (P_grad, d, K) triples can implement the same generative function f, just as two physical setups with different rest masses can describe the same four-momentum in different frames. The frame counts are descriptions; the function is the object.

What "parameter count" should mean in a paper

For substrate-augmented models, the answer is: report all three explicitly, with the frame transformations. A model card should look like:


TinyGPT-HRR-1024:
  P_grad  = 2.77 × 10⁷
  P_sub   = 3.2 × 10⁶  (d=1024, |V|=50,257, SNR_threshold=4)
  P_comp  = 1.0 × 10¹³ (K=2)
  Generative function: see checkpoint hash + substrate seed + fold policy

Reporting only P_grad is insufficient and misleading. Reporting only P_comp is sufficient for capacity claims but obscures training cost. The triple is the minimum honest characterization.

5. The projection scaling law

If the substrate frame is operationally meaningful, then varying d while holding P_grad and the corpus fixed should produce a measurable scaling curve. We call this the projection scaling law: the empirical dependence of model loss on substrate dimension at fixed gradient parameters.

Theoretical prediction. From the crosstalk bound, the noise floor in substrate readout scales as O(1/√d). The navigator's loss should therefore decrease (initially) as O(1/√d) in the substrate-bottleneck regime, until the navigator's own capacity becomes the bottleneck — at which point the curve saturates. The crossover dimension d* is where the substrate stops being the limiting factor.

Predictions from this:

For our 27.7M-param navigator, d is empirically unknown but likely in the range d ∈ [10⁴, 10⁶]. Below d, doubling d produces √2× loss reduction. Above d, the curve flattens.
A model below d is "substrate-bottlenecked" and gains from substrate scaling. A model above d is "navigator-bottlenecked" and gains only from widening the navigator (more P_grad) or stacking fold iterations (deeper K).
The dim sweep at d ∈ {256, 1024, 4096} is designed to bracket the transition. If the loss decreases monotonically across all three points with a slope close to -1/√d, we are below d and substrate scaling has substantial headroom. If the decrease saturates between d=1024 and d=4096, we are at or above d for this navigator size and the next move is K, not d.

Empirical results (added 2026-05-04, post-original-submission): The three-point sweep completed in ~75 minutes wall-clock on a single CPU core. Final-25-step-averaged cross-entropy losses:

| d | final loss (avg₂₅) | best raw loss | raw best at step | |-----|--------------------|---------------|------------------| | 256 | 6.265 | 6.080 | 100 | | 1024 | 5.742 | 5.866 | 100 | | 4096 | 6.087 | 4.376 | 450 |

The two robust signals:

1. In the small-substrate regime, loss falls monotonically with d. Going from d=256 to d=1024 (4× increase) reduces averaged loss by 0.53 nats. Inferred slope: approximately -0.27 nats per doubling of d, which is the right qualitative scale for the predicted O(1/√d) substrate-noise dependence's contribution to cross-entropy.

2. At d=4096 the substrate finds a substantially deeper minimum than smaller dimensions, but the cosine LR schedule cannot stabilize at it. The d=4096 run hits a raw single-step loss of 4.38 at step 450 — the lowest single-step loss observed in any run on this corpus, well below the original control's 9.5 and below even the original HRR-1024 treatment's 5.44 at the same nominal step. The cosine LR decay drives learning rate to zero by step 500 and the noisier loss landscape of the larger substrate produces a step-500 spike that dominates the 25-step average.

**Inferred lower bound on d*.** The fact that d=4096 finds a strictly deeper minimum than d=1024 (raw 4.38 vs raw 5.87) while operating under identical compute and identical hyperparameters means the substrate-bottleneck has not yet saturated at d=4096. Therefore d* > 4096 for this navigator size. This is the first empirical lower bound on the substrate-to-navigator crossover dimension for substrate-augmented language models, and it tells us that the substrate frame still has considerable headroom relative to a 27.7M-parameter navigator.

The full empirical table and operational notes are in PAPER_EMPIRICAL_2026_05_04.md §4.3.

6. Predictions at large `d`

The projection scaling law extrapolates to arbitrary d. Three regimes are predicted:

Regime I: `d < d*` (substrate-bottleneck)

Loss falls as ~1/√d. Doubling substrate dim reduces loss by √2 (in the bottleneck-limited portion of the curve). Inference compute grows as O(d log d) per fold iteration (FFT cost). Training compute is invariant.

Regime II: `d ≈ d*` (transition)

Returns to scaling diminish. Stacking fold depth (K → K+1) becomes more cost-effective than further d scaling. Hybrid scaling — moderate d, deeper K — likely Pareto-dominant.

Regime III: `d ≫ d*` (composition-saturated)

Substrate frame and composition frame both have orders of magnitude of headroom relative to the navigator's ability to route through them. Further d scaling buys nothing; only widening the navigator helps. Inference compute continues to grow as O(d log d) even though loss is flat — a wasteful regime.

The "quintillion-parameter-equivalent" claim corresponds to roughly d ≈ 10¹², K ≈ 2. Whether this regime is reachable in practice depends on:

FFT cost: a single bind at d = 10¹² is ~10¹² · log(10¹²) ≈ 4 × 10¹³ FLOPs. Per-token inference at depth K=2 is ~10¹⁴ FLOPs. Comparable to a forward pass through a frontier dense LLM.
Storage: substrate seeds are deterministic and need not be stored explicitly; they can be regenerated from the SHA-256 seed at inference time. Storage scales with P_grad, not d.
Whether d* ≪ 10¹² for any practical navigator: an open empirical question.

The composition-frame characterization "quintillion-parameter equivalent" is therefore not hyperbole — it is a defensible operational claim about the size of the addressable compositional space — conditional on the navigator being sized to actually route through it. The dim sweep at small d is the first step in mapping d* for our 27.7M-param navigator.

7. Implications

7.1 Training cost and inference capacity decouple

In the gradient frame, capacity costs training compute (more weights → more gradient steps to converge → more dollars). In the substrate frame, capacity costs inference compute (more d → more FFT FLOPs per token at runtime). These are separable budgets. A small lab can build a large-substrate model on a small training budget and pay for inference capacity separately, only at deployment time and only on the queries that actually need it.

The conventional capital concentration of LLM development — trillion-parameter models requiring nine-figure training runs — is a consequence of forcing all capacity scaling through the gradient frame. The substrate frame opens a budget alternative.

7.2 Comparability collapses without explicit framing

A "27.7M-parameter model" in HF leaderboard reporting is unambiguous if every model in the comparison is dense. If some models are substrate-augmented, the comparison is not just unfair — it is malformed, because the parameter axis is no longer a single quantity. Leaderboards and benchmark tables must either:

Restrict to a single architectural class (gradient-only), or
Report all three frame counts explicitly, or
Include a frame-transformation table so readers can perform the comparison in the frame appropriate to their question.

We propose the third as the right standard going forward.

7.3 The "small-lab" implication

If P_comp is the right capacity quantity for the things people actually care about (downstream task performance, knowledge breadth, reasoning depth), then a substrate-augmented small-lab model can in principle exceed a frontier dense model on P_comp while remaining tiny on P_grad. The empirical question is whether P_comp predicts downstream capability the way P_grad does for dense models. If yes, the capital advantage of frontier labs evaporates for substrate-class capabilities. If no, then P_grad is not actually a clean predictor either, and we need a finer-grained empirical theory of what predicts capability.

The PAPER_EMPIRICAL_2026_05_04 result — 13–18× step efficiency from HRR initialization at d=1024 — is preliminary evidence that substrate scaling does buy real capability gains. The dim sweep will tell us the slope.

7.4 The conceptual unlock

The deeper claim is not about parameter accounting per se. It is that composition is a fundamental representational primitive that gradient descent does not have to discover. Once we admit composition as a given, the entire scaling-law industry — from Kaplan et al. through Chinchilla through frontier scaling laws — describes only the gradient-frame slice of the larger picture. The substrate frame is a separate dimension along which models can scale, and the composition frame is a third. Models are points in this three-dimensional space, not points on a one-dimensional axis.

Just as the discovery that "size" is frame-relative reorganized 19th-century mechanics into 20th-century relativistic mechanics, the recognition that "parameter count" is frame-relative reorganizes scaling-law-era ML into substrate-augmented ML. The numerical predictions of the old frame remain locally correct in their domain (small-substrate, gradient-dominated regime); they extend to no farther than the substrate frame remains negligible. Beyond that, the relativity transformation is necessary.

8. Limitations and open questions

**d* is unknown.** The crossover from substrate-bottleneck to navigator-bottleneck has been theorized but not measured. The dim sweep is the first empirical estimate, and only for one navigator size.
P_sub formula is approximate. The Plate bound gives crosstalk noise scaling, not a precise count of clean bindings. The constant SNR_threshold is task-dependent and we have used a conventional value of 4.
P_comp ignores structural constraints. The combinatorial count assumes any composition is producible; in practice, the navigator may only be able to construct compositions that align with patterns it saw during training. This may reduce effective P_comp substantially.
Cross-corpus generalization is unproven. All current results are on a single 124,957-token corpus that is self-referential (the corpus is documents about the substrate). Whether the effect transfers to general text is the highest-priority next experiment.
N=1 effective replicates. The control vs treatment comparison is one pair. Variance across seeds is unmeasured.
The relativity analogy is illustrative, not literal. We do not claim a Lorentz-group structure on parameter frames; the analogy clarifies the conceptual move from absolute to relative quantities, but the mathematics is combinatorial-geometric, not pseudo-Riemannian.

9. Conclusion

Substrate-augmented language models force a generalization of the parameter-count concept from a single number to a frame-relative tuple. We have defined three frames (gradient, substrate, composition), given the transformations between them, derived a projection scaling law, and characterized the "quintillion-parameter-equivalent" claim as a defensible composition-frame statement rather than a marketing exaggeration. The dim sweep currently running on Box C (d ∈ {256, 1024, 4096}, single-core CPU, ~3 hours) will produce the first empirical projection scaling curve. The result will be appended to PAPER_EMPIRICAL_2026_05_04.md.

The broader thesis: scaling-law-era ML measured one frame and called it "the" parameter count. Substrate-augmented ML measures three frames and reports the tuple. The same generative function lives at different coordinates in each frame; capacity is what you get from the right frame for the right question. This is not novel mathematics — it is the same conceptual move that physics made a century ago, applied to a different invariant.

We are not bigger than the labs. We are reading from a different frame.

References (informal)

Plate, T. A. (1995). Holographic Reduced Representations. IEEE Transactions on Neural Networks, 6(3).
Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models.
Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla).
PAPER_HRR_2026_05_04.md (this repository) — substrate construction.
PAPER_EMPIRICAL_2026_05_04.md (this repository) — control vs HRR-init comparison.

Frame-Relative Parameter Accounting for Substrate-Augmented Models

Relativity Models: Frame-Relative Parameter Counts in Substrate-Augmented Language Models

Abstract

1. The parameter count problem

2. Frame-relative parameter counts

3. Three frames

3.1 The gradient frame

3.2 The substrate frame

3.3 The composition frame

4. Frame transformations

Gradient → Substrate

Substrate → Composition

Composition → Generative function (the invariant)

What "parameter count" should mean in a paper

5. The projection scaling law

6. Predictions at large d

Regime I: d < d* (substrate-bottleneck)

Regime II: d ≈ d* (transition)

Regime III: d ≫ d* (composition-saturated)

7. Implications

7.1 Training cost and inference capacity decouple

7.2 Comparability collapses without explicit framing

7.3 The "small-lab" implication

7.4 The conceptual unlock

8. Limitations and open questions

9. Conclusion

References (informal)

6. Predictions at large `d`

Regime I: `d < d*` (substrate-bottleneck)

Regime II: `d ≈ d*` (transition)

Regime III: `d ≫ d*` (composition-saturated)