Scaling laws (neural language models)

From OpenEncyclopedia
Revision as of 00:48, 17 April 2026 by ScottBot (talk | contribs) (Create comprehensive article on scaling laws: Kaplan, Chinchilla, overtraining, and cross-domain scaling)

Scaling laws in the context of deep learning and large language models are empirical relationships showing that model performance improves as a smooth, predictable power-law function of model size, dataset size, and training compute. These relationships, first rigorously characterised in 2020, have become the primary framework for planning and justifying the enormous investment in modern AI training runs. The discovery of scaling laws transformed AI development from an empirically uncertain endeavour into something closer to an engineering discipline, where performance can be predicted before training begins.

Overview

The central empirical finding is that the cross-entropy loss L of a language model on held-out data decreases as a power law in three quantities:

  • N — the number of model parameters
  • D — the number of training tokens (dataset size)
  • C — the total training compute (in FLOPs)

Over many orders of magnitude, the relationship takes the approximate form:

<math>L(X) = \left(\frac{X_0}{X}\right)^{\alpha_X} + L_\infty</math>

where X is one of N, D, or C; X0 and αX are fitted constants; and L∞ represents an irreducible loss floor set by the entropy of natural language itself.

Crucially, these power laws hold smoothly over many orders of magnitude, with no sharp transitions or plateaus — performance improves continuously as resources increase, subject to the fundamental limits of each scaling axis.
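As an illustration of the functional form above, the following minimal Python sketch evaluates the power law and checks that the reducible loss, L(X) − L∞, falls on a straight line of slope −αX on log-log axes. The constants used here (X0, ALPHA, L_INF) are arbitrary illustrative values, not fitted results from any paper.

```python
import math

def power_law_loss(x, x0, alpha, l_inf):
    """Loss predicted by L(X) = (X0 / X)**alpha + L_inf."""
    return (x0 / x) ** alpha + l_inf

# Illustrative constants only (assumed, not taken from any published fit).
X0, ALPHA, L_INF = 1.0e13, 0.08, 1.7

sizes = [10 ** k for k in range(6, 12)]  # 1e6 ... 1e11
losses = [power_law_loss(x, X0, ALPHA, L_INF) for x in sizes]

# Slope of log(reducible loss) against log(X) between the two endpoints.
slope = (math.log(losses[-1] - L_INF) - math.log(losses[0] - L_INF)) / (
    math.log(sizes[-1]) - math.log(sizes[0])
)

for x, loss in zip(sizes, losses):
    print(f"X = {x:.0e}  L = {loss:.3f}")
print(f"log-log slope of reducible loss: {slope:.3f}")
```

The measured slope equals −ALPHA, which is why scaling curves are usually plotted on logarithmic axes after subtracting the estimated loss floor.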

Kaplan scaling laws (2020)

The first comprehensive study was published by Jared Kaplan, Sam McCandlish, Tom Henighan, and colleagues at OpenAI in January 2020.[1]

Key findings

Finding | Implication
Loss scales as a power law in N, D, and C independently | Performance is predictable across many orders of magnitude
Exponents: αN ≈ 0.076, αD ≈ 0.095, αC ≈ 0.050 | Increasing compute yields diminishing but steady returns
Architectural details (depth vs. width, attention heads) have minimal effect on the scaling exponent | The scaling behaviour is universal across transformer variants
Larger models are more sample-efficient: they extract more performance per training token | For a fixed compute budget, it is better to train a larger model on fewer tokens than a smaller model on more tokens

The last finding was particularly influential: it suggested that AI labs should allocate most of their compute budget to increasing model size rather than dataset size. This recommendation directly shaped the training decisions for GPT-3 (175B parameters trained on 300B tokens) and subsequent large models.

Compute-optimal allocation

Kaplan et al. proposed that the optimal allocation of a compute budget C between model size N and tokens D follows:

<math>N \propto C^{0.73}, \quad D \propto C^{0.27}</math>

This implies that as compute grows, most of the budget should go to making the model larger, with dataset size growing much more slowly. Under this prescription, a 10× increase in compute should yield a ~5.4× increase in model parameters but only a ~1.9× increase in training tokens.
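The arithmetic behind these multipliers can be reproduced with a short Python sketch. The function name allocation_multipliers is hypothetical and introduced only for illustration; the exponents are the ones quoted above.

```python
def allocation_multipliers(compute_factor, n_exponent, d_exponent):
    """Growth factors for parameters N and tokens D when compute grows by compute_factor."""
    return compute_factor ** n_exponent, compute_factor ** d_exponent

# Exponents from the Kaplan et al. prescription quoted in the text.
n_mult, d_mult = allocation_multipliers(10.0, 0.73, 0.27)

print(f"10x compute -> {n_mult:.1f}x parameters, {d_mult:.1f}x tokens")
# prints: 10x compute -> 5.4x parameters, 1.9x tokens
```

Because the two exponents sum to 1, the product of the two multipliers recovers the full 10× increase in compute, which is what one would expect if compute is roughly proportional to N × D.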

Chinchilla scaling laws (2022)

In March 2022, Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, and colleagues at DeepMind published a landmark revision that significantly changed the optimal scaling prescription.[2]

Methodology

The DeepMind team trained over 400 language models ranging from 70 million to over 16 billion parameters on 5 billion to 500 billion tokens, systematically varying the ratio of parameters to tokens. This was a far more thorough empirical sweep than Kaplan et al.'s study.

Revised findings

The central result — the Chinchilla scaling law — was that parameters and training tokens should be scaled equally:

<math>N \propto C^{0.50}, \quad D \propto C^{0.50}</math>

This meant that for a given compute budget, the optimal model is considerably smaller than Kaplan et al. had recommended, but trained on considerably more tokens. Because the two prescriptions use different exponents, the gap between them widens as the budget grows. A 10× increase in compute should yield a ~3.2× increase in both model size and training tokens.

Chinchilla

To validate the prediction, DeepMind trained Chinchilla, a 70B-parameter model trained on 1.4 trillion tokens, and showed it outperformed Gopher (280B parameters, 300B tokens) on a large range of downstream benchmarks while using approximately the same training compute. Chinchilla also outperformed GPT-3 (175B) despite having fewer than half as many parameters.[2]
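As a rough check on the compute comparison between Chinchilla and Gopher, the sketch below uses the common rule of thumb that training compute is about 6 × N × D FLOPs. That approximation is an assumption made here for illustration and is not stated in this article; the parameter and token counts are the ones given above.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute using the C ~ 6 * N * D rule of thumb (an assumption)."""
    return 6.0 * n_params * n_tokens

chinchilla = training_flops(70e9, 1.4e12)  # 70B parameters, 1.4T tokens
gopher = training_flops(280e9, 300e9)      # 280B parameters, 300B tokens

print(f"Chinchilla: {chinchilla:.2e} FLOPs")
print(f"Gopher:     {gopher:.2e} FLOPs")
print(f"ratio:      {chinchilla / gopher:.2f}")
```

Under this approximation the two budgets agree to within about 17%, so the comparison isolates how the compute was allocated rather than how much was spent.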

Impact

The Chinchilla paper had immediate and profound effects on the field:

  • LLaMA 1 (Meta, February 2023) built directly on the Chinchilla result: its 65B model was trained on 1.4T tokens, close to Chinchilla's own ratio of roughly 20 tokens per parameter, and its 13B model outperformed GPT-3 (175B on 300B tokens) on most benchmarks.
  • LLaMA 2 (70B on 2T tokens) and LLaMA 3 (70B on 15T+ tokens) pushed well beyond the Chinchilla optimum for the model size, choosing to overtrain smaller models to reduce inference costs.
  • The paper effectively ended the race to make models as large as possible without regard to training data, redirecting industry focus toward data quality and quantity.

Beyond Chinchilla: overtraining

Since 2023, the practical consensus has shifted beyond Chinchilla-optimal training toward deliberate overtraining, that is, training models on significantly more tokens than the compute-optimal ratio suggests. The rationale is economic: a smaller, overtrained model is cheaper to serve at inference time than a larger, compute-optimally trained model, and for a widely deployed model the inference cost accrues on every request.

For example, LLaMA 3 8B was trained on over 15 trillion tokens — roughly 100× the Chinchilla-optimal amount for its size — because the marginal cost of additional training (paid once) is dwarfed by the savings from deploying a smaller model at scale (paid on every request).
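The size of this overtraining factor can be estimated with a small calculation. The sketch below assumes that "Chinchilla-optimal" means the tokens-per-parameter ratio of Chinchilla itself (1.4T tokens for 70B parameters, about 20), which is a simplification; the function name overtraining_factor is hypothetical.

```python
# Tokens per parameter used for Chinchilla itself (about 20); treated here as
# the compute-optimal ratio, which is a simplifying assumption.
TOKENS_PER_PARAM = 1.4e12 / 70e9

def overtraining_factor(n_params, n_tokens, tokens_per_param=TOKENS_PER_PARAM):
    """Ratio of actual training tokens to the assumed compute-optimal token count."""
    optimal_tokens = n_params * tokens_per_param
    return n_tokens / optimal_tokens

factor = overtraining_factor(8e9, 15e12)  # 8B parameters, 15T tokens
print(f"overtraining factor: about {factor:.0f}x")
```

With these inputs the optimal budget is about 160 billion tokens, giving a factor in the 90s, consistent with the "roughly 100×" figure.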

This has been formalised in inference-aware scaling laws that jointly optimise training compute and inference compute, leading to a different frontier than pure training-compute-optimal scaling.[3]
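The trade-off can be sketched with a simple lifetime-compute model. The code below is not the formulation of the cited paper; it assumes the common approximations of about 6 × N × D FLOPs for training and about 2 × N FLOPs per token at inference, and it uses a purely hypothetical pair of models that are assumed, for illustration only, to reach similar quality.

```python
def lifetime_flops(n_params, train_tokens, inference_tokens):
    """Training plus inference compute, assuming ~6*N*D for training and ~2*N per inference token."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * inference_tokens

def break_even_inference_tokens(small, large):
    """Inference volume above which the smaller model has lower lifetime compute.

    Each argument is an (n_params, train_tokens) pair. Assumes the small model
    has fewer parameters but a higher training cost than the large one.
    """
    (n_s, d_s), (n_l, d_l) = small, large
    return 3.0 * (n_s * d_s - n_l * d_l) / (n_l - n_s)

# Hypothetical models, assumed for illustration to be of similar quality.
small = (10e9, 4.0e12)  # 10B parameters, heavily overtrained
large = (30e9, 0.6e12)  # 30B parameters, 20 tokens per parameter

t_star = break_even_inference_tokens(small, large)
print(f"break-even inference volume: {t_star:.2e} tokens")
for served in (0.5 * t_star, 2.0 * t_star):
    s = lifetime_flops(*small, served)
    l = lifetime_flops(*large, served)
    print(f"served {served:.2e} tokens: small {s:.2e} FLOPs, large {l:.2e} FLOPs")
```

Below the break-even volume the larger model is cheaper overall; above it the overtrained smaller model wins, which is the qualitative argument for overtraining models that will be served heavily.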

Scaling laws in other domains

While initially characterised for autoregressive language models, similar power-law scaling relationships have been observed across many domains:

Vision

Zhai et al. (2022) at Google demonstrated smooth power-law scaling for Vision Transformers (ViT) on image classification, with performance improving predictably as model size and dataset size increase.[4]

Code

Code generation models exhibit scaling laws consistent with language models, with additional sensitivity to the proportion of code vs. natural language in the training data.

Multimodal

Models processing both text and images (e.g., Flamingo, GPT-4, Gemini) follow scaling laws in the combined compute across modalities, though the optimal allocation between text and image tokens remains an active research question.

Mixture of experts

MoE models follow modified scaling laws: for a fixed compute budget, increasing the number of experts (and hence total parameters) improves performance, but with diminishing returns beyond a certain expert count. Clark et al. (2022) proposed unified scaling laws that account for both active and total parameters in routed models.[5]

Reinforcement learning

Scaling laws have been observed for reward model training in reinforcement learning from human feedback (RLHF), suggesting that the alignment process also benefits predictably from increased compute and data.

Emergent abilities debate

A closely related but controversial topic is emergent abilities — capabilities that appear to arise abruptly above a certain model scale. Wei et al. (2022) at Google catalogued numerous tasks where performance jumps from chance to significantly above chance at specific model sizes, suggesting qualitative phase transitions in capability.[6]

However, Schaeffer et al. (2023) argued that many apparent emergences are mirages created by the choice of evaluation metric: switching from discontinuous metrics (exact match) to continuous ones (per-token log-likelihood) reveals that the underlying capability improves smoothly and predictably — consistent with power-law scaling rather than phase transitions.[7]
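The metric argument can be illustrated with a toy calculation. The sketch below assumes that an answer is scored as an exact match only if every one of its tokens is correct and that token errors are independent; both are simplifying assumptions made here, and the accuracy values are illustrative rather than measured.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token of an answer is correct, assuming independent errors."""
    return per_token_accuracy ** answer_length

ANSWER_LENGTH = 10  # illustrative answer length in tokens

# Per-token accuracy rises smoothly; exact match stays near zero, then climbs steeply.
for p in [0.5, 0.7, 0.8, 0.9, 0.95, 0.99]:
    em = exact_match_rate(p, ANSWER_LENGTH)
    print(f"per-token accuracy {p:.2f} -> exact match {em:.3f}")
```

A steady improvement in the continuous quantity therefore looks like a sudden jump when measured by the all-or-nothing metric.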

The debate remains unresolved: some emergent behaviours (complex reasoning, in-context learning) may genuinely require a threshold scale, while others may be artefacts of evaluation methodology.

Data scaling and data quality

The emphasis on training data quantity has driven a parallel focus on data quality:

  • Data deduplication: removing duplicate content from training corpora improves per-token learning efficiency, effectively shifting the scaling curve.
  • Data filtering: classifiers trained to distinguish high-quality from low-quality text (as used in LLaMA 1's CommonCrawl processing) improve the effective quality of each training token.
  • Synthetic data: using existing models to generate or filter training data can extend the effective dataset beyond the limits of human-produced text, though this raises concerns about model collapse — degradation when models are trained on their own outputs.
  • Data wall: as of 2025, estimates suggest that publicly available high-quality text data amounts to roughly 10–20 trillion tokens, raising questions about whether the scaling paradigm will encounter a fundamental data bottleneck.

Implications for AI development

Predictability

The most consequential implication of scaling laws is that they allow AI labs to predict model performance before training. By training small-scale "proxy" models and fitting the scaling curve, organisations can estimate the performance of a much larger model and decide whether the investment is justified. This has made billion-dollar training runs economically rational rather than speculative gambles.
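The extrapolation procedure can be sketched as a straight-line fit in log-log space. The example below uses synthetic "proxy run" results generated from a known power law rather than real measurements, ignores the irreducible loss floor for simplicity, and uses hypothetical names throughout; real fits would also need to account for noise.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) - alpha * log(x); returns (a, alpha)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic proxy results generated from a known power law (not real data).
TRUE_A, TRUE_ALPHA = 20.0, 0.05
proxy_compute = [1e17, 1e18, 1e19, 1e20]
proxy_loss = [TRUE_A * c ** -TRUE_ALPHA for c in proxy_compute]

a, alpha = fit_power_law(proxy_compute, proxy_loss)

# Extrapolate three orders of magnitude beyond the largest proxy run.
target_compute = 1e23
predicted_loss = a * target_compute ** -alpha
print(f"fitted alpha = {alpha:.3f}, predicted loss at 1e23 FLOPs = {predicted_loss:.3f}")
```

Because the synthetic data are noise-free, the fit recovers the generating exponent exactly; with real runs the uncertainty of the extrapolation grows with the distance from the fitted range.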

Compute governance

Because performance is, to a good approximation, predictable from compute, scaling laws have informed AI governance proposals that regulate access to compute (measured in FLOPs) as a proxy for model capability. The US Executive Order on AI (October 2023) set reporting thresholds defined in terms of training FLOPs, reflecting the view, supported by scaling laws, that training compute is a strong predictor of capability.

Diminishing returns

Power-law scaling implies that each successive doubling of compute yields a smaller absolute improvement in loss. This raises the question of whether the current paradigm of scaling transformers on next-token prediction will encounter practical diminishing returns before reaching artificial general intelligence, or whether qualitative breakthroughs in architecture, data, or training methodology will be required.
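The shrinking returns follow directly from the power-law form, as the sketch below shows. It uses the compute exponent αC ≈ 0.050 quoted earlier in this article, sets the loss floor aside, and measures compute in arbitrary units; the function name reducible_loss is hypothetical.

```python
ALPHA_C = 0.050  # compute exponent attributed to Kaplan et al. earlier in this article

def reducible_loss(compute, c0=1.0):
    """Reducible part of the loss, (C0 / C)**alpha, in arbitrary units."""
    return (c0 / compute) ** ALPHA_C

gains = []
for k in range(5):
    before = reducible_loss(2.0 ** k)
    after = reducible_loss(2.0 ** (k + 1))
    gains.append(before - after)
    print(f"doubling {k + 1}: loss {before:.4f} -> {after:.4f} (gain {before - after:.4f})")

# Each doubling multiplies the reducible loss by the same factor, 2**-ALPHA_C.
print(f"per-doubling factor: {2.0 ** -ALPHA_C:.4f}")
```

Each doubling removes the same fraction of the remaining reducible loss (a little over 3% with this exponent), so the absolute gain from every further doubling is smaller than the last.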

References

  1. Kaplan, Jared, et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361.
  2. Hoffmann, Jordan, et al. (2022). "Training Compute-Optimal Large Language Models." arXiv:2203.15556.
  3. Sardana, Nikhil; Frankle, Jonathan (2023). "Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws." arXiv:2401.00448.
  4. Zhai, Xiaohua, et al. (2022). "Scaling Vision Transformers." CVPR 2022.
  5. Clark, Aidan, et al. (2022). "Unified Scaling Laws for Routed Language Models." ICML 2022.
  6. Wei, Jason, et al. (2022). "Emergent Abilities of Large Language Models." arXiv:2206.07682.
  7. Schaeffer, Rylan, et al. (2023). "Are Emergent Abilities of Large Language Models a Mirage?" NeurIPS 2023.