
Scaling laws in language models

Scaling laws quantify how neural language model performance improves as a function of model size (parameters), training data (tokens), and compute (FLOPs). Empirical scaling laws have shaped much of the recent progress in deep learning and have guided billion-dollar investments in larger models.

Standard scaling laws

In most NLP tasks—language modeling, machine translation, question-answering—performance improves predictably with scale:

  • Larger models achieve lower perplexity
  • More training data improves downstream task performance
  • Loss follows a power law in model size: loss ∝ N^(-α), where N is the number of parameters and α ≈ 0.07–0.1

These relationships suggest that scaling is a reliable path to capability improvement.
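
To make the power-law form loss ∝ N^(-α) concrete, here is a minimal sketch of how α could be estimated from (model size, loss) pairs by fitting a straight line in log-log space. The function name `fit_power_law`, the constant `c`, and the data points are assumptions made for illustration. The data is generated from a known power law with α = 0.08, not taken from real measurements.

```python
import math

def fit_power_law(n_params, losses):
    """Fit loss ≈ c * N**(-alpha) by ordinary least squares on (log N, log loss)."""
    xs = [math.log(n) for n in n_params]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope in log-log space equals -alpha
    intercept = mean_y - slope * mean_x    # intercept equals log(c)
    return -slope, math.exp(intercept)

# Synthetic, illustrative data: generated from a known power law, not measured.
n_params = [1e7, 1e8, 1e9, 1e10]
losses = [10.0 * n ** -0.08 for n in n_params]

alpha, c = fit_power_law(n_params, losses)

# Extrapolate one order of magnitude beyond the fitted range.
predicted_loss = c * 1e11 ** (-alpha)
```

With α in this range, doubling the parameter count multiplies the loss by 2^(-α), a reduction of only about 5% for α = 0.08. Real loss curves are noisy and often include an irreducible-loss term, so a fit like this is a first approximation rather than a reliable forecast.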

Inverse scaling phenomena

However, certain tasks and objectives exhibit inverse scaling—counterintuitively, larger models perform worse:

Truthfulness:

  • TruthfulQA demonstrates that larger GPT-3 and GPT-Neo models are less truthful than smaller variants in the same family
  • Larger models generate more plausible-sounding falsehoods
  • Hypothesized mechanism: larger models learn human misconceptions from training data more thoroughly

Other inverse scaling examples:

  • Some reasoning tasks show inverse scaling under few-shot prompting
  • Certain safety properties (e.g., refusing harmful requests) sometimes degrade with scale
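
One simple way to flag inverse scaling in evaluation results is to check the sign of the trend of a higher-is-better score against log model size. The sketch below assumes that approach. The names `scaling_slope` and `classify_trend`, the `tolerance` parameter, and all accuracy values are hypothetical and made up for illustration; they are not results from TruthfulQA or any other benchmark.

```python
import math

def scaling_slope(n_params, scores):
    """Least-squares slope of score against log10(parameter count)."""
    xs = [math.log10(n) for n in n_params]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    return sxy / sxx

def classify_trend(n_params, scores, tolerance=0.0):
    """Label a higher-is-better metric as 'standard', 'inverse', or 'flat'."""
    slope = scaling_slope(n_params, scores)
    if slope < -tolerance:
        return "inverse"
    if slope > tolerance:
        return "standard"
    return "flat"

# Made-up accuracies for four hypothetical model sizes in one family.
sizes = [1e8, 1e9, 1e10, 1e11]
standard_task = [0.40, 0.50, 0.60, 0.70]   # improves with scale
inverse_task = [0.35, 0.30, 0.26, 0.21]    # degrades with scale

standard_label = classify_trend(sizes, standard_task)
inverse_label = classify_trend(sizes, inverse_task)
```

A single linear slope cannot capture U-shaped or other non-monotonic trends, and with only a few model sizes the sign of the slope can be driven by noise, so a check like this is a screening step rather than evidence of a robust effect.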

Implications

For capability prediction: Standard scaling laws cannot be blindly extrapolated; certain properties must be explicitly optimized rather than assumed to improve with scale.

For safety: Larger models may require active alignment interventions (RLHF, constitutional AI) to maintain desirable behaviors that don't automatically improve with scale.

For resource allocation: Not all problems are "solved by more scale"—some require architectural or training objective changes.