Scaling laws in language models
Scaling laws describe how the performance of neural language models changes as a function of model size (parameters), training data (tokens), and compute (FLOPs). Empirical scaling laws have been central to the deep learning progress of the past decade and have guided billion-dollar investments in larger models.
Standard scaling laws
On most NLP tasks (language modeling, machine translation, question answering), performance improves predictably with scale:
- Larger models achieve lower perplexity
- More training data improves downstream task performance
- Loss follows approximate power-law relationships: loss ∝ N^(-α), where N is the number of parameters and α ≈ 0.07–0.1
These relationships suggest that scaling is a reliable path to capability improvement.
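As a minimal sketch of how an exponent like α could be estimated, the snippet below fits loss = c · N^(-α) by ordinary least squares in log-log space. The data points are synthetic (generated from an assumed α = 0.08 and c = 10, not measurements), and `fit_power_law` is a hypothetical helper name rather than a function from any library.

```python
import math

def fit_power_law(n_params, losses):
    """Fit loss = c * N**(-alpha) by least squares on (log N, log loss).

    Returns (alpha, c). Hypothetical helper for illustration only.
    """
    xs = [math.log(n) for n in n_params]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx                    # slope of the log-log line is -alpha
    intercept = mean_y - slope * mean_x  # intercept is log(c)
    return -slope, math.exp(intercept)

# Synthetic data from an assumed power law (NOT real measurements).
assumed_alpha, assumed_c = 0.08, 10.0
n_params = [1e7, 1e8, 1e9, 1e10]
losses = [assumed_c * n ** (-assumed_alpha) for n in n_params]

alpha, c = fit_power_law(n_params, losses)

# Factor by which the fitted loss changes when N is doubled.
doubling_ratio = 2.0 ** (-alpha)
print(f"alpha={alpha:.3f}, c={c:.2f}, loss ratio per doubling={doubling_ratio:.3f}")
```

With α in the 0.07–0.1 range, doubling the parameter count multiplies the loss by 2^(-α), roughly 0.93 to 0.95, so large increases in scale are needed for a visible reduction in loss. Real training curves are noisy and often include an irreducible-loss term, which this simple fit ignores.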
Inverse scaling phenomena
However, certain tasks and objectives exhibit inverse scaling, where larger models counterintuitively perform worse:
Truthfulness:
- On the TruthfulQA benchmark, larger GPT-3 and GPT-Neo models are less truthful than smaller variants in the same family
- Larger models generate more plausible-sounding falsehoods
- Hypothesized mechanism: larger models learn human misconceptions from training data more thoroughly
Other inverse scaling examples:
- Some reasoning tasks show inverse scaling under few-shot prompting
- Certain safety properties (e.g., refusing harmful requests) sometimes degrade with scale
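One rough way to flag the kind of trend listed above is to regress a higher-is-better metric on the logarithm of model size and look at the sign of the slope. The sketch below does this with made-up numbers; the helper name `scaling_trend`, the tolerance, and the sample scores are all assumptions for illustration, not results from TruthfulQA or any other benchmark.

```python
import math

def scaling_trend(sizes, scores, tol=1e-9):
    """Least-squares slope of score against log10(model size).

    Assumes `scores` is a higher-is-better metric. Returns (slope, label),
    where label is "inverse", "standard", or "flat". Hypothetical helper.
    """
    xs = [math.log10(s) for s in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    if slope < -tol:
        label = "inverse"    # metric falls as models grow
    elif slope > tol:
        label = "standard"   # metric rises as models grow
    else:
        label = "flat"
    return slope, label

# Made-up scores for four model sizes in one family (illustrative only).
sizes = [1e8, 1e9, 1e10, 1e11]
truthful_fraction = [0.40, 0.35, 0.30, 0.25]

slope, label = scaling_trend(sizes, truthful_fraction)
print(f"slope per 10x parameters: {slope:+.3f} ({label})")
```

A single slope is a crude summary: with only a handful of model sizes it is sensitive to noise, and it cannot capture U-shaped trends where performance drops and then recovers at larger scale.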
Implications
For capability prediction: Standard scaling laws cannot be blindly extrapolated; certain properties must be explicitly optimized rather than assumed to improve with scale.
For safety: Larger models may require active alignment interventions (such as RLHF or Constitutional AI) to maintain desirable behaviors that do not automatically improve with scale.
For resource allocation: Not all problems are "solved by more scale"; some require changes to the architecture or the training objective.
Related topics
- Language model truthfulness — truthfulness exhibits inverse scaling
- Neural language models — foundational scaling studies