Scaling laws in language models¶

Scaling laws quantify how neural language model performance improves as a function of model size (parameters), training data (tokens), and compute (FLOPs). Empirical scaling laws have been central to the deep learning progress of the past decade, guiding billion-dollar investments in larger models.

Standard scaling laws¶

In most NLP tasks—language modeling, machine translation, question-answering—performance improves predictably with scale:

- Larger models achieve lower perplexity.
- More training data improves downstream task performance.
- Loss follows a power law in model size: loss ∝ N^(-α), where N is the number of model parameters and α is an empirically fitted exponent, roughly 0.07–0.1.

These relationships suggest that scaling is a reliable path to capability improvement.
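The power-law form above becomes a straight line in log-log space, so the exponent α can be estimated with an ordinary least-squares fit. The following is a minimal sketch, not a reference implementation: the function name `fit_power_law` and the data points are hypothetical and synthetic, generated from an assumed α of 0.076 purely so the fit can be checked.

```python
import math


def fit_power_law(n_params, losses):
    """Fit loss ≈ c * N^(-alpha) by least squares on (log N, log loss).

    Returns (alpha, c). Illustrative helper, not part of any library.
    """
    xs = [math.log(n) for n in n_params]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope of the log-log line is -alpha
    intercept = mean_y - slope * mean_x    # intercept is log(c)
    return -slope, math.exp(intercept)


# Synthetic data (assumption): exact power law with alpha = 0.076, c = 20.
true_alpha, true_c = 0.076, 20.0
n_params = [1e7, 1e8, 1e9, 1e10]
losses = [true_c * n ** (-true_alpha) for n in n_params]

alpha, c = fit_power_law(n_params, losses)

# Under this form, doubling N multiplies the loss by 2^(-alpha).
doubling_factor = 2.0 ** (-alpha)
print(f"alpha={alpha:.3f}, c={c:.2f}, loss factor per doubling={doubling_factor:.3f}")
```

With an exponent this small, each doubling of parameter count reduces the loss by only about five percent, which is why gains are usually discussed across orders of magnitude in scale. Real measurements are noisy, so a fit on actual training runs would not recover the exponent exactly.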

Inverse scaling phenomena¶

However, certain tasks and objectives exhibit inverse scaling, in which larger models counterintuitively perform worse:

Truthfulness:
- TruthfulQA demonstrates that larger GPT-3 and GPT-Neo models are less truthful than smaller variants in the same family
- Larger models generate more plausible-sounding falsehoods
- Hypothesized mechanism: larger models learn human misconceptions from training data more thoroughly

Other inverse scaling examples:
- Some reasoning tasks show inverse scaling under few-shot prompting
- Certain safety properties (e.g., refusing harmful requests) sometimes degrade with scale
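One simple way to flag inverse scaling in an evaluation is to regress a task score against the logarithm of parameter count and inspect the sign of the slope. The sketch below assumes this slope-based criterion; the function names, the tolerance, and all scores are hypothetical illustrations, not results from TruthfulQA or any other benchmark.

```python
import math


def scaling_slope(n_params, scores):
    """Least-squares slope of score against log10(parameter count)."""
    xs = [math.log10(n) for n in n_params]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    return sxy / sxx


def classify_trend(n_params, scores, tolerance=0.01):
    """Label a trend 'standard', 'inverse', or 'flat'.

    `tolerance` is an assumed threshold on score change per 10x parameters.
    """
    slope = scaling_slope(n_params, scores)
    if slope > tolerance:
        return "standard"
    if slope < -tolerance:
        return "inverse"
    return "flat"


# Made-up scores for four model sizes in one hypothetical family.
sizes = [1e8, 1e9, 1e10, 1e11]
task_a_accuracy = [0.40, 0.52, 0.61, 0.73]   # improves with scale
task_b_accuracy = [0.33, 0.30, 0.26, 0.21]   # degrades with scale

print(classify_trend(sizes, task_a_accuracy))
print(classify_trend(sizes, task_b_accuracy))
```

A single linear slope misses non-monotonic trends, such as scores that fall and then recover at larger sizes, so the per-size scores are worth inspecting alongside the label.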

Implications¶

For capability prediction: Standard scaling laws cannot be blindly extrapolated; certain properties must be explicitly optimized rather than assumed to improve with scale.

For safety: Larger models may require active alignment interventions (RLHF, constitutional AI) to maintain desirable behaviors that don't automatically improve with scale.

For resource allocation: Not all problems are "solved by more scale"—some require architectural or training objective changes.

Related topics¶

Language model truthfulness: truthfulness exhibits inverse scaling
Neural language models: foundational scaling studies
