Neural Scaling Laws

Language-model loss falls as a power law in model parameters N, dataset tokens D, and training compute C. Kaplan et al. (2020) and Hoffmann et al. (2022, "Chinchilla") charted these laws empirically for language models.

Kaplan et al. (2020): L(N) ≈ (N_c/N)^α_N, L(D) ≈ (D_c/D)^α_D, power laws over many orders of magnitude.
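Because the Kaplan form is a pure power law, it is a straight line on log-log axes, so its exponent and scale constant can be recovered by linear regression. A minimal sketch with synthetic data — the constants `alpha_true` and `Nc_true` are illustrative placeholders, not the paper's fitted values:

```python
import numpy as np

# Synthetic losses following L(N) = (N_c / N)**alpha with illustrative
# constants (NOT Kaplan's fitted values).
alpha_true, Nc_true = 0.076, 8.8e13
N = np.logspace(6, 10, 20)              # model sizes from 1M to 10B params
L = (Nc_true / N) ** alpha_true

# On log-log axes: log L = alpha*log(N_c) - alpha*log(N),
# a straight line in log N with slope -alpha.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha_fit = -slope
Nc_fit = np.exp(intercept / alpha_fit)

print(f"alpha ≈ {alpha_fit:.3f}, N_c ≈ {Nc_fit:.2e}")
```

The same log-log regression applies to L(D); in practice one fits only the large-scale tail, where the power-law regime holds.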
Chinchilla (Hoffmann 2022): L(N,D) ≈ E + A/N^α + B/D^β. Optimal allocation: N ∝ C^0.5, D ∝ C^0.5 — equal scaling.
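Combining the equal C^0.5 scaling with the standard training-cost approximation C ≈ 6·N·D gives a closed-form compute-optimal allocation. A sketch assuming the commonly cited ~20 tokens-per-parameter rule of thumb (the exact ratio depends on the fitted A, B, α, β):

```python
import math

def chinchilla_optimal(C_flops, tokens_per_param=20.0):
    """Compute-optimal (N, D) under C ≈ 6*N*D with N, D ∝ C^0.5.

    tokens_per_param=20 is the widely quoted Chinchilla rule of thumb;
    treating it as a fixed constant is a simplifying assumption.
    """
    # C = 6*N*D and D = r*N  =>  C = 6*r*N^2  =>  N = sqrt(C / (6*r))
    N = math.sqrt(C_flops / (6.0 * tokens_per_param))
    D = tokens_per_param * N
    return N, D

# Roughly Chinchilla's own budget (~70B params, ~1.4T tokens => ~5.9e23 FLOPs)
N, D = chinchilla_optimal(5.88e23)
print(f"N ≈ {N:.2e} params, D ≈ {D:.2e} tokens")
```

Doubling compute under this rule multiplies both N and D by √2, in contrast to Kaplan et al.'s earlier recommendation to grow N much faster than D.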
Emergent abilities: Some capabilities appear sharply (non-power-law) at critical scales; it is debated whether these are artifacts of discontinuous evaluation metrics or genuine phase transitions.