Zipf's Law

Rank-frequency power law: the r-th most common word appears ~ C/r^α times

About: Zipf's law (George Kingsley Zipf, 1949) states that in natural language the frequency of a word is inversely proportional to a power of its frequency rank: f(r) ≈ C/r^α, with α ≈ 1 for English. Taking logarithms gives log f = log C − α log r, so on a log-log plot the relationship is a straight line with slope −α. Similar power laws have been reported for city populations (Gabaix 1999), income distributions (Pareto), protein lengths, and earthquake magnitudes (Gutenberg-Richter). The explanation remains debated: Simon (1955) proposed a rich-get-richer process, now usually called preferential attachment; other accounts invoke maximum entropy or the statistics of concatenating independently drawn symbols (Mandelbrot 1953). With α = 1, the law implies that the most frequent word ("the") appears about twice as often as the 2nd ("of"), about three times as often as the 3rd, and so on. The fitted exponent α varies with language and genre and often differs from 1.
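
The log-log relationship suggests one simple way to estimate α from a word list: count the words, sort the counts by rank, and fit a straight line to log f against log r. The sketch below is a minimal illustration under that assumption, not the method used by any particular study; the function names `rank_frequency` and `fit_zipf_exponent` are hypothetical, and the input is synthetic data that follows C/r exactly.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return word counts sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(tokens).values(), reverse=True)


def fit_zipf_exponent(frequencies):
    """Estimate (alpha, C) by least squares on log f = log C - alpha * log r."""
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The fitted slope is -alpha; the intercept is log C.
    return -slope, math.exp(intercept)


# Synthetic check (assumed values): frequencies of exactly C / r with C = 1200.
ideal = [1200 / r for r in range(1, 51)]
alpha, c = fit_zipf_exponent(ideal)
print(round(alpha, 3), round(c, 1))

# With real text, pass the sorted counts instead:
# alpha, c = fit_zipf_exponent(rank_frequency(text.lower().split()))
```

On the synthetic input the fit recovers α ≈ 1 and C ≈ 1200. On real corpora a least-squares fit on log-log axes is only a rough estimate, because the many low-frequency words in the tail pull on the slope; maximum-likelihood estimators are generally considered more reliable.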