Frequency Analysis
Break a substitution cipher the way the Arab mathematician al-Kindi did 1200 years ago — by counting letter frequencies. Type or paste ciphertext and watch the statistical fingerprint reveal the hidden message. Click a ciphertext letter, then click the English letter you think it maps to.
About this experiment
Frequency analysis is the study of how often letters (or groups of letters) appear in a text, and it is the oldest known technique for breaking ciphers. The method was first described by the Arab polymath Abu Yusuf Ya'qub ibn Ishaq al-Kindi around 850 CE in his remarkable treatise A Manuscript on Deciphering Cryptographic Messages. Al-Kindi observed that in any natural language, certain letters appear far more frequently than others, and that this statistical fingerprint survives encryption by simple substitution. In English, the letter E accounts for roughly 12.7% of all letters, followed by T (9.1%), A (8.2%), O (7.5%), and so on. A substitution cipher replaces each letter with a different letter, but it preserves these relative frequencies — the most common letter in the ciphertext is very likely to be E, the next most common is likely T, and so forth.
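The frequency figures quoted above can be written down directly. The exact percentages vary slightly from corpus to corpus, so the values in this sketch are the commonly quoted approximations (an assumption, not data from this page):

```python
# Approximate English letter frequencies in percent. These are the
# commonly quoted estimates (E ~12.7%, T ~9.1%, ...); real corpora
# differ slightly in the lower ranks.
ENGLISH_FREQ = {
    'E': 12.7, 'T': 9.1, 'A': 8.2, 'O': 7.5, 'I': 7.0, 'N': 6.7,
    'S': 6.3, 'H': 6.1, 'R': 6.0, 'D': 4.3, 'L': 4.0, 'C': 2.8,
    'U': 2.8, 'M': 2.4, 'W': 2.4, 'F': 2.2, 'G': 2.0, 'Y': 2.0,
    'P': 1.9, 'B': 1.5, 'V': 1.0, 'K': 0.8, 'J': 0.15, 'X': 0.15,
    'Q': 0.1, 'Z': 0.07,
}

# Rank the letters from most to least common.
ranked = sorted(ENGLISH_FREQ, key=ENGLISH_FREQ.get, reverse=True)
print(''.join(ranked[:5]))  # prints "ETAOI"
```

Sorting this table reproduces the familiar ETAOIN ordering that frequency analysis relies on.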
The power of this technique lies in the remarkable stability of letter frequencies across large samples of English text. Whether you analyze a Dickens novel, a physics textbook, or a collection of newspaper editorials, the frequency distribution is nearly identical. This stability arises from the deep structure of the language: the most common words (the, of, and, to, a) are short and heavily weighted toward certain letters. Even a few hundred characters of ciphertext usually contain enough statistical signal to begin cracking the code. The bar chart above shows the ciphertext frequencies alongside the expected English frequencies — matching peaks between the two distributions is the key to breaking the cipher.
For centuries, simple substitution ciphers were considered unbreakable, and they were used by governments, military commanders, and secret societies to protect their most sensitive communications. Al-Kindi’s frequency analysis shattered that illusion, though his work was not widely known in Europe until much later. The technique eventually drove the development of more sophisticated ciphers, particularly the polyalphabetic cipher attributed to Blaise de Vigenère (1586), which uses multiple substitution alphabets to flatten the frequency distribution. The intellectual arms race between codemakers and codebreakers that al-Kindi initiated continues to this day, now involving the mathematics of prime factorization, elliptic curves, and quantum computing. Claude Shannon’s 1949 paper “Communication Theory of Secrecy Systems” formalized the relationship between frequency, redundancy, and the information content of a message — the same principles that al-Kindi exploited intuitively a millennium earlier.
How frequency analysis works
A simple substitution cipher replaces each letter of the alphabet with exactly one other letter. The key is a permutation of the 26 letters — for example, A→Q, B→W, C→E, and so on. There are 26! (about 4 × 10^26) possible keys, making brute-force search infeasible even for modern computers. But frequency analysis cuts through this enormous keyspace by exploiting a fundamental weakness: the cipher preserves the statistical properties of the underlying language.
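Such a cipher is short enough to sketch in full. The helper names here (`make_key`, `encrypt`, `decrypt`) are hypothetical, not part of the experiment above:

```python
import random
import string

def make_key(seed=None):
    # A key is a permutation of the 26 letters: each plaintext letter
    # maps to exactly one ciphertext letter.
    rng = random.Random(seed)
    letters = list(string.ascii_uppercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    return dict(zip(letters, shuffled))

def encrypt(plaintext, key):
    # Substitute each letter; pass spaces and punctuation through.
    return ''.join(key.get(c, c) for c in plaintext.upper())

def decrypt(ciphertext, key):
    # Invert the permutation to recover the plaintext.
    inverse = {v: k for k, v in key.items()}
    return ''.join(inverse.get(c, c) for c in ciphertext.upper())

key = make_key(seed=42)
ciphertext = encrypt("ATTACK AT DAWN", key)
```

Note that `encrypt` maps every occurrence of a letter to the same ciphertext letter, which is exactly why the plaintext's frequency profile survives intact.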
The procedure is straightforward. First, count the frequency of each letter in the ciphertext. Then compare these frequencies to the known distribution for the target language. The most frequent ciphertext letter probably encodes E; the next most frequent probably encodes T or A. After mapping the most common letters, you can look for common short words (the, and, is) and common letter patterns (TH, HE, IN, ER) to refine your guesses. Each correct mapping reveals more of the plaintext, which in turn suggests further mappings. The process is iterative and deeply satisfying — like solving a crossword puzzle where the clues are statistical.
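The counting-and-ranking steps above can be sketched as follows. The function names and the exact frequency ordering string are assumptions of this sketch; published rankings differ slightly in the rarer letters:

```python
from collections import Counter

# English letters from most to least common (approximate ordering).
ENGLISH_ORDER = "ETAOINSHRDLCUMWFGYPBVKJXQZ"

def frequency_guess(ciphertext):
    # Step 1: count each letter in the ciphertext.
    counts = Counter(c for c in ciphertext.upper() if c.isalpha())
    # Step 2: rank ciphertext letters by frequency.
    ranked = [letter for letter, _ in counts.most_common()]
    # Step 3: as a first guess, map the i-th most common ciphertext
    # letter to the i-th most common English letter.
    return dict(zip(ranked, ENGLISH_ORDER))

def apply_guess(ciphertext, mapping):
    # Decode with the current mapping, leaving unmapped symbols as-is.
    return ''.join(mapping.get(c, c) for c in ciphertext.upper())

# Toy example: X is the most common ciphertext letter, so guess E.
mapping = frequency_guess("XYZZXX")
print(mapping["X"])  # prints "E"
```

The first-pass guess rarely decodes everything correctly; in practice you then refine the mapping by checking short words and digraphs, exactly as the paragraph above describes.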
This technique has direct connections to information theory. Shannon showed that English text has a redundancy of about 75% — meaning that most of the information in a message is predictable from context. This redundancy is precisely what makes frequency analysis possible: if English were perfectly random (every letter equally likely), there would be no frequency signature to exploit, and substitution ciphers would be unbreakable.
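The gap between random and actual letter usage can be quantified with a unigram entropy calculation. The frequency values below are approximate published estimates (an assumption of this sketch), and single-letter statistics capture only part of the structure; Shannon's ~75% redundancy figure also counts digraphs, words, and longer context:

```python
import math

# Approximate English unigram frequencies in percent (assumed values).
FREQ = dict(zip(
    "ETAOINSHRDLCUMWFGYPBVKJXQZ",
    [12.7, 9.1, 8.2, 7.5, 7.0, 6.7, 6.3, 6.1, 6.0, 4.3, 4.0, 2.8,
     2.8, 2.4, 2.4, 2.2, 2.0, 2.0, 1.9, 1.5, 1.0, 0.8, 0.15, 0.15,
     0.1, 0.07],
))

total = sum(FREQ.values())
# Entropy of the single-letter distribution, in bits per letter.
H = -sum((p / total) * math.log2(p / total) for p in FREQ.values())
H_uniform = math.log2(26)  # ~4.70 bits if every letter were equally likely

print(f"unigram entropy = {H:.2f} bits vs {H_uniform:.2f} bits uniform")
```

The unigram entropy comes out near 4.2 bits per letter, already below the uniform 4.70 bits; it is this shortfall, compounded over longer contexts, that gives frequency analysis its signal.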
From al-Kindi to the Enigma
Al-Kindi’s breakthrough created a centuries-long arms race. Once frequency analysis became known, cipher designers responded with polyalphabetic systems that use multiple substitution alphabets in rotation. The Vigenère cipher (1586) was called “le chiffre indéchiffrable” and resisted cryptanalysis for nearly 300 years, until Charles Babbage (privately, in the 1850s) and Friedrich Kasiski (in print, in 1863) independently broke it — again using frequency analysis, but applied separately to the ciphertext letters enciphered by each position of the repeating key.
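Kasiski's key step, locating repeated ciphertext fragments and taking the greatest common divisor of their separations, can be sketched like this (the function names are hypothetical):

```python
from collections import defaultdict
from functools import reduce
from math import gcd

def kasiski_distances(ciphertext, n=3):
    # Repeated n-grams in the ciphertext usually come from the same
    # plaintext fragment landing at the same key position, so the key
    # length tends to divide the distance between occurrences.
    text = ''.join(c for c in ciphertext.upper() if c.isalpha())
    last_seen = {}
    distances = []
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if gram in last_seen:
            distances.append(i - last_seen[gram])
        last_seen[gram] = i
    return distances

def likely_key_length(ciphertext):
    # The gcd of the observed distances is a candidate key length.
    d = kasiski_distances(ciphertext)
    return reduce(gcd, d) if d else None
```

Once the key length k is known, every k-th letter was enciphered with the same alphabet, so each of the k sub-texts falls to ordinary single-alphabet frequency analysis.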
The ultimate expression of this arms race was the Enigma machine, used by Nazi Germany in World War II. Enigma was essentially a polyalphabetic cipher with an astronomical number of possible settings (approximately 1.59 × 10^20 daily key combinations). Yet the codebreakers at Bletchley Park, led by Alan Turing, found ways to exploit the same kind of statistical regularities that al-Kindi had identified. Repeated message structures, known plaintext fragments, and the fact that no letter could encrypt to itself all provided the statistical leverage needed to crack the system. The fundamental insight remained the same across twelve centuries: natural language is not random, and any cipher that preserves its structure can be broken by analyzing that structure statistically.