Huffman Coding
David Huffman’s 1952 algorithm builds an optimal prefix-free binary tree from character frequencies. Common characters get short codes, rare characters get long ones, and the result lands remarkably close to the theoretical minimum: Shannon entropy.
In 1952, David Huffman was a graduate student at MIT taking an information theory course with Robert Fano. The final exam offered a choice: take a traditional test, or find the most efficient binary code. Fano himself had been trying — his Shannon-Fano coding was good but not provably optimal. Huffman, working from the bottom up instead of the top down, found the answer. His term paper became one of the most cited in the history of computer science.
Huffman coding is a variable-length, prefix-free encoding. Instead of giving every character the same number of bits (like ASCII’s 8 bits per character), it assigns shorter codes to more frequent characters and longer codes to rarer ones. The result is lossless compression — the original data can be perfectly reconstructed from the compressed form.
The key property is that the code is prefix-free: no code word is a prefix of any other code word. This means the encoded bitstream can be decoded unambiguously, reading bit by bit, without needing any delimiters between characters.
The algorithm is greedy and elegant. Start by counting the frequency of each character. Create a leaf node for each character and place all nodes in a priority queue ordered by frequency (lowest first). Then repeat: remove the two nodes with the lowest frequency, create a new internal node whose frequency is their sum, make the two removed nodes its children, and insert the new node back into the queue. Continue until only one node remains — that is the root of the Huffman tree.
To read off the codes, walk from the root to each leaf. Every time you go left, append “0”; every time you go right, append “1”. The resulting binary string is that character’s Huffman code. Characters deep in the tree get longer codes; characters near the root get shorter ones. And because the tree is built from the bottom up, the rarest characters end up deepest.
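The build-and-walk procedure above can be sketched in Python. This is a minimal illustration, not a production encoder: the heap entries carry an insertion counter purely as a tie-breaker, so among equally optimal trees the exact codes you get may differ from another implementation's.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman tree from character frequencies and return a
    {char: bitstring} code table."""
    freq = Counter(text)
    # Heap entries are (frequency, tie_breaker, node). A node is either
    # a leaf (a character) or an internal node (a (left, right) pair).
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Remove the two lowest-frequency nodes and merge them under
        # a new internal node whose frequency is their sum.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))
        counter += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):           # internal node
            walk(node[0], prefix + "0")       # left edge appends "0"
            walk(node[1], prefix + "1")       # right edge appends "1"
        else:                                 # leaf: record the code
            codes[node] = prefix or "0"       # single-character edge case
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("abracadabra")
```

For "abracadabra" the most frequent character, "a", ends up one bit deep, while the rare "c" and "d" sink to depth three, for a total of 23 bits against 88 bits of 8-bit ASCII.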
The greedy strategy — always merging the two smallest frequencies — produces a provably optimal prefix-free code. No other prefix-free code can compress the data further, given the same character frequencies.
A code is prefix-free (also called an instantaneous code) if no code word is a prefix of any other. This is what makes Huffman coding immediately decodable: as you read the bitstream left to right, the moment you match a valid code word, you know you have found a character boundary. You never need to look ahead.
Compare this with Morse code, which is not prefix-free. In Morse, the code for “E” is a single dot, which is a prefix of the code for “I” (two dots). Without the pauses between letters, a Morse signal would be ambiguous. Huffman codes never have this problem.
Every prefix-free code corresponds to a binary tree where code words are leaf nodes. No leaf can be an ancestor of another leaf, which is precisely the prefix-free property. This is why the Huffman tree works — every character sits at a leaf, and the path from root to leaf spells out the code.
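The no-lookahead decoding property is easy to see in code. The sketch below uses a small hand-made prefix-free table for illustration (the codes `'0'`, `'10'`, `'11'` are an assumed example, not output of any particular tree): the moment the accumulated bits match a valid code word, a character is emitted and the accumulator resets.

```python
def decode(bits, codes):
    """Decode a bitstream with a prefix-free {char: code} table.
    The first match as we extend the current chunk is always a
    complete character -- no lookahead, no delimiters."""
    inverse = {code: ch for ch, code in codes.items()}
    out, chunk = [], ""
    for bit in bits:
        chunk += bit
        if chunk in inverse:          # valid code word: character boundary
            out.append(inverse[chunk])
            chunk = ""
    assert chunk == "", "bitstream ended mid-code word"
    return "".join(out)

# An illustrative prefix-free table: no code is a prefix of another.
codes = {"a": "0", "b": "10", "c": "11"}
print(decode("010011", codes))   # -> abac
```

With Morse's "E" = dot and "I" = dot-dot, the `if chunk in inverse` test would fire too early on "I" — exactly the ambiguity the prefix-free property rules out.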
Shannon’s 1948 entropy formula gives the theoretical minimum average bits per character: H = −Σ p(x) log₂ p(x), where p(x) is the probability of each character. No lossless compression scheme can do better than this on average. Huffman coding gets within one bit per character of this bound — and often much closer.
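The entropy formula translates directly into a few lines of Python, shown here as a quick way to measure the bound for any string:

```python
import math
from collections import Counter

def entropy(text):
    """Shannon entropy H = -sum p(x) * log2 p(x), in bits per character."""
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())

# "aab": p(a) = 2/3, p(b) = 1/3, so H is about 0.918 bits per character --
# well under the 1 bit per character any prefix-free code must spend here.
print(entropy("aab"))
```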
When does Huffman coding struggle? When the alphabet is very small. With only two characters, each code word must be at least one bit, even if one character is far more common. In the extreme case of a single character repeated, no compression is possible with Huffman coding alone, even though the entropy is zero. This is the gap that arithmetic coding fills — it can achieve fractional bits per character, approaching entropy even for small alphabets.
For large alphabets with frequencies that are powers of 1/2 (like 1/2, 1/4, 1/8...), Huffman coding achieves entropy exactly. The further the frequencies are from powers of two, the larger the gap. But in practice, for natural language text, Huffman coding typically achieves compression within a few percent of the theoretical minimum.
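The dyadic case can be checked with a little arithmetic. For a source whose probabilities are all powers of 1/2, the ideal code length −log₂ p(x) is an integer for every symbol, and Huffman assigns exactly those lengths, so average code length equals entropy. A sketch, using an assumed four-symbol distribution:

```python
import math

# A dyadic distribution: every probability is a power of 1/2.
probs = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}

# Shannon entropy in bits per symbol.
H = -sum(p * math.log2(p) for p in probs.values())

# For a dyadic source, -log2(p) is a whole number of bits per symbol,
# and these are exactly the code lengths a Huffman tree produces.
avg_len = sum(p * -math.log2(p) for p in probs.values())

print(H, avg_len)   # both are 1.75: the entropy bound is met exactly
```

Nudge any probability off a power of two and −log₂ p(x) becomes fractional; Huffman must round to whole bits, and the gap to entropy opens up.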
Huffman coding is everywhere, usually as one stage in a larger compression pipeline. JPEG images use Huffman coding after the discrete cosine transform and quantization steps. MP3 audio uses it after the modified DCT and psychoacoustic model. DEFLATE, the algorithm behind ZIP files and gzip, combines LZ77 dictionary coding with Huffman coding.
Fax machines use a variant called modified Huffman coding, optimized for the runs of black and white pixels typical of scanned documents. The HTTP/2 protocol uses Huffman coding for header compression (HPACK). PNG images use DEFLATE, which uses Huffman. It is the most widely deployed compression algorithm in human history.
Arithmetic coding can outperform Huffman on sources with very skewed distributions or small alphabets, because it can assign fractional bits per symbol. Modern formats like JPEG 2000 and H.265/HEVC use arithmetic coding (or its patent-free cousin, ANS) instead. But Huffman remains dominant where simplicity, speed, and patent-free status matter.