Information Bottleneck
Compression vs. relevant information — the IB curve
Information Bottleneck (Tishby, Pereira & Bialek 1999): given an input X and a relevant variable Y, find a compressed representation T — a stochastic encoder p(t|x) — that minimizes I(X;T) while preserving I(T;Y). One minimizes the Lagrangian L[p(t|x)] = I(X;T) − β·I(T;Y), where the trade-off parameter β controls how much relevant information is retained. The IB curve traces the achievable (I(X;T), I(T;Y)) pairs as β varies; it is concave, and I(T;Y) is bounded above by I(X;Y) ≤ H(Y) (by the data-processing inequality, since T depends on Y only through X). The information plane plots layerwise mutual information during training (Shwartz-Ziv & Tishby 2017): layers first fit (I(T;Y) rises), then compress (I(X;T) decreases). This "two-phase" picture is debated — later work found it sensitive to activation functions and MI estimators — but it remains influential.
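For discrete X and Y the optimal encoder satisfies a set of self-consistent equations, p(t|x) ∝ p(t)·exp(−β·KL(p(y|x) ‖ p(y|t))), which can be solved by alternating (Blahut–Arimoto-style) updates. A minimal sketch, assuming a known joint distribution p(x,y) given as a NumPy array; the function and variable names here are illustrative, not from any library:

```python
import numpy as np

EPS = 1e-12  # clamp to avoid log(0) / division by zero


def mutual_info(pab):
    """I(A;B) in nats from a joint distribution p(a,b)."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float((pab[mask] * np.log(pab[mask] / (pa @ pb)[mask])).sum())


def ib_iterate(pxy, n_t, beta, iters=300, seed=0):
    """Iterate the IB self-consistent equations; return (I(X;T), I(T;Y))."""
    rng = np.random.default_rng(seed)
    nx, ny = pxy.shape
    px = pxy.sum(axis=1)                       # p(x)
    py_x = pxy / np.maximum(px[:, None], EPS)  # p(y|x)
    pt_x = rng.dirichlet(np.ones(n_t), size=nx)  # random encoder p(t|x)
    for _ in range(iters):
        pxt = pt_x * px[:, None]               # joint p(x,t)
        pt = pxt.sum(axis=0)                   # p(t)
        # decoder: p(y|t) = sum_x p(x,t) p(y|x) / p(t)
        py_t = (pxt.T @ py_x) / np.maximum(pt[:, None], EPS)
        # KL(p(y|x) || p(y|t)) for every (x, t) pair
        kl = np.sum(
            py_x[:, None, :]
            * np.log(np.maximum(py_x[:, None, :], EPS)
                     / np.maximum(py_t[None, :, :], EPS)),
            axis=2,
        )
        # encoder update: p(t|x) ∝ p(t) exp(-beta * KL)
        logits = np.log(np.maximum(pt, EPS))[None, :] - beta * kl
        pt_x = np.exp(logits - logits.max(axis=1, keepdims=True))
        pt_x /= pt_x.sum(axis=1, keepdims=True)
    pxt = pt_x * px[:, None]
    pty = pxt.T @ py_x * 1.0                   # p(t,y) = sum_x p(x,t) p(y|x)
    pty = pxt.T @ py_x
    return mutual_info(pxt), mutual_info(pty)
```

Sweeping β and plotting the returned (I(X;T), I(T;Y)) pairs traces out (an approximation of) the IB curve; small β collapses T (cheap but uninformative), large β approaches I(T;Y) = I(X;Y). Because the objective is non-convex in p(t|x), restarts with different seeds may land on different local solutions.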