Information Bottleneck — Tishby Tradeoff

Optimal compression: minimize I(X;T) while preserving I(T;Y) — the information curve

Distribution Setup

Input clusters (|X|): 8
Output classes (|Y|): 3
Bottleneck size (|T|): 4
β (compression ↔ relevance): 2.0
Distribution shape
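The setup above can be mirrored in a few lines of code. The following is a minimal sketch, not the page's own implementation: it assumes a uniformly random joint distribution p(x, y), since the distribution shape is not specified here, and the helper names random_joint and mutual_information are hypothetical. It computes I(X;Y), which is the most relevant information any bottleneck T can retain.

```python
import math
import random


def random_joint(n_x, n_y, seed=0):
    """Random joint table p(x, y); all n_x * n_y entries sum to 1."""
    rng = random.Random(seed)
    raw = [[rng.random() for _ in range(n_y)] for _ in range(n_x)]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]


def mutual_information(joint):
    """Mutual information in bits of a joint table joint[a][b]."""
    p_a = [sum(row) for row in joint]        # marginal over the second index
    p_b = [sum(col) for col in zip(*joint)]  # marginal over the first index
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0.0:
                mi += p * math.log2(p / (p_a[a] * p_b[b]))
    return mi


# Values from the setup; N_T and BETA only matter once an encoder is optimized.
N_X, N_Y, N_T, BETA = 8, 3, 4, 2.0

p_xy = random_joint(N_X, N_Y)  # assumed shape: uniform random entries
i_xy = mutual_information(p_xy)
print(f"I(X;Y) = {i_xy:.4f} bits (upper bound on I(T;Y))")
```

Because T is computed from X alone, I(T;Y) can never exceed the I(X;Y) printed here, whatever the bottleneck size or β.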


The information bottleneck principle (Tishby et al. 1999): find a compressed representation T of X that retains maximum information about Y.

min_{p(t|x)} I(X;T) − β I(T;Y)
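One common way to approach this minimization is to alternate the self-consistent update equations from Tishby et al. (1999): p(t|x) ∝ p(t) exp(−β KL[p(y|x) || p(y|t)]), followed by recomputing p(t) and p(y|t). The sketch below assumes a random toy joint distribution and a random soft initial encoder; the function names (ib_iterate, info_plane_point) and the EPS floor are illustrative choices, and the iteration only finds a local optimum.

```python
import math
import random

EPS = 1e-12  # floor that keeps near-empty clusters numerically safe


def mutual_information(joint):
    """Mutual information in bits of a joint table joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            denom = p_a[a] * p_b[b]
            if p > 0.0 and denom > 0.0:
                mi += p * math.log2(p / denom)
    return mi


def kl(p, q):
    """KL divergence D(p || q) in nats; q must be positive where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def ib_iterate(p_xy, n_t, beta, n_iter=200, seed=0):
    """Alternate the IB self-consistent updates; returns enc[x][t] = p(t|x)."""
    rng = random.Random(seed)
    n_x, n_y = len(p_xy), len(p_xy[0])
    p_x = [sum(row) for row in p_xy]
    p_y = [sum(col) for col in zip(*p_xy)]
    p_y_x = [[v / p_x[x] for v in p_xy[x]] for x in range(n_x)]
    # Random soft assignment as the starting encoder (an assumption).
    enc = []
    for _ in range(n_x):
        row = [rng.random() + 0.1 for _ in range(n_t)]
        enc.append([v / sum(row) for v in row])
    for _ in range(n_iter):
        p_t, p_y_t = [], []
        for t in range(n_t):
            mass = sum(p_x[x] * enc[x][t] for x in range(n_x))
            if mass < EPS:  # dead cluster: fall back to the marginal p(y)
                p_t.append(EPS)
                p_y_t.append(list(p_y))
            else:
                p_t.append(mass)
                p_y_t.append([sum(p_xy[x][y] * enc[x][t] for x in range(n_x)) / mass
                              for y in range(n_y)])
        new_enc = []
        for x in range(n_x):
            # log of p(t) * exp(-beta * KL[p(y|x) || p(y|t)]), then softmax over t
            logits = [math.log(p_t[t]) - beta * kl(p_y_x[x], p_y_t[t])
                      for t in range(n_t)]
            top = max(logits)
            weights = [math.exp(v - top) for v in logits]
            total = sum(weights)
            new_enc.append([w / total for w in weights])
        enc = new_enc
    return enc


def info_plane_point(p_xy, enc):
    """(I(X;T), I(T;Y)) in bits for the encoder enc[x][t] = p(t|x)."""
    n_x, n_y, n_t = len(p_xy), len(p_xy[0]), len(enc[0])
    p_x = [sum(row) for row in p_xy]
    p_xt = [[p_x[x] * enc[x][t] for t in range(n_t)] for x in range(n_x)]
    p_ty = [[sum(p_xy[x][y] * enc[x][t] for x in range(n_x)) for y in range(n_y)]
            for t in range(n_t)]
    return mutual_information(p_xt), mutual_information(p_ty)


# Hypothetical toy data: a random joint with |X| = 8, |Y| = 3.
rng = random.Random(1)
raw = [[rng.random() for _ in range(3)] for _ in range(8)]
norm = sum(map(sum, raw))
p_xy = [[v / norm for v in row] for row in raw]

encoder = ib_iterate(p_xy, n_t=4, beta=2.0)
i_xt, i_ty = info_plane_point(p_xy, encoder)
print(f"I(X;T) = {i_xt:.4f} bits, I(T;Y) = {i_ty:.4f} bits")
```

The update is computed in log space and normalized with a softmax so that large β values do not overflow. Running it for several values of β and plotting the resulting (I(X;T), I(T;Y)) pairs gives an approximation of the information curve.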

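The IB curve traces the Pareto-optimal pairs (I(X;T), I(T;Y)) as β varies. As β→0, compression dominates: T collapses to a single value and ignores Y. As β→∞, relevance dominates: T keeps all the information about Y that X carries, which T = X (no compression) achieves whenever |T| ≥ |X|.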
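The two ends of the curve can be checked directly, without any optimization, by evaluating two hand-built encoders. This is a sketch under stated assumptions: the joint distribution is random toy data, the helper info_plane_point is hypothetical, and the identity encoder uses |T| = |X| = 8 rather than the bottleneck size from the setup.

```python
import math
import random


def mutual_information(joint):
    """Mutual information in bits of a joint table joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0.0:
                mi += p * math.log2(p / (p_a[a] * p_b[b]))
    return mi


def info_plane_point(p_xy, enc):
    """(I(X;T), I(T;Y)) in bits for the encoder enc[x][t] = p(t|x)."""
    n_x, n_y, n_t = len(p_xy), len(p_xy[0]), len(enc[0])
    p_x = [sum(row) for row in p_xy]
    p_xt = [[p_x[x] * enc[x][t] for t in range(n_t)] for x in range(n_x)]
    p_ty = [[sum(p_xy[x][y] * enc[x][t] for x in range(n_x)) for y in range(n_y)]
            for t in range(n_t)]
    return mutual_information(p_xt), mutual_information(p_ty)


# Assumed toy data: random joint with |X| = 8, |Y| = 3.
rng = random.Random(0)
raw = [[rng.random() for _ in range(3)] for _ in range(8)]
norm = sum(map(sum, raw))
p_xy = [[v / norm for v in row] for row in raw]

# Small-beta end: every x maps to the same single value of T.
collapsed = [[1.0] for _ in range(8)]
# Large-beta end: T is a copy of X (needs |T| = |X|).
identity = [[1.0 if t == x else 0.0 for t in range(8)] for x in range(8)]

low = info_plane_point(p_xy, collapsed)
high = info_plane_point(p_xy, identity)
print("collapsed encoder:", low)
print("identity encoder: ", high)

# The Lagrangian I(X;T) - beta * I(T;Y) decides which end is preferred.
for beta in (0.5, 2.0, 50.0):
    print(beta, low[0] - beta * low[1], high[0] - beta * high[1])
```

The collapsed encoder sits at the origin of the information plane, while the identity encoder sits at (H(X), I(X;Y)). Every other point on the curve lies between these two.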

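Connection to deep learning: each hidden layer can be analyzed as a point (I(X;T), I(T;Y)) in the information plane and compared against this curve (Shwartz-Ziv & Tishby 2017). Connection to rate-distortion theory: IB is a rate-distortion problem in which the distortion between x and t is the KL divergence between p(y|x) and p(y|t), so the IB curve plays the role of the rate-distortion curve.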
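The rate-distortion link can be made explicit with a short derivation. It assumes the Markov chain T − X − Y (T depends on Y only through X) and takes the distortion to be d(x,t) = D_KL[p(y|x) || p(y|t)]; the notation below is chosen for illustration.

```latex
\begin{align*}
\mathbb{E}_{p(x,t)}\big[ d(x,t) \big]
  &= \sum_{x,t,y} p(x)\, p(t|x)\, p(y|x) \log \frac{p(y|x)}{p(y|t)} \\
  &= H(Y|T) - H(Y|X) \\
  &= I(X;Y) - I(T;Y), \\[4pt]
I(X;T) + \beta\, \mathbb{E}\big[ d(x,t) \big]
  &= I(X;T) - \beta\, I(T;Y) + \beta\, I(X;Y).
\end{align*}
```

Since I(X;Y) does not depend on the encoder p(t|x), minimizing rate plus β times expected distortion is the same problem as minimizing I(X;T) − β I(T;Y).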

Rate-distortion

Minimum sufficiency
