[Interactive demo: sliders for the Lagrange multiplier β (β→0: maximal compression; β→∞: minimal compression), channel noise σ, and number of source clusters K; live readouts show I(X;T) and I(T;Y) in bits, the objective value I(X;T) − β·I(T;Y), and the efficiency in %.]
IB Objective

min I(X;T) − β·I(T;Y)

Tishby, Pereira, and Bialek (2000): the optimal encoder p(t|x) is found by alternating Blahut–Arimoto-like iterations over the self-consistent equations for p(t), p(y|t), and p(t|x).
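A minimal sketch of those alternating iterations on a discrete toy joint distribution (this is an illustrative NumPy implementation of the self-consistent updates, not the authors' reference code; the function name and smoothing constant are my own choices):

```python
import numpy as np

def iterative_ib(p_xy, n_t, beta, n_iter=200, seed=0):
    """Blahut-Arimoto-style iterations for min I(X;T) - beta*I(T;Y).

    p_xy : (n_x, n_y) joint distribution of X and Y (sums to 1).
    n_t  : cardinality of the bottleneck variable T.
    Returns the soft encoder p(t|x) as an (n_x, n_t) array.
    """
    rng = np.random.default_rng(seed)
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(axis=1)                         # marginal p(x)
    p_y_given_x = p_xy / p_x[:, None]              # conditional p(y|x)

    # random soft assignment p(t|x); each row sums to 1
    p_t_given_x = rng.random((n_x, n_t))
    p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_t = p_x @ p_t_given_x                    # p(t) = sum_x p(x) p(t|x)
        # decoder: p(y|t) = sum_x p(t|x) p(x,y) / p(t)
        p_y_given_t = (p_t_given_x.T @ p_xy) / p_t[:, None]
        # D_KL(p(y|x) || p(y|t)) for every (x, t) pair
        log_ratio = (np.log(p_y_given_x[:, None, :] + 1e-12)
                     - np.log(p_y_given_t[None, :, :] + 1e-12))
        kl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)
        # self-consistent encoder update: p(t|x) ∝ p(t) exp(-beta * KL)
        p_t_given_x = p_t[None, :] * np.exp(-beta * kl)
        p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)
    return p_t_given_x
```

Larger β penalizes the KL term more strongly, so the encoder stays closer to a deterministic, relevance-preserving assignment; small β lets p(t|x) collapse toward p(t), i.e. maximal compression.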

In deep learning (Shwartz-Ziv & Tishby, 2017): each layer is treated as a compressed representation T of the input. Training moves along the IB curve in two phases: first fitting, during which I(T;Y) increases, then compression, during which I(X;T) decreases.
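Tracking those two phases requires estimating the mutual informations from samples of layer activations. A rough histogram-based estimator in bits (a sketch only; the original information-plane analyses use more careful binning and estimators, and the function name is my own):

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Estimate I(X;Y) in bits from paired samples via a 2-D histogram.

    Crude plug-in estimator: discretize both variables into `bins` bins,
    then compute sum p(x,y) log2( p(x,y) / (p(x) p(y)) ).
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                  # empirical joint distribution
    px = pxy.sum(axis=1, keepdims=True)        # marginal of x
    py = pxy.sum(axis=0, keepdims=True)        # marginal of y
    nz = pxy > 0                               # avoid log(0)
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())
```

Applied per layer during training (with x a scalar summary of the input or activation and y the label or another activation), the pair (I(X;T), I(T;Y)) traces a trajectory in the information plane; the fitting phase moves it upward, the compression phase moves it leftward.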