[Interactive demo: sliders for the Lagrange multiplier β (β→0: maximal compression; β→∞: minimal compression), channel noise σ, and number of source clusters K; live readouts show I(X;T) and I(T;Y) in bits, the objective value I(X;T) − β·I(T;Y), and the efficiency in %.]
IB Objective

min I(X;T) − β·I(T;Y)

Tishby, Pereira, and Bialek (2000): the optimal encoder p(t|x) is found by alternating Blahut–Arimoto-like iterations over the self-consistent equations for p(t), p(y|t), and p(t|x).
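A minimal sketch of those alternating iterations on a discrete toy joint distribution (this is an illustrative NumPy implementation of the self-consistent updates, not the authors' reference code; the function name and smoothing constant are my own choices):

```python
import numpy as np

def iterative_ib(p_xy, n_t, beta, n_iter=200, seed=0):
    """Blahut-Arimoto-style iterations for min I(X;T) - beta*I(T;Y).

    p_xy : (n_x, n_y) joint distribution of X and Y (sums to 1).
    n_t  : cardinality of the bottleneck variable T.
    Returns the soft encoder p(t|x) as an (n_x, n_t) array.
    """
    rng = np.random.default_rng(seed)
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(axis=1)                         # marginal p(x)
    p_y_given_x = p_xy / p_x[:, None]              # conditional p(y|x)

    # random soft assignment p(t|x); each row sums to 1
    p_t_given_x = rng.random((n_x, n_t))
    p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_t = p_x @ p_t_given_x                    # p(t) = sum_x p(x) p(t|x)
        # decoder: p(y|t) = sum_x p(t|x) p(x,y) / p(t)
        p_y_given_t = (p_t_given_x.T @ p_xy) / p_t[:, None]
        # D_KL(p(y|x) || p(y|t)) for every (x, t) pair
        log_ratio = (np.log(p_y_given_x[:, None, :] + 1e-12)
                     - np.log(p_y_given_t[None, :, :] + 1e-12))
        kl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)
        # self-consistent encoder update: p(t|x) ∝ p(t) exp(-beta * KL)
        p_t_given_x = p_t[None, :] * np.exp(-beta * kl)
        p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)
    return p_t_given_x
```

Larger β penalizes the KL term more strongly, so the encoder stays closer to a deterministic, relevance-preserving assignment; small β lets p(t|x) collapse toward p(t), i.e. maximal compression.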

In deep learning (Shwartz-Ziv & Tishby, 2017): each layer is treated as a compressed representation T of the input. Training moves along the IB curve in two phases: first fitting, during which I(T;Y) increases, then compression, during which I(X;T) decreases.
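Tracking those two phases requires estimating the mutual informations from samples of layer activations. A rough histogram-based estimator in bits (a sketch only; the original information-plane analyses use more careful binning and estimators, and the function name is my own):

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Estimate I(X;Y) in bits from paired samples via a 2-D histogram.

    Crude plug-in estimator: discretize both variables into `bins` bins,
    then compute sum p(x,y) log2( p(x,y) / (p(x) p(y)) ).
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                  # empirical joint distribution
    px = pxy.sum(axis=1, keepdims=True)        # marginal of x
    py = pxy.sum(axis=0, keepdims=True)        # marginal of y
    nz = pxy > 0                               # avoid log(0)
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())
```

Applied per layer during training (with x a scalar summary of the input or activation and y the label or another activation), the pair (I(X;T), I(T;Y)) traces a trajectory in the information plane; the fitting phase moves it upward, the compression phase moves it leftward.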