Information Bottleneck
The IB principle: compress X into a representation T while preserving information about Y, by minimizing over the encoder p(t|x):
min_{p(t|x)} I(X;T) − β·I(T;Y)
Controls (defaults shown):
- β (compression tradeoff): 2.0
- |X| (source alphabet size): 4
- |Y| (target alphabet size): 3
- Noise level: 0.2
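The demo does not show how p(X,Y) is generated from these controls. A minimal sketch, assuming a uniform p(x) and a noisy deterministic map x → x mod |Y| (the function name `make_joint` and this construction are assumptions, not the demo's actual code):

```python
import numpy as np

def make_joint(nx=4, ny=3, noise=0.2):
    """Build a joint p(x,y) matching the demo's defaults: each source symbol x
    maps to target y = x mod ny with probability 1 - noise, and the remaining
    `noise` mass is spread uniformly over the other targets; p(x) is uniform."""
    py_given_x = np.full((nx, ny), noise / (ny - 1))
    py_given_x[np.arange(nx), np.arange(nx) % ny] = 1.0 - noise
    return py_given_x / nx  # p(x,y) = p(x) * p(y|x) with p(x) = 1/nx

pxy = make_joint()
```

Each row of `pxy` sums to 1/|X|, so the whole matrix sums to 1, as a joint distribution must.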
Actions: New Joint Distribution, Run IB Algorithm
Metrics (filled in after the algorithm runs):
- I(X;Y) — total information the source carries about the target
- I(X;T) — compression cost: how much of X the representation T retains
- I(T;Y) — relevance: how much T preserves about Y
- IB efficiency — the fraction of the total information captured, I(T;Y)/I(X;Y)
p(X,Y) joint distribution
Each row corresponds to a source symbol X=x and each column to a target symbol Y=y. The IB algorithm finds a (generally soft) representation T that clusters the values of X by their relevance to Y.
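The page does not include the algorithm itself. A self-contained sketch of the standard iterative IB updates (Blahut-Arimoto style) for a discrete joint distribution, which alternates p(t), p(y|t), and p(t|x) until convergence; the function names and the fixed iteration count are choices for this sketch:

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in bits from a joint-distribution array p(x,y)."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log2(pxy[mask] / (px * py)[mask])).sum())

def ib(pxy, n_t, beta, n_iter=300, seed=0, eps=1e-12):
    """Iterative IB for a discrete joint p(x,y).

    Alternates the self-consistent equations
      p(t)    = sum_x p(x) p(t|x)
      p(y|t)  = sum_x p(t|x) p(x,y) / p(t)
      p(t|x) ∝ p(t) * 2^(-beta * KL(p(y|x) || p(y|t)))
    and returns (I(X;T), I(T;Y)).
    """
    rng = np.random.default_rng(seed)
    nx, ny = pxy.shape
    px = pxy.sum(axis=1)                       # marginal p(x)
    py_x = pxy / px[:, None]                   # conditional p(y|x)
    pt_x = rng.random((nx, n_t))
    pt_x /= pt_x.sum(axis=1, keepdims=True)    # random initial encoder p(t|x)
    for _ in range(n_iter):
        pt = np.maximum(px @ pt_x, eps)                       # p(t)
        py_t = (pt_x * px[:, None]).T @ py_x / pt[:, None]    # p(y|t)
        # KL(p(y|x) || p(y|t)) in bits, for every (x, t) pair
        log_ratio = np.log2(np.maximum(py_x, eps)[:, None, :] /
                            np.maximum(py_t, eps)[None, :, :])
        kl = (py_x[:, None, :] * log_ratio).sum(axis=2)       # shape (nx, n_t)
        pt_x = pt[None, :] * np.exp2(-beta * kl)
        pt_x /= pt_x.sum(axis=1, keepdims=True)
    pxt = px[:, None] * pt_x                   # joint p(x,t)
    pty = (pt_x * px[:, None]).T @ py_x        # joint p(t,y)
    return mutual_information(pxt), mutual_information(pty)
```

By the data-processing inequality on the Markov chain T - X - Y, any output satisfies I(T;Y) ≤ I(X;Y) and I(T;Y) ≤ I(X;T); larger β trades more compression (higher I(X;T)) for more relevance (higher I(T;Y)).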