Information Bottleneck

β (relevance weight) = 1.0 Cluster width σ = 0.8

I(X;T): — bits

I(T;Y): — bits

β: 1.0

Minimize: I(X;T) − β·I(T;Y)

β → 0: max compression
β → ∞: max relevance

IB curve = optimal
trade-off between
compression & prediction

Information Bottleneck — Tishby's Principle