Exploration-Exploitation Tradeoff

Controls (defaults): Arms k: 8 · ε (greedy): 0.10 · UCB c: 2.0 · Boltzmann τ: 0.5 · Steps/frame: 20

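All four strategies run simultaneously on the same k-armed bandit, so their curves are directly comparable. Regret = reward from always pulling the optimal arm − reward actually achieved; lower is better.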
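As a rough illustration of what such a comparison involves, here is a minimal Python sketch of the four strategies on one shared bandit. It is not the code behind this demo. The function name `run_bandit`, the Gaussian arms with unit-variance noise, the UCB bonus of the form c·sqrt(ln t / n), the N(0, 1) prior used for Thompson sampling, and the use of expected (pseudo-)regret are all assumptions made for the example. Only the default values of ε, c, τ, and k come from the controls above.

```python
import math
import random

STRATEGIES = ("eps_greedy", "ucb", "thompson", "boltzmann")


def run_bandit(strategy, means, steps, rng, eps=0.10, c=2.0, tau=0.5):
    """Run one strategy on a Gaussian k-armed bandit; return cumulative pseudo-regret."""
    k = len(means)
    counts = [0] * k      # number of pulls per arm
    values = [0.0] * k    # running mean reward per arm
    best = max(means)
    regret = 0.0
    for t in range(1, steps + 1):
        if strategy == "eps_greedy":
            # Explore uniformly with probability eps, otherwise exploit.
            if rng.random() < eps:
                arm = rng.randrange(k)
            else:
                arm = max(range(k), key=lambda a: values[a])
        elif strategy == "ucb":
            # Pull each arm once, then maximise estimate plus exploration bonus.
            untried = [a for a in range(k) if counts[a] == 0]
            if untried:
                arm = untried[0]
            else:
                arm = max(
                    range(k),
                    key=lambda a: values[a] + c * math.sqrt(math.log(t) / counts[a]),
                )
        elif strategy == "thompson":
            # Assumed model: N(0, 1) prior on each mean, unit-variance noise.
            # Posterior is N(sum / (n + 1), 1 / (n + 1)); sample it and take the argmax.
            samples = [
                rng.gauss(values[a] * counts[a] / (counts[a] + 1),
                          1.0 / math.sqrt(counts[a] + 1))
                for a in range(k)
            ]
            arm = max(range(k), key=lambda a: samples[a])
        elif strategy == "boltzmann":
            # Softmax over estimates with temperature tau (shifted for stability).
            top = max(values)
            weights = [math.exp((v - top) / tau) for v in values]
            arm = rng.choices(range(k), weights=weights)[0]
        else:
            raise ValueError(f"unknown strategy: {strategy}")

        reward = rng.gauss(means[arm], 1.0)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        regret += best - means[arm]  # expected regret of this pull
    return regret


if __name__ == "__main__":
    setup = random.Random(0)
    arm_means = [setup.gauss(0.0, 1.0) for _ in range(8)]  # same bandit for all
    for name in STRATEGIES:
        total = run_bandit(name, arm_means, 2000, random.Random(1))
        print(f"{name:10s} cumulative regret: {total:.1f}")
```

Each strategy sees the same arm means, which mirrors the "same bandit" setup described above. Each run draws its own rewards from a separate `random.Random` instance, so results vary with the seed and a single run should not be read as a ranking of the strategies.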

Panels (one per strategy): ε-Greedy · UCB · Thompson · Boltzmann