Exploration-Exploitation Tradeoff
ε-Greedy · UCB · Thompson Sampling · Boltzmann — Compared Live
Arms k: 8
ε (greedy): 0.10
UCB c: 2.0
Boltzmann τ: 0.5
Steps/frame: 20
All four strategies run simultaneously on the same k-armed bandit. Regret = reward of the optimal arm − reward actually achieved.
Panels: ε-Greedy · UCB · Thompson · Boltzmann
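
The four strategies named above can be sketched offline as follows. This is a minimal illustration, not the demo's actual code. It assumes Bernoulli arms with hidden success probabilities, a UCB1-style bonus of c·sqrt(ln t / n), a Beta(1, 1) prior for Thompson sampling, and regret measured in expectation (best arm's mean minus the chosen arm's mean). All function and class names are hypothetical. The defaults mirror the control values shown on the page (k = 8, ε = 0.10, c = 2.0, τ = 0.5).

```python
import math
import random


class Agent:
    """Per-strategy statistics: pull counts, successes, running means, regret."""

    def __init__(self, k):
        self.n = [0] * k       # pulls per arm
        self.wins = [0] * k    # rewards of 1 per arm (used by Thompson)
        self.q = [0.0] * k     # running mean reward per arm
        self.regret = 0.0      # cumulative expected regret (assumption)

    def update(self, arm, reward):
        self.n[arm] += 1
        self.wins[arm] += reward
        self.q[arm] += (reward - self.q[arm]) / self.n[arm]


def argmax(values, rng):
    """Index of the largest value, breaking ties at random."""
    best = max(values)
    return rng.choice([i for i, v in enumerate(values) if v == best])


def eps_greedy(agent, t, rng, eps=0.10):
    # With probability eps pick a uniformly random arm, else the best mean.
    if rng.random() < eps:
        return rng.randrange(len(agent.q))
    return argmax(agent.q, rng)


def ucb(agent, t, rng, c=2.0):
    # Pull every arm once, then add an optimism bonus to each mean.
    for i, n in enumerate(agent.n):
        if n == 0:
            return i
    scores = [q + c * math.sqrt(math.log(t) / n)
              for q, n in zip(agent.q, agent.n)]
    return argmax(scores, rng)


def thompson(agent, t, rng):
    # Sample a plausible mean per arm from its Beta posterior; pick the best.
    samples = [rng.betavariate(1 + w, 1 + n - w)
               for w, n in zip(agent.wins, agent.n)]
    return argmax(samples, rng)


def boltzmann(agent, t, rng, tau=0.5):
    # Softmax over means with temperature tau (max subtracted for stability).
    m = max(agent.q)
    weights = [math.exp((q - m) / tau) for q in agent.q]
    return rng.choices(range(len(weights)), weights=weights)[0]


def run(k=8, steps=2000, seed=0):
    """Run all four strategies on the same set of arms; return arms and agents."""
    rng = random.Random(seed)
    probs = [rng.random() for _ in range(k)]  # hidden success probabilities
    best = max(probs)
    policies = {"eps-greedy": eps_greedy, "ucb": ucb,
                "thompson": thompson, "boltzmann": boltzmann}
    agents = {name: Agent(k) for name in policies}
    for t in range(1, steps + 1):
        for name, policy in policies.items():
            agent = agents[name]
            arm = policy(agent, t, rng)
            reward = 1 if rng.random() < probs[arm] else 0
            agent.update(arm, reward)
            agent.regret += best - probs[arm]
    return probs, agents


if __name__ == "__main__":
    probs, agents = run()
    for name, agent in agents.items():
        print(f"{name:>10}: cumulative regret = {agent.regret:.1f}")
```

Every strategy sees the same arm probabilities but draws its own rewards, so the regret curves are comparable without being identical sample paths. Changing `seed` plays the role of the page's New Arms button under these assumptions.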