Q-Learning
Bellman equation · ε-greedy exploration · Q-table convergence · live Q-values
Q-update (Bellman): Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') − Q(s,a)]. ε-greedy exploration: with probability ε take a random action, otherwise take argmax_a Q(s,a). Each cell draws one arrow per action; arrow thickness and color encode the learned Q-value: gold (brighter) = high Q, dark red = low.
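The update rule and ε-greedy policy above can be sketched as a minimal tabular Q-learning loop. The environment here is a hypothetical 1-D corridor (states 0 to 4, reward on reaching the right end) chosen for brevity; the demo's actual grid, rewards, and hyperparameters may differ.

```python
import random

# Illustrative assumptions, not taken from the demo:
N_STATES = 5          # corridor cells 0..4; cell 4 is the terminal goal
ACTIONS = [-1, +1]    # move left, move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

# Q-table initialized to zero for every (state, action) pair
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Move within the corridor; reward 1 on entering the goal cell."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r, s2 == N_STATES - 1

def choose(s):
    """epsilon-greedy: random action with probability epsilon, else argmax_a Q(s,a)."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

random.seed(0)
for _ in range(500):                 # training episodes
    s, done = 0, False
    while not done:
        a = choose(s)
        s2, r, done = step(s, a)
        # Bellman update: Q(s,a) += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2
```

After training, the Q-values follow the discounted pattern the visualization renders: the rightward action in the cell next to the goal converges toward 1, and each cell further away toward γ times its neighbor, which is why arrows brighten as they approach the reward.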