Q-Learning
Bellman equation · ε-greedy exploration · Q-table convergence · live Q-values
Q-update (Bellman): Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') − Q(s,a)]. ε-greedy exploration: with probability ε take a random action, otherwise take argmax_a Q(s,a). Each cell draws one arrow per action; arrow thickness and color encode the learned Q-value: gold (brighter) = high Q, dark red = low.
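The update rule and ε-greedy policy above can be sketched as a minimal tabular Q-learning loop. The environment here is a hypothetical 1-D corridor (states 0 to 4, reward on reaching the right end) chosen for brevity; the demo's actual grid, rewards, and hyperparameters may differ.

```python
import random

# Illustrative assumptions, not taken from the demo:
N_STATES = 5          # corridor cells 0..4; cell 4 is the terminal goal
ACTIONS = [-1, +1]    # move left, move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

# Q-table initialized to zero for every (state, action) pair
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Move within the corridor; reward 1 on entering the goal cell."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r, s2 == N_STATES - 1

def choose(s):
    """epsilon-greedy: random action with probability epsilon, else argmax_a Q(s,a)."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

random.seed(0)
for _ in range(500):                 # training episodes
    s, done = 0, False
    while not done:
        a = choose(s)
        s2, r, done = step(s, a)
        # Bellman update: Q(s,a) += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2
```

After training, the Q-values follow the discounted pattern the visualization renders: the rightward action in the cell next to the goal converges toward 1, and each cell further away toward γ times its neighbor, which is why arrows brighten as they approach the reward.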