Q-learning agent navigating a grid world — watch values and policy evolve through exploration.
Q-learning maintains a table of action values Q(s,a), updated with the temporal-difference rule derived from the Bellman optimality equation: Q(s,a) ← Q(s,a) + α[r + γ·max_a′ Q(s′,a′) − Q(s,a)]. An ε-greedy policy balances exploration against exploitation, with ε decaying over time as the agent converges on the optimal paths.
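The loop described above can be sketched in a few lines. This is a minimal illustration, not the demo's actual implementation: the 4×4 layout, rewards (−1 per step, +10 at the goal), and all hyperparameters are assumptions chosen for the sketch.

```python
import random

SIZE = 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
GOAL = (SIZE - 1, SIZE - 1)

def step(state, action):
    """Deterministic grid transition: moves clamp at the walls."""
    r, c = state
    dr, dc = action
    nxt = (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))
    if nxt == GOAL:
        return nxt, 10.0, True   # goal reward, episode ends
    return nxt, -1.0, False      # step cost

def train(episodes=500, alpha=0.5, gamma=0.95,
          eps=1.0, eps_decay=0.995, eps_min=0.05, seed=0):
    rng = random.Random(seed)
    # Q-table: one list of action values per grid cell
    Q = {(r, c): [0.0] * len(ACTIONS) for r in range(SIZE) for c in range(SIZE)}
    for _ in range(episodes):
        state, done = (0, 0), False
        while not done:
            # epsilon-greedy: explore with probability eps, else act greedily
            if rng.random() < eps:
                a = rng.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=lambda i: Q[state][i])
            nxt, reward, done = step(state, ACTIONS[a])
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            target = reward + (0.0 if done else gamma * max(Q[nxt]))
            Q[state][a] += alpha * (target - Q[state][a])
            state = nxt
        eps = max(eps_min, eps * eps_decay)  # decay exploration over time
    return Q

if __name__ == "__main__":
    Q = train()
    # Follow the greedy policy from the start; the shortest path is 6 steps
    state, steps = (0, 0), 0
    while state != GOAL and steps < 20:
        a = max(range(len(ACTIONS)), key=lambda i: Q[state][i])
        state, _, _ = step(state, ACTIONS[a])
        steps += 1
    print(steps)
```

After training, the greedy rollout traces the learned policy; because the environment is deterministic, the Q-values on the visited cells settle to the discounted returns, and the rollout follows a shortest path to the goal.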