Q-learning agent navigating a grid world — watch values and policy evolve through exploration.
Q-learning maintains a table of action values Q(s,a), updated with the temporal-difference rule derived from the Bellman optimality equation: Q(s,a) ← Q(s,a) + α[r + γ·max_a′ Q(s′,a′) − Q(s,a)]. An ε-greedy policy balances exploration against exploitation, with ε decaying over time as the agent converges on the optimal paths.
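The loop described above can be sketched in a few lines. This is a minimal illustration, not the demo's actual implementation: the 4×4 layout, rewards (−1 per step, +10 at the goal), and all hyperparameters are assumptions chosen for the sketch.

```python
import random

SIZE = 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
GOAL = (SIZE - 1, SIZE - 1)

def step(state, action):
    """Deterministic grid transition: moves clamp at the walls."""
    r, c = state
    dr, dc = action
    nxt = (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))
    if nxt == GOAL:
        return nxt, 10.0, True   # goal reward, episode ends
    return nxt, -1.0, False      # step cost

def train(episodes=500, alpha=0.5, gamma=0.95,
          eps=1.0, eps_decay=0.995, eps_min=0.05, seed=0):
    rng = random.Random(seed)
    # Q-table: one list of action values per grid cell
    Q = {(r, c): [0.0] * len(ACTIONS) for r in range(SIZE) for c in range(SIZE)}
    for _ in range(episodes):
        state, done = (0, 0), False
        while not done:
            # epsilon-greedy: explore with probability eps, else act greedily
            if rng.random() < eps:
                a = rng.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=lambda i: Q[state][i])
            nxt, reward, done = step(state, ACTIONS[a])
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            target = reward + (0.0 if done else gamma * max(Q[nxt]))
            Q[state][a] += alpha * (target - Q[state][a])
            state = nxt
        eps = max(eps_min, eps * eps_decay)  # decay exploration over time
    return Q

if __name__ == "__main__":
    Q = train()
    # Follow the greedy policy from the start; the shortest path is 6 steps
    state, steps = (0, 0), 0
    while state != GOAL and steps < 20:
        a = max(range(len(ACTIONS)), key=lambda i: Q[state][i])
        state, _, _ = step(state, ACTIONS[a])
        steps += 1
    print(steps)
```

After training, the greedy rollout traces the learned policy; because the environment is deterministic, the Q-values on the visited cells settle to the discounted returns, and the rollout follows a shortest path to the goal.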