Q-learning (Watkins, 1989) is an off-policy temporal-difference algorithm that learns the optimal action-value function Q*(s,a) without requiring a model of the environment. The update rule is Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') − Q(s,a)], where the bracketed term is the TD error. The key insight is bootstrapping from the greedy max over next-state action values, which makes the method off-policy: it converges toward the optimal policy regardless of the exploration strategy used to generate experience.

The Q-table cells are colored by their maximum Q-value: warm colors indicate high-value states the agent has learned to prefer. Watch the value function propagate from the goal (cyan star) backward through the maze.
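The update rule can be sketched in a few lines of tabular code. This is a minimal illustration, not the visualization's actual implementation: it uses a hypothetical four-state chain (goal at state 3, actions 0=left, 1=right) rather than a maze, with assumed values α=0.5, γ=0.9, and ε=0.2 for ε-greedy exploration.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    # TD error bootstraps from the greedy max over next-state actions (off-policy).
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical 1-D chain: states 0..3, goal at state 3; actions 0=left, 1=right.
n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

for episode in range(200):
    s = 0
    while s != 3:  # goal state is terminal
        # ε-greedy behavior policy; learning target is greedy regardless.
        a = int(rng.integers(n_actions)) if rng.random() < 0.2 else int(np.argmax(Q[s]))
        s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == 3 else 0.0
        q_update(Q, s, a, r, s_next)
        s = s_next

# Values propagate backward from the goal, decaying by γ per step:
# Q[2,1] → 1.0, Q[1,1] → 0.9, Q[0,1] → 0.81.
print(np.round(Q[:, 1], 3))
```

Because the target uses max_a' Q(s',a') rather than the action the behavior policy actually takes next, the learned values approach the optimal ones (1, γ, γ², …) even though 20% of the moves are random, which is exactly the backward propagation the maze animation shows.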