MDP Value Iteration — Grid World

Value iteration solves the Bellman optimality equation V*(s) = max_a Σ_s' P(s'|s,a)[R(s,a,s') + γV*(s')] by repeated sweeps over the state space, applying this backup until the values stop changing. Optimal policy arrows emerge as value propagates outward from high-reward states: each state's greedy action points toward the neighbor with the highest backed-up value.
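The sweep described above can be sketched in a few lines. This is a minimal illustration, not the demo's actual implementation: it assumes a hypothetical 4x4 deterministic grid (so P(s'|s,a) is 1 for a single successor), a single +1 terminal goal, and γ = 0.9; all names (`GAMMA`, `rewards`, `step`, `value_iteration`) are invented for the example.

```python
import numpy as np

# Assumed setup: 4x4 grid, one +1 terminal goal at the top-right corner.
GAMMA = 0.9
ROWS, COLS = 4, 4
rewards = np.zeros((ROWS, COLS))
rewards[0, 3] = 1.0           # goal state
terminal = {(0, 3)}

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(r, c, a):
    """Deterministic transition; moving off the grid leaves the state unchanged."""
    nr, nc = r + a[0], c + a[1]
    return (nr, nc) if 0 <= nr < ROWS and 0 <= nc < COLS else (r, c)

def value_iteration(tol=1e-6):
    V = np.zeros((ROWS, COLS))
    while True:
        delta = 0.0
        for r in range(ROWS):
            for c in range(COLS):
                if (r, c) in terminal:
                    continue          # terminal value stays 0; reward is paid on entry
                # Bellman optimality backup: max over actions of R(s') + gamma * V(s')
                best = max(rewards[step(r, c, a)] + GAMMA * V[step(r, c, a)]
                           for a in ACTIONS)
                delta = max(delta, abs(best - V[r, c]))
                V[r, c] = best        # in-place (Gauss-Seidel) update
        if delta < tol:
            return V

V = value_iteration()
```

With deterministic moves and a single goal, V*(s) converges to γ^(d-1), where d is the Manhattan distance from s to the goal, which is exactly the "value propagating outward" pattern the arrows visualize.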