POLICY GRADIENT

REINFORCE: learning to act by maximizing expected return

Policy gradient methods optimize a stochastic policy π(a|s;θ) directly by following the gradient of the expected cumulative reward. The REINFORCE algorithm (Williams, 1992) estimates this gradient as ∇J(θ) = E[G_t · ∇log π(a_t|s_t;θ)], where G_t is the discounted return. Intuitively, actions that led to high returns have their probabilities increased, while the probabilities of low-return actions are decreased. The discount factor γ controls how much future rewards matter relative to immediate ones. An entropy bonus encourages exploration by penalizing overconfident policies. This visualization shows an agent (gold dot) navigating a 2D field to reach targets (cyan) while avoiding hazards (red), with the policy distribution rendered as a directional field.
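The update rule above can be sketched in a few lines of Python. The example below is a minimal, hypothetical illustration (not the visualization's actual code): a softmax policy over two actions in a one-step bandit, so the return G_t is just the immediate reward and discounting and the entropy bonus are omitted for brevity. A running-mean baseline is subtracted from the return, a standard variance-reduction trick. The reward values, learning rate, and step count are all illustrative assumptions.

```python
import math
import random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical 2-armed bandit: action 0 pays 1.0, action 1 pays 0.2.
REWARDS = [1.0, 0.2]

theta = [0.0, 0.0]   # policy logits (the parameters of pi(a; theta))
alpha = 0.1          # learning rate (illustrative value)
baseline = 0.0       # running mean of returns, subtracted for variance reduction

for step in range(2000):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]   # sample a ~ pi(.|theta)
    G = REWARDS[a]                                 # one-step return
    advantage = G - baseline
    baseline += 0.05 * (G - baseline)
    # REINFORCE update: for a softmax policy,
    # d/d theta_i  log pi(a) = 1[i == a] - pi(i)
    for i in range(2):
        grad_log = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += alpha * advantage * grad_log

final_probs = softmax(theta)
print(final_probs)   # probability mass should concentrate on the better arm
```

After training, the policy places most of its probability on the higher-paying arm, matching the intuition in the paragraph above: high-return actions are reinforced.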