Actor-critic methods combine policy gradients (the actor) with value-function approximation (the critic). The critic estimates the state value V(s), and its one-step TD error, δ = r + γV(s′) − V(s), serves as an estimate of the advantage A(s,a). The actor updates its policy using this advantage instead of raw returns, which dramatically reduces variance at the cost of a manageable bias. The two-network architecture enables stable online learning: the critic provides a low-variance baseline while the actor explores and improves. Modern variants such as A3C, PPO, and SAC all build on this foundation. Here a cart-pole-inspired agent balances a pendulum; the heatmap shows the critic's learned value function across the state space.
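The actor-critic loop described above can be sketched in a few lines. The following is a minimal tabular version on a hypothetical two-state chain MDP (a toy stand-in, not the cart-pole demo): the critic learns V(s) by TD(0), the TD error doubles as the advantage estimate, and the actor takes a policy-gradient step on softmax logits. All environment details, hyperparameters, and variable names here are illustrative assumptions.

```python
import math
import random

# Toy two-state chain MDP (illustrative assumption, not the cart-pole demo).
# States: 0, 1. Actions: 0 = "stay", 1 = "advance".
# Taking action 1 in state 1 yields reward +1 and resets to state 0.

random.seed(0)
n_states, n_actions = 2, 2
theta = [[0.0] * n_actions for _ in range(n_states)]  # actor: policy logits
V = [0.0] * n_states                                  # critic: state values
gamma, alpha_actor, alpha_critic = 0.9, 0.1, 0.2

def policy(s):
    """Softmax over the logits for state s."""
    z = [math.exp(theta[s][a]) for a in range(n_actions)]
    total = sum(z)
    return [p / total for p in z]

def env_step(s, a):
    """Deterministic toy dynamics: action 1 advances, action 0 resets."""
    if s == 1 and a == 1:
        return 0, 1.0  # completing the chain pays off and resets
    return (1 if a == 1 else 0), 0.0

s = 0
for _ in range(2000):
    probs = policy(s)
    a = random.choices(range(n_actions), weights=probs)[0]
    s_next, r = env_step(s, a)
    # TD error doubles as the advantage estimate: A(s,a) ≈ r + γV(s') − V(s)
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha_critic * td_error        # critic update (TD(0))
    for b in range(n_actions):             # actor: ∇ log π step scaled by δ
        grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha_actor * td_error * grad_log_pi
    s = s_next

# After training, the policy should prefer "advance" in both states,
# and the critic should value state 1 (one step from reward) above state 0.
print(policy(0)[1], policy(1)[1], V[0], V[1])
```

The same structure scales up directly: replace the tables `theta` and `V` with neural networks and the toy dynamics with a real environment, and you have the skeleton that A3C and PPO refine with parallelism, clipping, and entropy bonuses.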