Actor-Critic with Advantage

The advantage A(s,a) = Q(s,a) − V(s) measures how much better an action is than the state's average value, and drives policy updates

Actor (π): a policy network updated by the advantage-weighted policy gradient. Critic (V): a value function trained on the TD error δ = r + γV(s') − V(s). The one-step advantage estimate is A ≈ δ.
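The update rule above can be sketched end to end. This is a minimal, hypothetical example (not the page's actual implementation): a tabular softmax actor and a scalar critic on a toy one-state episodic task where action 1 pays reward 1 and action 0 pays 0, so the advantage-weighted gradient should shift the policy toward action 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy task: a single non-terminal state, two actions.
# Action 1 yields reward +1, action 0 yields 0; the episode then ends.
n_actions = 2
theta = np.zeros(n_actions)   # actor: softmax preferences for the one state
v = 0.0                       # critic: V(s) for the one state
gamma = 0.99
alpha_actor, alpha_critic = 0.1, 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for episode in range(2000):
    pi = softmax(theta)
    a = rng.choice(n_actions, p=pi)
    r = 1.0 if a == 1 else 0.0
    # Next state is terminal, so V(s') = 0 and the TD error reduces to r - V(s).
    delta = r + gamma * 0.0 - v               # δ = r + γV(s') − V(s), one-step advantage
    v += alpha_critic * delta                 # critic: move V(s) toward the TD target
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0                     # ∇_θ log π(a|s) for a tabular softmax
    theta += alpha_actor * delta * grad_log_pi  # actor: advantage-weighted gradient step

print(softmax(theta), v)
```

After training, the policy concentrates on the rewarding action and the critic's value approaches the expected return under that policy. With function approximation the same δ would instead scale the gradients of the policy and value networks.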
Chart legend: Critic V(s) · Advantage A>0 · Advantage A≤0