Actor-Critic with Advantage
A(s,a) = Q(s,a) − V(s) drives policy updates
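The demo estimates this advantage from single transitions rather than from a learned Q. A short worked step (a standard identity, not spelled out on this page) shows why the one-step TD error δ = r + γV(s′) − V(s) can serve as the estimate when V is the true value function V^π:

\[
A(s,a) \;=\; Q(s,a) - V^{\pi}(s)
      \;=\; \mathbb{E}\big[\, r + \gamma V^{\pi}(s') \mid s,a \,\big] - V^{\pi}(s)
      \;=\; \mathbb{E}\big[\, \delta \mid s,a \,\big].
\]

With a learned critic V ≈ V^π, a sampled δ is therefore a cheap (if noisy and slightly biased) advantage estimate, which is what the one-step approximation described below relies on.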
α_actor: 0.05 (actor learning rate)
α_critic: 0.10 (critic learning rate)
γ: 0.95 (discount factor)
Steps/frame: 20 (training updates per rendered frame)
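For readers following along in code, one way the panel's settings might be collected (the names and structure are illustrative, not taken from the demo's source):

from dataclasses import dataclass

@dataclass
class A2CConfig:
    alpha_actor: float = 0.05    # actor (policy) learning rate
    alpha_critic: float = 0.10   # critic (value) learning rate
    gamma: float = 0.95          # discount factor
    steps_per_frame: int = 20    # training updates performed per rendered frame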
Demo controls: Train / Reset. Live readouts: step count, average return, last advantage.
Actor (π): the policy network, updated by an advantage-weighted policy gradient. Critic (V): the value function, trained on the TD error δ = r + γV(s′) − V(s). The advantage is approximated by the one-step TD error, A ≈ δ.
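A minimal runnable sketch of this update rule, assuming a tabular critic, a softmax actor, and a hypothetical 5-state chain environment (the demo's actual environment and function approximators are not described on this page); only the hyperparameters and the one-step advantage update follow the text above:

# One-step advantage actor-critic sketch (tabular critic, softmax actor).
# The toy chain environment is an assumption for illustration only.
import numpy as np

N_STATES, N_ACTIONS = 5, 2                              # toy chain: move left (0) or right (1)
ALPHA_ACTOR, ALPHA_CRITIC, GAMMA = 0.05, 0.10, 0.95     # values from the panel above

theta = np.zeros((N_STATES, N_ACTIONS))   # actor: softmax policy parameters
V = np.zeros(N_STATES)                    # critic: state-value table

def policy(s):
    prefs = theta[s] - theta[s].max()     # numerically stable softmax
    p = np.exp(prefs)
    return p / p.sum()

def step(s, a):
    # Toy dynamics: reward 1 for reaching the right end, where the episode terminates.
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

rng = np.random.default_rng(0)
for episode in range(500):
    s, done = 0, False
    while not done:
        p = policy(s)
        a = rng.choice(N_ACTIONS, p=p)
        s_next, r, done = step(s, a)

        # One-step TD error doubles as the advantage estimate: A ≈ δ.
        target = r if done else r + GAMMA * V[s_next]
        delta = target - V[s]

        # Critic: move V(s) toward the TD target.
        V[s] += ALPHA_CRITIC * delta

        # Actor: advantage-weighted policy-gradient step; ∇θ log π(a|s) = onehot(a) − π(·|s).
        grad_log_pi = -p
        grad_log_pi[a] += 1.0
        theta[s] += ALPHA_ACTOR * delta * grad_log_pi

        s = s_next

print("V:", np.round(V, 2))
print("P(right) per state:", np.round([policy(s)[1] for s in range(N_STATES)], 2))

The actor and critic share the same δ: the critic regresses V(s) toward the TD target, while the actor raises log π(a|s) when the signed advantage is positive and lowers it otherwise, matching the A > 0 / A ≤ 0 distinction in the plot legend below.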
Plot legend: Critic V(s); Advantage A > 0; Advantage A ≤ 0.