Actor-Critic with Advantage
A(s,a) = Q(s,a) − V(s) drives policy updates
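The demo estimates this advantage from single transitions rather than from a learned Q. A short worked step (a standard identity, not spelled out on this page) shows why the one-step TD error δ = r + γV(s′) − V(s) can serve as the estimate when V is the true value function V^π:

\[
A(s,a) \;=\; Q(s,a) - V^{\pi}(s)
      \;=\; \mathbb{E}\big[\, r + \gamma V^{\pi}(s') \mid s,a \,\big] - V^{\pi}(s)
      \;=\; \mathbb{E}\big[\, \delta \mid s,a \,\big].
\]

With a learned critic V ≈ V^π, a sampled δ is therefore a cheap (if noisy and slightly biased) advantage estimate, which is what the one-step approximation described below relies on.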
α_actor: 0.05 (actor learning rate)
α_critic: 0.10 (critic learning rate)
γ: 0.95 (discount factor)
Steps/frame: 20 (training updates per rendered frame)
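For readers following along in code, one way the panel's settings might be collected (the names and structure are illustrative, not taken from the demo's source):

from dataclasses import dataclass

@dataclass
class A2CConfig:
    alpha_actor: float = 0.05    # actor (policy) learning rate
    alpha_critic: float = 0.10   # critic (value) learning rate
    gamma: float = 0.95          # discount factor
    steps_per_frame: int = 20    # training updates performed per rendered frame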
Demo controls: Train / Reset. Live readouts: step count, average return, last advantage.
Actor (π): the policy network, updated by an advantage-weighted policy gradient. Critic (V): the value function, trained on the TD error δ = r + γV(s′) − V(s). The advantage is approximated by the one-step TD error, A ≈ δ.
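A minimal runnable sketch of this update rule, assuming a tabular critic, a softmax actor, and a hypothetical 5-state chain environment (the demo's actual environment and function approximators are not described on this page); only the hyperparameters and the one-step advantage update follow the text above:

# One-step advantage actor-critic sketch (tabular critic, softmax actor).
# The toy chain environment is an assumption for illustration only.
import numpy as np

N_STATES, N_ACTIONS = 5, 2                              # toy chain: move left (0) or right (1)
ALPHA_ACTOR, ALPHA_CRITIC, GAMMA = 0.05, 0.10, 0.95     # values from the panel above

theta = np.zeros((N_STATES, N_ACTIONS))   # actor: softmax policy parameters
V = np.zeros(N_STATES)                    # critic: state-value table

def policy(s):
    prefs = theta[s] - theta[s].max()     # numerically stable softmax
    p = np.exp(prefs)
    return p / p.sum()

def step(s, a):
    # Toy dynamics: reward 1 for reaching the right end, where the episode terminates.
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

rng = np.random.default_rng(0)
for episode in range(500):
    s, done = 0, False
    while not done:
        p = policy(s)
        a = rng.choice(N_ACTIONS, p=p)
        s_next, r, done = step(s, a)

        # One-step TD error doubles as the advantage estimate: A ≈ δ.
        target = r if done else r + GAMMA * V[s_next]
        delta = target - V[s]

        # Critic: move V(s) toward the TD target.
        V[s] += ALPHA_CRITIC * delta

        # Actor: advantage-weighted policy-gradient step; ∇θ log π(a|s) = onehot(a) − π(·|s).
        grad_log_pi = -p
        grad_log_pi[a] += 1.0
        theta[s] += ALPHA_ACTOR * delta * grad_log_pi

        s = s_next

print("V:", np.round(V, 2))
print("P(right) per state:", np.round([policy(s)[1] for s in range(N_STATES)], 2))

The actor and critic share the same δ: the critic regresses V(s) toward the TD target, while the actor raises log π(a|s) when the signed advantage is positive and lowers it otherwise, matching the A > 0 / A ≤ 0 distinction in the plot legend below.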
Plot legend: Critic V(s); Advantage A > 0; Advantage A ≤ 0.