REINFORCE Policy Gradient
Monte Carlo Policy Gradient — Pendulum Balancing
Learning Rate α: 0.01
γ (discount): 0.99
Episodes/update: 5
Max steps/ep: 200
Temperature τ: 1.0
Train
Reset
Episodes: 0
Avg Return: —
θ[0]: —
θ[1]: —
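Pendulum state: angle θ and angular velocity ω. Actions: push left or push right. Policy: softmax over linear features of the state, scaled by the temperature τ. REINFORCE updates the policy parameters θ[0] and θ[1] in the direction ∇ log π(a|s) · G, where G is the discounted return from that step onward. Note that the parameter vector θ shares its symbol with the pendulum angle but is a separate quantity.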
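The update described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the demo's actual code: the feature vector is taken to be (angle, ω), the two actions are encoded as -1 and +1, and the action preference is assumed to be a · (θ · φ) / τ. The function names (`action_probs`, `grad_log_pi`, `discounted_returns`, `reinforce_update`) are hypothetical, and the pendulum dynamics are left out.

```python
import numpy as np

# Assumed action encoding: index 0 = push left (-1), index 1 = push right (+1).
ACTIONS = np.array([-1.0, 1.0])

def action_probs(theta, phi, tau=1.0):
    """Softmax over assumed preferences h(s, a) = a * (theta . phi) / tau."""
    prefs = ACTIONS * np.dot(theta, phi) / tau
    prefs = prefs - prefs.max()  # shift for numerical stability
    expd = np.exp(prefs)
    return expd / expd.sum()

def grad_log_pi(theta, phi, action_idx, tau=1.0):
    """Gradient of log pi(a|s) with respect to theta for the softmax above."""
    probs = action_probs(theta, phi, tau)
    expected_action = np.dot(probs, ACTIONS)
    return (ACTIONS[action_idx] - expected_action) * phi / tau

def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_update(theta, episodes, alpha=0.01, gamma=0.99, tau=1.0):
    """One gradient-ascent step averaged over a batch of episodes.

    episodes: list of (features, action_indices, rewards), one tuple per episode.
    """
    grad = np.zeros_like(theta)
    for phis, actions, rewards in episodes:
        G = discounted_returns(rewards, gamma)
        for phi, a, g in zip(phis, actions, G):
            grad += grad_log_pi(theta, phi, a, tau) * g
    return theta + alpha * grad / len(episodes)
```

With the default settings shown in the controls, a batch would hold 5 episodes of up to 200 steps each, and `reinforce_update` would be called once per batch. No baseline is subtracted from G here, so the gradient estimate is likely to be noisy.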