DOPAMINE & REWARD PREDICTION ERROR

temporal difference learning, Schultz experiment, RPE = R + γV(s') − V(s)
TD Learning:
δ = R + γV(s') − V(s)
V(s) ← V(s) + α·δ
(δ: reward prediction error, R: reward, γ: discount factor, α: learning rate, s': next state)
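
As a minimal sketch of the two update lines above, the step can be written as a single function. The name `td_update` and the default values `alpha=0.1` and `gamma=0.9` are illustrative assumptions, not taken from the source.

```python
def td_update(v_s, v_next, reward, alpha=0.1, gamma=0.9):
    """One TD(0) step: return (delta, updated V(s)).

    alpha (learning rate) and gamma (discount factor) defaults are
    assumed values chosen only for illustration.
    """
    # Prediction error: delta = R + gamma * V(s') - V(s)
    delta = reward + gamma * v_next - v_s
    # Move V(s) a fraction alpha of the way toward the TD target
    return delta, v_s + alpha * delta


# First exposure to an unexpected reward: nothing is predicted yet,
# so the whole reward shows up as prediction error.
delta, v_new = td_update(v_s=0.0, v_next=0.0, reward=1.0)
print(delta, v_new)
```

With all values starting at zero, the error equals the reward and V(s) moves only a fraction α toward it, which is why learning takes many trials.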

Schultz (1997) found that
DA neurons fire to:
• Unexpected reward
• Reward-predicting cue
(after learning, the predicted reward itself no longer evokes a burst)
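
The shift of the response from reward to cue can be reproduced with a small TD simulation. This is a sketch under stated assumptions, not the original experiment or any particular published model: the function `simulate`, the trial count, and the parameter values are hypothetical, and the pre-cue baseline value is held at 0 on the assumption that cue onset is unpredictable.

```python
def simulate(n_trials=200, alpha=0.1, gamma=1.0, reward=1.0):
    """Cue -> reward conditioning with TD(0).

    Assumption: the cue arrives at an unpredictable time, so the
    pre-cue baseline value stays at 0 and is never updated.
    """
    v_baseline = 0.0  # fixed: cue onset cannot be predicted
    v_cue = 0.0       # learned value of the cue state
    history = []      # (RPE at cue, RPE at reward) for each trial
    for _ in range(n_trials):
        # RPE at cue onset: no reward yet, value jumps from baseline to V(cue)
        rpe_cue = 0.0 + gamma * v_cue - v_baseline
        # RPE at reward delivery: the trial ends, so the next value is 0
        rpe_reward = reward + gamma * 0.0 - v_cue
        # Only the cue value is learned
        v_cue += alpha * rpe_reward
        history.append((rpe_cue, rpe_reward))
    return v_cue, history


v_cue, history = simulate()
for label, (rpe_cue, rpe_reward) in (("first", history[0]), ("last", history[-1])):
    print(f"{label} trial: RPE(cue)={rpe_cue:.2f}  RPE(reward)={rpe_reward:.2f}")
```

On the first trial the error sits entirely at the reward. By the last trial V(cue) has converged toward the reward size, so the error at reward delivery is near zero and the error at cue onset is near the full reward value, mirroring the two bullet points above.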
