Gradient Descent Optimizers
SGD, Momentum, Adam racing across a 2D loss landscape
[Interactive controls: learning rate α = 0.05; momentum β = 0.9; stochastic noise σ = 0.3; buttons for Reset / Randomize start, Pause / Resume, and New landscape.]
SGD: θ ← θ − α∇L + noise
Momentum: v ← βv − α∇L; θ ← θ + v
Adam: per-parameter adaptive step using bias-corrected first and second moment estimates, θ ← θ − α·m̂ / (√v̂ + ε)
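
The three updates are compact enough to compare side by side. Below is a minimal Python sketch, not the demo's actual source: the quadratic landscape, starting point, and step count are illustrative assumptions, while α, β, and σ take the control defaults above and β₁, β₂, ε use Adam's standard constants.

import numpy as np

# Toy 2D loss: an elongated quadratic bowl (illustrative stand-in for
# the demo's randomized landscape).
def loss(theta):
    x, y = theta
    return 0.5 * (x**2 + 10.0 * y**2)

def grad(theta):
    x, y = theta
    return np.array([x, 10.0 * y])

rng = np.random.default_rng(0)
alpha, beta, sigma = 0.05, 0.9, 0.3   # control defaults from above
beta1, beta2, eps = 0.9, 0.999, 1e-8  # standard Adam constants

theta_sgd = theta_mom = theta_adam = np.array([4.0, 3.0])  # shared start
v = np.zeros(2)   # momentum velocity
m = np.zeros(2)   # Adam first-moment estimate
s = np.zeros(2)   # Adam second-moment estimate

for t in range(1, 201):
    # Plain SGD: step along the noisy gradient.
    g = grad(theta_sgd) + sigma * rng.standard_normal(2)
    theta_sgd = theta_sgd - alpha * g

    # Momentum: accumulate a velocity, then move along it.
    v = beta * v - alpha * grad(theta_mom)
    theta_mom = theta_mom + v

    # Adam: bias-corrected moments scale each parameter's step.
    g = grad(theta_adam)
    m = beta1 * m + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    s_hat = s / (1 - beta2**t)
    theta_adam = theta_adam - alpha * m_hat / (np.sqrt(s_hat) + eps)

print(f"loss after 200 steps  SGD: {loss(theta_sgd):.4f}  "
      f"momentum: {loss(theta_mom):.4f}  Adam: {loss(theta_adam):.4f}")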
Typical behavior in the race: Adam often crosses flat, low-gradient regions fastest because its per-parameter scaling enlarges steps where gradients are small; momentum tends to overshoot and oscillate in sharp, narrow valleys; and the gradient noise in plain SGD can help it escape shallow local minima.