Gradient Descent Optimizers

[Interactive demo: SGD (plain), SGD + Momentum, and Adam racing across a 2D loss landscape; the controls let you toggle each optimizer on or off.]
SGD: θ ← θ − α∇L̂(θ), where ∇L̂ is a noisy minibatch estimate of the full gradient
Momentum: v ← βv − α∇L̂(θ); θ ← θ + v
Adam: θ ← θ − α·m̂/(√v̂ + ε), an adaptive per-parameter step built from bias-corrected first (m̂) and second (v̂) moment estimates of the gradient
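
A minimal sketch of these three updates in Python with NumPy, applied to a toy elongated quadratic bowl, is below. The loss function, the hyperparameter values, and the step counts are illustrative assumptions, not settings taken from the demo:

    import numpy as np

    def loss_grad(theta):
        # Gradient of the toy bowl L(x, y) = x^2 + 10*y^2 (an assumed example loss).
        return np.array([2.0 * theta[0], 20.0 * theta[1]])

    def sgd(theta, alpha=0.05, steps=200, noise=0.1):
        rng = np.random.default_rng(0)
        for _ in range(steps):
            g = loss_grad(theta) + noise * rng.standard_normal(2)  # noisy gradient estimate
            theta = theta - alpha * g
        return theta

    def momentum(theta, alpha=0.05, beta=0.9, steps=200):
        v = np.zeros_like(theta)
        for _ in range(steps):
            v = beta * v - alpha * loss_grad(theta)  # accumulate velocity
            theta = theta + v
        return theta

    def adam(theta, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
        m, v = np.zeros_like(theta), np.zeros_like(theta)
        for t in range(1, steps + 1):
            g = loss_grad(theta)
            m = beta1 * m + (1 - beta1) * g        # first moment (mean) estimate
            v = beta2 * v + (1 - beta2) * g * g    # second moment (uncentered variance) estimate
            m_hat = m / (1 - beta1 ** t)           # bias corrections for zero initialization
            v_hat = v / (1 - beta2 ** t)
            theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
        return theta

    start = np.array([4.0, 2.0])
    for name, opt in [("SGD", sgd), ("Momentum", momentum), ("Adam", adam)]:
        print(name, opt(start.copy()))

Note how much state each method carries: SGD none, momentum a single velocity vector, and Adam two moment accumulators per parameter, which is the extra memory you pay for its adaptive step sizes.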

Rules of thumb the race tends to illustrate: Adam often makes quick progress in flat or poorly scaled regions because it rescales each coordinate's step; momentum can overshoot sharp valleys before its velocity decays; and SGD's gradient noise can help it escape shallow local minima.
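
The overshoot claim is easy to see even in one dimension. Here is a toy sketch (the steep quadratic L(x) = 50x² and the hyperparameters are assumptions chosen for illustration): plain gradient descent decays monotonically toward the minimum, while the momentum iterate builds up velocity and crosses it:

    # Steep 1D valley L(x) = 50*x^2, so grad(x) = 100*x (an assumed toy loss).
    def grad(x):
        return 100.0 * x

    x_gd, x_mom, v = 1.0, 1.0, 0.0
    alpha, beta = 0.005, 0.9
    for t in range(8):
        x_gd = x_gd - alpha * grad(x_gd)        # plain step: shrinks monotonically toward 0
        v = beta * v - alpha * grad(x_mom)      # velocity keeps accumulating...
        x_mom = x_mom + v                       # ...and carries the iterate past the minimum
        print(f"step {t}: gd={x_gd:+.3f}  momentum={x_mom:+.3f}")

In the printout the momentum iterate changes sign several times before settling, while plain gradient descent only shrinks toward zero; that ringing is the same overshoot the 2D demo makes visible.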