Gradient Descent Optimizers

[Interactive demo: SGD (plain), SGD + Momentum, and Adam racing across a 2D loss landscape; the controls let you toggle each optimizer on or off.]
SGD: θ ← θ − α∇L̂(θ), where ∇L̂ is a noisy minibatch estimate of the full gradient
Momentum: v ← βv − α∇L̂(θ); θ ← θ + v
Adam: θ ← θ − α·m̂/(√v̂ + ε), an adaptive per-parameter step built from bias-corrected first (m̂) and second (v̂) moment estimates of the gradient
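
A minimal sketch of these three updates in Python with NumPy, applied to a toy elongated quadratic bowl, is below. The loss function, the hyperparameter values, and the step counts are illustrative assumptions, not settings taken from the demo:

    import numpy as np

    def loss_grad(theta):
        # Gradient of the toy bowl L(x, y) = x^2 + 10*y^2 (an assumed example loss).
        return np.array([2.0 * theta[0], 20.0 * theta[1]])

    def sgd(theta, alpha=0.05, steps=200, noise=0.1):
        rng = np.random.default_rng(0)
        for _ in range(steps):
            g = loss_grad(theta) + noise * rng.standard_normal(2)  # noisy gradient estimate
            theta = theta - alpha * g
        return theta

    def momentum(theta, alpha=0.05, beta=0.9, steps=200):
        v = np.zeros_like(theta)
        for _ in range(steps):
            v = beta * v - alpha * loss_grad(theta)  # accumulate velocity
            theta = theta + v
        return theta

    def adam(theta, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
        m, v = np.zeros_like(theta), np.zeros_like(theta)
        for t in range(1, steps + 1):
            g = loss_grad(theta)
            m = beta1 * m + (1 - beta1) * g        # first moment (mean) estimate
            v = beta2 * v + (1 - beta2) * g * g    # second moment (uncentered variance) estimate
            m_hat = m / (1 - beta1 ** t)           # bias corrections for zero initialization
            v_hat = v / (1 - beta2 ** t)
            theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
        return theta

    start = np.array([4.0, 2.0])
    for name, opt in [("SGD", sgd), ("Momentum", momentum), ("Adam", adam)]:
        print(name, opt(start.copy()))

Note how much state each method carries: SGD none, momentum a single velocity vector, and Adam two moment accumulators per parameter, which is the extra memory you pay for its adaptive step sizes.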

Rules of thumb the race tends to illustrate: Adam often makes quick progress in flat or poorly scaled regions because it rescales each coordinate's step; momentum can overshoot sharp valleys before its velocity decays; and SGD's gradient noise can help it escape shallow local minima.
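
The overshoot claim is easy to see even in one dimension. Here is a toy sketch (the steep quadratic L(x) = 50x² and the hyperparameters are assumptions chosen for illustration): plain gradient descent decays monotonically toward the minimum, while the momentum iterate builds up velocity and crosses it:

    # Steep 1D valley L(x) = 50*x^2, so grad(x) = 100*x (an assumed toy loss).
    def grad(x):
        return 100.0 * x

    x_gd, x_mom, v = 1.0, 1.0, 0.0
    alpha, beta = 0.005, 0.9
    for t in range(8):
        x_gd = x_gd - alpha * grad(x_gd)        # plain step: shrinks monotonically toward 0
        v = beta * v - alpha * grad(x_mom)      # velocity keeps accumulating...
        x_mom = x_mom + v                       # ...and carries the iterate past the minimum
        print(f"step {t}: gd={x_gd:+.3f}  momentum={x_mom:+.3f}")

In the printout the momentum iterate changes sign several times before settling, while plain gradient descent only shrinks toward zero; that ringing is the same overshoot the 2D demo makes visible.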