SGD on a Non-Convex Loss Landscape

Stochastic gradient descent (SGD) estimates the gradient from a mini-batch, so every update carries noise of magnitude σ. In the Langevin picture this noise plays the role of temperature, which lets the iterate escape local minima. More noise (smaller batch size) can improve generalization by biasing the search toward "flat" minima.
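
The temperature analogy can be made concrete with a short derivation. This is a sketch under assumptions the text does not state: the gradient noise is modeled as isotropic Gaussian with standard deviation σ, the learning rate η is treated as the time step, and the symbols x_t, ξ_t and T are introduced here for illustration only.

```latex
% Noisy gradient step (assumed form): the noise is added to the gradient.
x_{t+1} = x_t - \eta \left( \nabla L(x_t) + \sigma \, \xi_t \right),
\qquad \xi_t \sim \mathcal{N}(0, 1)

% Euler discretization of overdamped Langevin dynamics with time step \eta:
x_{t+1} = x_t - \eta \, \nabla L(x_t) + \sqrt{2 T \eta} \; \xi_t

% Matching the noise terms, \eta \sigma = \sqrt{2 T \eta}, gives
T = \frac{\eta \, \sigma^2}{2}
```

Under this reading, raising σ or η raises the effective temperature, which is consistent with the claim that more noise makes escape from a local minimum more likely.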

Learning rate η: 0.020 | Noise σ: 0.050 | Particles: 5
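
The settings above can be tried out in a minimal simulation. The sketch below is not the code behind this page: the tilted double-well loss, the starting range, and the names loss, grad, sgd_step and simulate are all hypothetical, chosen only to give a small non-convex landscape to experiment with. Only η, σ and the particle count come from the readout.

```python
import random

ETA = 0.020       # learning rate η, from the readout
SIGMA = 0.050     # gradient noise scale σ, from the readout
N_PARTICLES = 5   # number of independent particles, from the readout


def loss(x):
    # Hypothetical tilted double well; the source does not specify its landscape.
    return x**4 - 2 * x**2 + 0.3 * x


def grad(x):
    # Exact derivative of the hypothetical loss above.
    return 4 * x**3 - 4 * x + 0.3


def sgd_step(x, rng, eta=ETA, sigma=SIGMA):
    # Assumed noise model: Gaussian noise of scale sigma added to the gradient.
    noisy_grad = grad(x) + sigma * rng.gauss(0.0, 1.0)
    return x - eta * noisy_grad


def simulate(n_steps, seed=0, sigma=SIGMA):
    # Start the particles at random positions (assumed range) and step them independently.
    rng = random.Random(seed)
    xs = [rng.uniform(-1.5, 1.5) for _ in range(N_PARTICLES)]
    for _ in range(n_steps):
        xs = [sgd_step(x, rng, sigma=sigma) for x in xs]
    return xs


if __name__ == "__main__":
    for x in simulate(2000):
        print(f"x = {x:+.3f}, loss = {loss(x):+.3f}")
```

With noise this small relative to the barrier of this particular loss, each particle would be expected to settle in whichever basin it started in. Passing a larger sigma to simulate is one way to explore when hopping between minima begins.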

Step 0