SGD on a Non-Convex Loss Landscape

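Stochastic gradient descent (SGD) estimates each gradient from a mini-batch, so every update carries noise of scale σ. This noise plays the role of temperature in a Langevin equation and lets the iterate escape local minima. Reducing the batch size increases the noise, which can improve generalization by steering the search toward "flat" minima.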
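
As a rough illustration of the noise-as-temperature picture, the sketch below runs gradient descent with added Gaussian noise on a one-dimensional double-well loss. Everything in it is an assumption made for illustration and not taken from the text: the toy loss f(x) = (x^2 - 1)^2 + 0.3x, the Gaussian noise model, the function names grad and noisy_descent, and the values of lr, sigma, and steps. Under this discretization the effective temperature is roughly lr * sigma^2 / 2.

```python
import random


def grad(x):
    """Gradient of the hypothetical toy loss f(x) = (x**2 - 1)**2 + 0.3*x."""
    return 4.0 * x * (x * x - 1.0) + 0.3


def noisy_descent(x0, lr, sigma, steps, seed=0):
    """Gradient descent with Gaussian noise of scale sigma added to each gradient.

    This stands in for mini-batch noise; real SGD noise is not exactly Gaussian.
    """
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        g = grad(x) + sigma * rng.gauss(0.0, 1.0)  # noisy gradient estimate
        x -= lr * g
    return x


# sigma = 0 is plain gradient descent: it settles in the shallow well near x = +1.
x_quiet = noisy_descent(x0=1.0, lr=0.01, sigma=0.0, steps=2000)

# sigma > 0 lets the iterate wander; whether it crosses the barrier into the
# deeper well near x = -1 depends on the seed and the number of steps.
x_noisy = noisy_descent(x0=1.0, lr=0.01, sigma=5.0, steps=2000)

print(x_quiet, x_noisy)
```

Raising sigma (or, in real SGD, shrinking the batch) makes barrier crossings more frequent, at the cost of a wider spread around whichever minimum the iterate ends up in.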
