Three optimizers descend the same loss surface simultaneously. SGD takes raw gradient steps and bounces in ravines. Adam adapts its step size per dimension and navigates more smoothly. Natural gradient accounts for the geometry of the parameter manifold, taking longer steps where the landscape is flat and shorter ones where it curves sharply.
SGD · Adam (Kingma & Ba, 2014) · natural gradient (Amari, 1998) · click canvas to set start
The simplest possible update rule: step in the direction of steepest descent. In a ravine — a valley that curves sharply in one direction and gently in another — SGD overshoots side to side while making slow progress along the valley floor. It has no memory of past gradients and no awareness of local curvature.
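The bouncing is easy to reproduce in a few lines. A minimal sketch, assuming a hypothetical quadratic ravine f(x, y) = ½(x² + 25y²) rather than the demo's actual surface:

```python
import numpy as np

# Hypothetical ravine: f(x, y) = 0.5 * (x**2 + 25 * y**2),
# steep across the valley (y), gentle along it (x).
def grad(p):
    x, y = p
    return np.array([x, 25.0 * y])

lr = 0.07                        # just inside the stability limit 2/25 for y
p = np.array([-2.0, 1.0])        # start on the ravine wall
traj = [p.copy()]
for _ in range(50):
    p = p - lr * grad(p)         # raw steepest-descent step
    traj.append(p.copy())
# y flips sign every step (bouncing wall to wall, shrinking by a factor of
# -0.75), while x creeps toward 0 at only a factor of 0.93 per step
```

With `lr` above 2/25 = 0.08 the y-coordinate diverges outright; below it, y oscillates from wall to wall while x limits the overall speed, exactly the ravine pathology described above.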
Adam maintains running estimates of the mean and variance of each gradient coordinate and rescales the step per dimension: larger steps where gradients are consistently small, smaller steps where they are large or noisy. The result feels as if the optimizer has learned the shape of the ravine, navigating the narrow dimension carefully while striding confidently along the valley floor.
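A sketch of the Adam update, run on the same hypothetical quadratic ravine. The hyperparameters β₁ = 0.9 and β₂ = 0.999 are the paper's defaults; the learning rate 0.1 is an illustrative choice, not necessarily what the demo uses:

```python
import numpy as np

def adam_step(p, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias correction (Kingma & Ba, 2014)."""
    m = b1 * m + (1 - b1) * g            # running mean of gradients
    v = b2 * v + (1 - b2) * g * g        # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)            # bias-corrected estimates
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-coordinate rescaling
    return p, m, v

# Same hypothetical ravine as above: f(x, y) = 0.5 * (x**2 + 25 * y**2)
grad = lambda p: np.array([p[0], 25.0 * p[1]])
p = np.array([-2.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 201):
    p, m, v = adam_step(p, grad(p), m, v, t)
```

Because each coordinate's step is normalized by the square root of its variance estimate, the steep y-direction and the gentle x-direction advance at comparable speeds instead of one dominating.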
The natural gradient, introduced by Shun-ichi Amari, preconditions the gradient by the inverse Fisher information matrix — a measure of how much each parameter direction actually changes the model's predictions. Where Adam adapts to gradient magnitudes, the natural gradient adapts to the statistical geometry of the model family itself. This version uses a diagonal Fisher approximation: it's the conceptual ancestor of Adam and shares many of its qualitative behaviors. The deeper idea is explored in The hidden geometry inside Adam.
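The preconditioning step itself fits in one line. A minimal illustration with a damped diagonal Fisher; the Fisher values and the damping constant `eps` are made-up inputs for demonstration, not part of the demo:

```python
import numpy as np

def natgrad_step(p, g, fisher_diag, lr=0.05, eps=1e-3):
    # Precondition by the inverse (damped) diagonal Fisher: directions that
    # strongly change the model's predictions (large F) get short steps,
    # flat directions (small F) get long ones. eps is damping for stability.
    return p - lr * g / (fisher_diag + eps)

p = np.zeros(2)
g = np.array([1.0, 1.0])        # identical raw gradient in both directions
F = np.array([0.01, 100.0])     # flat direction vs. sharp direction
p_new = natgrad_step(p, g, F)
# the flat direction moves orders of magnitude farther than the sharp one
```

This is the sense in which the natural gradient moves farther where the landscape is flat: the raw gradients are equal, but the Fisher-preconditioned steps differ by several orders of magnitude.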
Try the Rosenbrock landscape — the classic banana-shaped test function — to see where SGD struggles most. The ravine preset isolates the oscillation problem. The saddle preset, which curves downward in one direction and upward in another, reveals how quickly each optimizer escapes the flat region around the saddle point.
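For reference, the Rosenbrock function with its standard constants a = 1, b = 100 (the demo's exact scaling may differ):

```python
import numpy as np

# Rosenbrock: minimum f = 0 at (1, 1), approached through a curved,
# nearly flat banana-shaped valley that follows y = x**2.
def rosenbrock(p, a=1.0, b=100.0):
    x, y = p
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def rosenbrock_grad(p, a=1.0, b=100.0):
    x, y = p
    return np.array([
        -2 * (a - x) - 4 * b * x * (y - x ** 2),
        2 * b * (y - x ** 2),
    ])
```

The valley floor is almost flat while the walls are steep, so raw gradient steps either crawl or overshoot; that contrast is what makes it a standard optimizer benchmark.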
See also: the hidden geometry inside Adam · strange attractors · logistic map & chaos