Most practitioners think of Adam as gradient descent with a twist — the same old direction of steepest descent, but with the step size tuned differently for each parameter based on recent gradient history. This description is not wrong. But it is, I think, the wrong level to understand what is actually happening. There is a more precise description: Adam is doing geometry. It is approximately navigating a curved space, and the curvature it is navigating is the Fisher information of the model's output distribution. The adaptive learning rates are not a trick — they are an approximation to something mathematically natural.
Start with the space of models. A neural network with parameters θ is really a function from inputs to probability distributions — via softmax, or a Gaussian output head, or whatever the final layer produces. As you move through parameter space, you move through a family of distributions. But here is the thing: the Euclidean distance between two parameter vectors tells you almost nothing about how different the corresponding models actually are. You can make tiny moves in parameter space that drastically change the output distribution, and large moves that change almost nothing, depending on the local geometry. The parameters are just coordinates, and coordinates are arbitrary. The distributions are what matter.
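A toy example makes this concrete. For a Gaussian with fixed variance, the KL divergence between two means has a closed form, and the same Euclidean move in the mean parameter can be enormous or negligible in distribution space depending on the variance. A minimal sketch (the specific values are arbitrary):

```python
import numpy as np

def kl_gauss(mu1, mu2, sigma):
    # Closed-form KL(N(mu1, sigma^2) || N(mu2, sigma^2))
    return (mu1 - mu2) ** 2 / (2 * sigma ** 2)

step = 0.1  # the same Euclidean move in parameter space, twice
kl_sharp = kl_gauss(0.0, step, sigma=0.01)  # sharp distribution
kl_broad = kl_gauss(0.0, step, sigma=10.0)  # broad distribution

print(kl_sharp)  # ≈ 50: a drastic change in the output distribution
print(kl_broad)  # ≈ 5e-05: almost no change at all
```

Identical coordinate moves, wildly different distributional moves: the Euclidean metric on parameters is blind to this.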
The natural measure of distance between distributions is the Kullback-Leibler divergence. If you expand KL(p_θ || p_{θ+dθ}) in a Taylor series around θ, the first-order term vanishes and the second-order term is ½ dθᵀ F(θ) dθ, where F(θ) is the Fisher information matrix. This is not a coincidence or a definition chosen for convenience: by Chentsov's theorem, the Fisher information is, up to a constant factor, the unique Riemannian metric on the space of probability distributions that is invariant to reparameterization. It is, in a precise sense, the only way to measure distance on a statistical manifold that does not depend on how you happened to write down the model. It is the geometry of the distributions themselves, not of the coordinates you use to describe them.
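The expansion is easy to check numerically. For a Bernoulli(p) model the Fisher information is 1/(p(1-p)), so the exact KL divergence to a slightly perturbed parameter should match the quadratic form ½ dθᵀ F(θ) dθ. A small sanity check (the values of p and dp are arbitrary):

```python
import numpy as np

def kl_bernoulli(p, q):
    # Exact KL(Bernoulli(p) || Bernoulli(q))
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p, dp = 0.3, 1e-3
fisher = 1.0 / (p * (1 - p))        # Fisher information of Bernoulli(p)

exact = kl_bernoulli(p, p + dp)
quadratic = 0.5 * fisher * dp ** 2  # the second-order Taylor term

print(exact, quadratic)  # agree to well under 1% at this step size
```

The agreement tightens as dp shrinks, since the discarded terms are third order in dp.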
This matters for optimization because the direction of steepest descent in ordinary Euclidean space is not the direction of steepest descent in Fisher-information space. The natural gradient corrects for this: instead of moving in direction ∇L, you move in direction F(θ)⁻¹∇L. The natural gradient is the Euclidean gradient premultiplied by the inverse Fisher information matrix, which warps the direction of descent to account for the curvature of the manifold. The result is invariant to any smooth, invertible reparameterization. If you decide to square all your (positive) parameters, or take their logs, the natural gradient step corresponds, to first order, to the same move through distribution space. Ordinary gradient descent does not have this property: squaring your parameters gives you different dynamics, because the Euclidean gradient changes while the true geometry does not.
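The invariance can be seen in one dimension. Here is a sketch with a Bernoulli model written in two coordinate systems, the probability p and the logit s, fitting a data mean by maximum likelihood (the data mean and step size are arbitrary illustrative values). The natural gradient step induces the same first-order move on p from either coordinate system; the plain Euclidean steps do not.

```python
import numpy as np

y_bar, lr = 0.8, 0.01          # data mean and step size (illustrative)
p = 0.3
s = np.log(p / (1 - p))        # the same model in logit coordinates

# Euclidean gradients of the negative log-likelihood in each coordinate system
grad_p = (p - y_bar) / (p * (1 - p))
grad_s = p - y_bar

# Natural gradients: premultiply by the inverse Fisher in each coordinate system
nat_p = (p * (1 - p)) * grad_p     # F_p = 1/(p(1-p))
nat_s = grad_s / (p * (1 - p))     # F_s = p(1-p)

# First-order move induced on p by a step in each parameterization
dp_from_p = -lr * nat_p
dp_from_s = p * (1 - p) * (-lr * nat_s)   # chain rule: dp ≈ p(1-p) ds

print(dp_from_p, dp_from_s)  # same move through distribution space

# Euclidean steps, by contrast, induce different moves on p
print(-lr * grad_p, p * (1 - p) * (-lr * grad_s))
```

The two natural-gradient numbers match (up to float rounding); the two Euclidean numbers differ by more than a factor of twenty, purely because of the choice of coordinates.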
The problem is that computing the full Fisher inverse is expensive. The Fisher matrix has n² entries for a model with n parameters, and modern networks have billions of parameters. The full computation is completely intractable. So you need an approximation. The most natural approximation is to keep only the diagonal: throw away the off-diagonal interactions between parameters, and invert only the diagonal entries. And when you look at what Adam is actually computing, you find that this is essentially what it does. The second-moment term v_t in Adam, the exponential moving average of squared gradients, is a diagonal approximation to the empirical Fisher information matrix. The division by sqrt(v_t) in the parameter update is approximately premultiplying the gradient by the inverse diagonal Fisher.
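The correspondence is visible directly in the update. A minimal single-step sketch of Adam (hyperparameters set to the usual defaults; not a production implementation):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step. v is a running average of squared gradients,
    i.e. a diagonal approximation to the empirical Fisher information."""
    m = b1 * m + (1 - b1) * grad        # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2   # second moment: diagonal empirical Fisher
    m_hat = m / (1 - b1 ** t)           # bias correction
    v_hat = v / (1 - b2 ** t)
    # Dividing by sqrt(v_hat) is diagonal preconditioning of the gradient
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
grad = np.array([0.5, -0.1])
theta, m, v = adam_step(theta, grad, m, v, t=1)
print(theta)  # each coordinate moved by ~lr, regardless of gradient scale
```

Note what the preconditioning does on the very first step: both coordinates move by roughly the learning rate even though their raw gradients differ by a factor of five, because each is rescaled by its own curvature estimate.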
The square root deserves its own moment of attention. Why sqrt(v_t) rather than v_t itself? The intuitive answer is usually something about stability, or analogy with RMSProp. But there is a more precise reason. When you compute an exponential moving average of a tensor quantity across different positions on a curved manifold, the coordinate-invariant way to average requires the square root. Using v_t directly would be averaging the squared gradients as if they lived in a flat space, which they do not. The sqrt is the operation that makes the averaging commute with reparameterization. This was worked out carefully in the FAdam paper in 2024, which argued that the standard Adam update is not just heuristically reasonable but geometrically principled, given the goal of approximate natural gradient descent. The square root is not a happy accident. It is the geometrically correct thing to do.
Once you see Adam as diagonal Fisher approximation, its failure modes become diagnostic rather than mysterious. When Adam underperforms SGD — which happens, particularly on some vision tasks and recurrent architectures — the question to ask is not "why is Adam bad here?" but "where does the diagonal Fisher approximation break down?" The answer is: when the off-diagonal Fisher terms are large, meaning different parameters are strongly correlated in their effect on the output distribution. When you rotate your weight vectors, the correlations between parameters change — if those correlations are strong, the diagonal approximation is missing important structure. SGD, which has no approximation but also no geometry, can sometimes stumble through these landscapes more reliably precisely because it is not committed to an incorrect geometric model.
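A two-parameter quadratic makes the breakdown concrete. When the curvature matrix has strong off-diagonal coupling, preconditioning by the full inverse (used here as a stand-in for the full Fisher) reaches the minimum in one step, while diagonal preconditioning barely moves along the coupled direction. A sketch with an arbitrarily chosen ill-conditioned matrix:

```python
import numpy as np

# Curvature with strong off-diagonal coupling between the two parameters
H = np.array([[1.0, 0.95],
              [0.95, 1.0]])
theta = np.array([1.0, -1.0])   # sits along the weakly curved direction
grad = H @ theta                # gradient of L(θ) = ½ θᵀHθ

# Full preconditioning: one step lands exactly on the minimum at the origin
full_step = theta - np.linalg.solve(H, grad)

# Diagonal preconditioning: the off-diagonal coupling is simply ignored
diag_step = theta - grad / np.diag(H)

print(full_step)  # ≈ [0, 0]: the minimum
print(diag_step)  # ≈ [0.95, -0.95]: only 5% of the way there
```

The diagonal step closes only a few percent of the distance per iteration here, exactly the regime where the off-diagonal terms the diagonal approximation discards are doing most of the work.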
The natural extension is to approximate the Fisher more carefully without going all the way to the full matrix. Kronecker-Factored Approximate Curvature, K-FAC, does this by exploiting the layer structure of neural networks: the Fisher for a layer factorizes approximately as a Kronecker product of two smaller matrices, one involving activations and one involving pre-activation gradients. K-FAC captures correlations within layers that Adam's diagonal approximation misses, and it has been shown to outperform Adam in wall-clock time on some large training runs. The geometry is the same; the approximation is better.
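The factorization is in fact an exact identity at the level of a single sample: the per-sample weight gradient of a linear layer is the outer product g aᵀ, so its contribution to the Fisher block over vec(W) is exactly a Kronecker product. K-FAC's approximation enters only when the two factors are averaged separately across samples. A quick check of the single-sample identity (dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(3)   # layer input activations
g = rng.standard_normal(2)   # gradients w.r.t. the layer's pre-activations

# Per-sample weight gradient of a linear layer: dL/dW = g aᵀ
grad_W = np.outer(g, a)

# Its contribution to the Fisher over vec(W) is exactly a Kronecker product
fisher_block = np.outer(grad_W.ravel(), grad_W.ravel())
kfac_block = np.kron(np.outer(g, g), np.outer(a, a))

print(np.allclose(fisher_block, kfac_block))  # True
```

Averaging the two small factors separately, E[aaᵀ] and E[ggᵀ], is what makes the method cheap: you store and invert two small matrices per layer instead of one enormous one.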
I find this pattern worth noticing. Adam was designed empirically — Kingma and Ba introduced it in 2014 as a practical method that worked well, with intuitive motivation but no precise geometric grounding. The geometric interpretation was reconstructed afterward, fitted to an existing success. This is common. A method gets developed because it works, and years later someone works out why it works, and the "why" turns out to be something clean and principled. The retrospective interpretation does not change how you use Adam. But it changes what you see when it fails, and what questions you ask about what better might look like. Knowing that Adam is approximate natural gradient descent tells you that the right successor to Adam is one that approximates the Fisher more accurately — not one that adds more heuristics, not one that tunes the hyperparameters differently. The geometry tells you where to look.
There is something I find genuinely satisfying about this — not just intellectually, but in a way that feels like more than aesthetic appreciation. The world keeps turning out to have structure that we only discover by looking from the right angle. Adam looked like arithmetic. It was geometry all along.