The space of all probability distributions over a fixed set of outcomes can be treated as a geometric object. This is not a metaphor. The distributions form a manifold — a smooth space with local coordinates — and there is a natural way to define distance, angles, and curvature on it. The resulting structure is called information geometry, and it has consequences for statistics, machine learning, and the philosophy of inference that took decades to fully appreciate.
The key ingredient is the Fisher information matrix. If you have a parametric family of distributions — say, all Gaussian distributions, indexed by mean and variance — the Fisher information at a point tells you how rapidly the distributions change as you vary the parameters. More precisely, it measures how much information a single observation carries about the parameters. The Fisher matrix defines a metric: a way of assigning lengths to paths through the space of distributions, such that paths through regions where the distribution changes quickly are longer than paths through regions where it changes slowly. This metric, proposed by Calyampudi Radhakrishna Rao in 1945 and developed systematically by Shun-ichi Amari starting in the 1980s, gives the manifold of distributions a genuine Riemannian geometry.
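The Fisher matrix can be checked numerically. For a Gaussian in (mu, sigma) coordinates it works out to diag(1/σ², 2/σ²), and the sketch below verifies this by Monte Carlo, using the fact that the Fisher matrix is the expected outer product of the score (the gradient of the log-density with respect to the parameters). This is my own illustration in plain NumPy; the helper name `score` is just a label, not an established API.

```python
import numpy as np

# Fisher information of N(mu, sigma^2) in (mu, sigma) coordinates.
# Analytically it is diag(1/sigma^2, 2/sigma^2). We check this by
# Monte Carlo: the Fisher matrix equals E[score score^T], where the
# score is the gradient of the log-density w.r.t. the parameters.

def score(x, mu, sigma):
    """Gradient of log N(x; mu, sigma^2) with respect to (mu, sigma)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = (x - mu)**2 / sigma**3 - 1.0 / sigma
    return np.stack([d_mu, d_sigma])

rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0
x = rng.normal(mu, sigma, size=200_000)

s = score(x, mu, sigma)              # shape (2, n)
fisher_mc = (s @ s.T) / x.size       # empirical E[score score^T]
fisher_exact = np.diag([1 / sigma**2, 2 / sigma**2])

print(fisher_mc)      # close to the exact matrix below
print(fisher_exact)
```

Note that the off-diagonal entries come out near zero: for a Gaussian, the mean and standard deviation are orthogonal parameters under the Fisher metric.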
The geometry is not flat. Different regions of the distribution manifold have different curvature, and this curvature is not an artifact of how you parametrize the distributions — it is intrinsic, coordinate-invariant. This matters because curvature affects what you can do. In flat space, the shortest path between two points is a straight line. On a curved manifold, the shortest paths — geodesics — are curved, and there are phenomena, like the path dependence of parallel transport, that have no flat-space analog. The Cramér-Rao bound, one of the most fundamental results in statistics, has a clean geometric interpretation: it says that the variance of any unbiased estimator is at least as large as the inverse of the Fisher information, which is a statement about the minimum distance you can move in the distribution manifold per unit of observed data.
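The bound is easy to see numerically in the simplest case: for n i.i.d. draws from N(mu, σ²) with σ known, the per-observation Fisher information is 1/σ², so any unbiased estimator of mu has variance at least σ²/n, and the sample mean attains it. A quick Monte Carlo sketch (my own, with illustrative constants):

```python
import numpy as np

# Cramér-Rao sanity check for estimating the mean of N(mu, sigma^2)
# with sigma known. Per-observation Fisher information: I(mu) = 1/sigma^2,
# so for n samples the bound says Var(mu_hat) >= sigma^2 / n.
# The sample mean is unbiased and attains the bound exactly.

rng = np.random.default_rng(1)
mu, sigma, n, trials = 0.0, 3.0, 50, 100_000

samples = rng.normal(mu, sigma, size=(trials, n))
mu_hat = samples.mean(axis=1)        # unbiased estimator of mu

cr_bound = sigma**2 / n              # 1 / (n * Fisher information)
print(mu_hat.var(), cr_bound)        # empirical variance vs. the bound
```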
There are two natural coordinate systems on the manifold, dual to each other under the Fisher metric. Exponential families — Gaussians, Poisson, gamma, and many others — form a flat submanifold in one coordinate system (called e-flat), and mixture families form a flat submanifold in the other (m-flat). The duality between these coordinate systems gives the geometry a richer structure, and computations that are hard in one system are often easy in the other. This duality has a statistical interpretation: e-flatness corresponds to independence-like structure, m-flatness to mixture-like structure, and the way the two fit together describes the geometry of inference.
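The simplest concrete instance of the duality is the Bernoulli family, sketched below under my own notation: the e-coordinate is the natural parameter θ = log(p/(1−p)), the m-coordinate is the mean parameter η = E[x] = p, the two are related by a Legendre transform through the log-partition function ψ(θ) = log(1 + eᶿ), and the Fisher metric computed in the two coordinate systems gives a pair of mutually inverse (here 1×1) matrices.

```python
import numpy as np

# Dual coordinates for the Bernoulli family, a minimal illustration.
# e-coordinate: natural parameter theta = log(p / (1 - p)), with
#   log-partition psi(theta) = log(1 + e^theta).
# m-coordinate: mean parameter eta = E[x] = p.
# Legendre duality: eta = psi'(theta) = sigmoid(theta), and the Fisher
# metric in the two coordinate systems -- psi''(theta) = p(1-p) in e,
# 1/(eta(1-eta)) in m -- comes out as a pair of mutual inverses.

p = 0.3
theta = np.log(p / (1 - p))           # e-coordinate (log-odds)
eta = 1 / (1 + np.exp(-theta))        # psi'(theta): back to the mean

fisher_e = eta * (1 - eta)            # metric in e-coordinates
fisher_m = 1 / (eta * (1 - eta))      # metric in m-coordinates

print(eta, fisher_e * fisher_m)       # eta recovers p; product is 1
```

The product of the two metrics being 1 is the 1-dimensional shadow of the general statement that the e- and m-coordinate expressions of the Fisher metric are inverse matrices of each other.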
One application of this structure is natural gradient descent, developed by Amari. Ordinary gradient descent moves in the direction of steepest descent in parameter space, but parameter space is not the space that matters — the space of distributions is. If you change a parameter in a region where the distribution changes slowly, you have not moved far in distribution space; if you change the same parameter by the same amount where the distribution changes rapidly, you have moved far. The ordinary gradient ignores this distinction. Natural gradient descent corrects for it by using the Fisher metric to measure the true steepness of the loss in distribution space: the natural gradient is the ordinary gradient preconditioned by the inverse Fisher matrix. The result is an optimization algorithm that is invariant to reparametrization and converges faster in many practical settings.
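A minimal sketch of the difference, fitting a Gaussian's (mu, sigma) by gradient descent on the negative log-likelihood; the Fisher matrix diag(1/σ², 2/σ²) supplies the preconditioner. The helper names, step size, and iteration count are my own illustrative choices, not a prescribed implementation.

```python
import numpy as np

# Natural vs. ordinary gradient descent on the mean negative
# log-likelihood of a Gaussian, in (mu, sigma) coordinates.
# The natural gradient preconditions by the inverse Fisher matrix
# diag(sigma^2, sigma^2 / 2); note its mu-update becomes
# mu -= lr * (mu - mean(data)), independent of sigma.

rng = np.random.default_rng(2)
data = rng.normal(5.0, 0.1, size=10_000)   # true mu = 5.0, sigma = 0.1

def nll_grad(mu, sigma):
    """Gradient of the mean negative log-likelihood w.r.t. (mu, sigma)."""
    d_mu = (mu - data.mean()) / sigma**2
    d_sigma = 1 / sigma - ((data - mu)**2).mean() / sigma**3
    return np.array([d_mu, d_sigma])

def fit(natural, steps=200, lr=0.1):
    mu, sigma = 0.0, 1.0                   # deliberately poor start
    for _ in range(steps):
        g = nll_grad(mu, sigma)
        if natural:
            g = np.diag([sigma**2, sigma**2 / 2]) @ g   # F^{-1} g
        mu, sigma = mu - lr * g[0], sigma - lr * g[1]
    return mu, sigma

print(fit(natural=True))    # close to the true (5.0, 0.1)
print(fit(natural=False))   # ordinary gradient: still far from mu = 5
```

The ordinary gradient crawls because the small sigma of the data makes the loss surface badly scaled in these coordinates; the natural gradient's effective step on mu is the same no matter how the family is parametrized.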
What I find most striking about information geometry is what it implies about the nature of inference itself. To update your beliefs — to move from prior to posterior — is to move along a path in the manifold of distributions. The Fisher metric tells you how much information you need to traverse a given stretch of that path. Regions of high curvature are regions where your beliefs are sensitive: small changes in the data move you a long way in distribution space. Regions of low curvature are regions where you are relatively insensitive, where large amounts of data only slowly shift the posterior.
The geometry is not just a tool for computation; it is a description of the structure of rational belief change. The fact that this structure has curvature — that the space of beliefs is not flat — means that inference is not translation. Moving from prior to posterior is not just adding a vector; it depends on where you are. The shape of the manifold at your current position constrains how you can move and how far you go. In a literal geometric sense, what you believe shapes how you can change what you believe. That seems like something worth knowing.