Each hidden state evolves as dh/dt = f_θ(h, t). A Neural ODE is the continuous-depth limit of a residual network with tied weights; the adjoint method computes its gradients in O(1) memory.
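A minimal sketch of the residual-net equivalence, assuming a toy one-layer tanh dynamics function (the names `f`, `euler_solve`, and the weight matrix are illustrative, not from the source):

```python
import numpy as np

def f(h, t, W):
    # Hypothetical dynamics f_theta(h, t): a single tanh layer.
    return np.tanh(W @ h)

def euler_solve(h0, W, t0=0.0, t1=1.0, steps=100):
    # Fixed-step Euler: h_{k+1} = h_k + dt * f(h_k, t_k).
    # With dt = 1 and one step per layer this is exactly a
    # weight-tied residual block: h + f(h).
    h, dt = h0.copy(), (t1 - t0) / steps
    for k in range(steps):
        h = h + dt * f(h, t0 + k * dt, W)
    return h

rng = np.random.default_rng(0)
W = 0.5 * rng.normal(size=(3, 3))
h1 = euler_solve(np.ones(3), W)
```

Shrinking dt while growing the step count leaves the output nearly unchanged; that stability under refinement is the continuous-depth limit.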
Distribution Transformation — Normalizing Flow
Continuous normalizing flow (CNF): the log-density evolves as d log p/dt = -div(f) = -tr(∂f/∂h). Maps a simple base distribution (e.g. Gaussian) to a complex target; the map is inverted by integrating the ODE backwards in time.
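A sketch of jointly integrating the state and the log-density change, reusing a toy tanh dynamics function (all names here are illustrative assumptions):

```python
import numpy as np

def f(h, W):
    # Hypothetical CNF dynamics: a single tanh layer.
    return np.tanh(W @ h)

def jac_trace(h, W):
    # tr(df/dh) for f(h) = tanh(W h): sum_i (1 - tanh(Wh)_i^2) * W_ii.
    s = 1.0 - np.tanh(W @ h) ** 2
    return float(np.sum(s * np.diag(W)))

def cnf_forward(h0, W, t1=1.0, steps=200):
    # Euler-integrate dh/dt = f(h) and d log p/dt = -tr(df/dh) jointly.
    h, delta_logp, dt = h0.copy(), 0.0, t1 / steps
    for _ in range(steps):
        delta_logp -= jac_trace(h, W) * dt
        h = h + dt * f(h, W)
    return h, delta_logp
```

The accumulated delta_logp gives log p(h(t1)) = log p_base(h0) + delta_logp; running the same loop with dt negated inverts the map.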
Depth = Time: Residual vs Continuous
Discrete ResNet: h_{t+1} = h_t + f(h_t) — the forward Euler step (dt = 1) of the underlying ODE.
Neural ODE: a black-box ODE solver with adaptive step size; O(1) memory via the adjoint method, at the cost of a slower forward pass.
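The solver trade-off can be seen by comparing the fixed-step Euler scheme (the ResNet update) with a higher-order solver on the same dynamics (a toy example; the dynamics function is chosen arbitrarily for illustration):

```python
import numpy as np

def f(h):
    # Arbitrary smooth dynamics for the comparison.
    return np.tanh(0.5 * h)

def euler(h, steps, T=1.0):
    # ResNet-style update: one function evaluation per step.
    dt = T / steps
    for _ in range(steps):
        h = h + dt * f(h)
    return h

def rk4(h, steps, T=1.0):
    # Classical Runge-Kutta: four evaluations per step, 4th-order accurate.
    dt = T / steps
    for _ in range(steps):
        k1 = f(h)
        k2 = f(h + 0.5 * dt * k1)
        k3 = f(h + 0.5 * dt * k2)
        k4 = f(h + dt * k3)
        h = h + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return h

h0 = np.array([1.0, -2.0])
reference = rk4(h0, 1000)  # near-exact solution
```

At an equal step budget RK4 lands far closer to the reference than Euler, but each step costs 4x the function evaluations — the slower forward pass in exchange for accuracy.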
Adjoint Method — Memory vs Accuracy
Adjoint sensitivity: solve an augmented ODE backwards in time to obtain gradients. Memory is O(1), independent of the number of integration steps. Trade-off: activations are recomputed during the backward pass instead of stored. Enables effectively infinite-depth networks.
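A scalar sketch of the adjoint backward pass, assuming linear dynamics dh/dt = θh so the true gradient dL/dθ = h₀·e^θ is known in closed form (the function and variable names are illustrative):

```python
import numpy as np

def adjoint_grad(h0, theta, T=1.0, steps=2000):
    dt = T / steps
    # Forward pass stores only the final state: O(1) memory.
    h = h0
    for _ in range(steps):
        h = h + dt * theta * h
    # Backward pass: re-integrate h in reverse while accumulating
    # the adjoint a(t) = dL/dh(t) and the parameter gradient.
    # Loss L = h(T), so a(T) = 1; da/dt = -a * df/dh = -a * theta.
    a, grad = 1.0, 0.0
    for _ in range(steps):
        grad += dt * a * h       # dL/dtheta accumulates a * df/dtheta
        a += dt * a * theta      # adjoint ODE, stepped backwards in time
        h -= dt * theta * h      # recompute the state trajectory
    return grad
```

The state trajectory is recomputed rather than stored — exactly the memory-for-compute trade; for stiff or long integrations the recomputed trajectory can drift from the forward one, which is the accuracy side of the trade.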