Neural networks learn structured representations by updating their hidden-layer weights during training — unlike kernel methods (NTK), which freeze features at initialization. This is the "feature learning" vs "lazy training" distinction.
Feature learning regime (finite width, large η): first-layer weights rotate to align with task-relevant directions, so the representation itself changes — gradients flowing through the hidden layers reshape the hidden-unit basis.

Lazy training / NTK regime (very wide network, small η): weights barely move from initialization; the network behaves as a fixed kernel machine, and no representation change occurs.

Key insight (Yang & Hu 2021, μP): the transition between regimes is controlled by how the learning rate scales with width. Under the "maximal update parameterization" (μP), feature learning persists even in the infinite-width limit.
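The contrast can be seen numerically. A minimal sketch (toy setup, not from the source): a two-layer ReLU network trained by full-batch gradient descent on a small regression task, with NTK-style 1/√width output scaling. We measure how far the first-layer weights drift from initialization — the drift shrinks as width grows (lazy regime), while narrow networks move their weights appreciably (feature learning).

```python
import numpy as np

def relative_weight_movement(width, steps=200, lr=0.1, seed=0):
    """Relative Frobenius-norm change of first-layer weights after training.

    Toy two-layer ReLU network with NTK output scaling 1/sqrt(width);
    all names and hyperparameters here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((32, 4))       # 32 samples, 4 input dims
    y = np.sin(X[:, 0:1])                  # simple smooth target
    W1 = rng.standard_normal((4, width))   # first layer: the "features"
    w2 = rng.standard_normal((width, 1))   # linear readout
    W1_init = W1.copy()
    scale = 1.0 / np.sqrt(width)           # NTK-style output scaling
    n = len(X)
    for _ in range(steps):
        h = np.maximum(X @ W1, 0.0)        # ReLU hidden activations
        pred = scale * (h @ w2)
        err = pred - y                     # gradient of 0.5 * MSE w.r.t. pred
        g2 = scale * h.T @ err / n
        g1 = X.T @ ((err @ w2.T) * (h > 0) * scale) / n
        W1 -= lr * g1
        w2 -= lr * g2
    return np.linalg.norm(W1 - W1_init) / np.linalg.norm(W1_init)

for width in (16, 256, 4096):
    print(f"width={width:5d}  relative W1 movement={relative_weight_movement(width):.4f}")
```

As width increases under this parameterization, the printed movement decreases toward zero — the network approaches a fixed kernel machine. μP changes the width-scaling of the initialization and per-layer learning rates precisely so that this drift stays order-one at any width.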