TRANSFORMER CIRCUIT

Mechanistic interpretability: tracing information flow through transformer layers

Transformer circuits (a mechanistic-interpretability framework) decompose transformer computation into understandable components. The key insight is the residual stream: each layer reads from and writes to a central stream x, so the full computation is x_final = x_0 + ∑_l (Attn_l(x_l) + MLP_l(x_l)), where x_l is the stream as layer l sees it. Layers therefore communicate via the residual stream rather than through each other directly. Induction heads are circuits that complete repeated patterns ([A][B] … [A] → [B]): they attend to the token that followed the previous occurrence of the current token and copy it forward. Superposition allows n features to be represented in d < n dimensions by exploiting the near-orthogonality of random directions in high-dimensional space. The circuit view reveals which heads and neurons are responsible for specific capabilities, offering a path toward understanding what transformers actually compute.
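The additive residual-stream decomposition can be checked numerically. The sketch below is a toy model with made-up sizes, random weights, and stand-in blocks, not a real transformer: each block reads the current stream and adds its output back, so the final stream is exactly x_0 plus the sum of every block's write.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16    # residual stream width (illustrative toy size)
n_layers = 3

def attn(x, W):
    # Stand-in for an attention block's contribution: a linear read/write.
    return x @ W

def mlp(x, W_in, W_out):
    # Stand-in for an MLP block: read, ReLU nonlinearity, write back.
    return np.maximum(x @ W_in, 0) @ W_out

# Random toy weights for each layer.
attn_W = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_layers)]
mlp_W = [(rng.normal(scale=0.1, size=(d_model, 4 * d_model)),
          rng.normal(scale=0.1, size=(4 * d_model, d_model)))
         for _ in range(n_layers)]

x0 = rng.normal(size=d_model)

# Run layer by layer: every block reads the current stream, writes additively.
x = x0.copy()
contributions = []
for l in range(n_layers):
    a = attn(x, attn_W[l])
    x = x + a
    m = mlp(x, *mlp_W[l])
    x = x + m
    contributions += [a, m]

# The final stream equals x0 plus the sum of all block writes.
assert np.allclose(x, x0 + sum(contributions))
```

Note that each block's contribution depends on the stream as that block saw it, which is why the sum is taken over the recorded writes rather than recomputed from x_0.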
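The induction-head behavior can be illustrated without any attention machinery. The helper below is hypothetical, not taken from any model: it mimics what the circuit computes by finding the previous occurrence of the current token and copying the token that followed it.

```python
def induction_predict(tokens):
    # Toy induction-head behavior (hypothetical helper, not a real model):
    # to predict what follows the last token, find that token's previous
    # occurrence and copy the token after it -- the [A][B] ... [A] -> [B] pattern.
    cur = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == cur:
            return tokens[i + 1]
    return None  # no earlier occurrence: nothing to copy

print(induction_predict(["the", "cat", "sat", "on", "the"]))  # -> cat
```

A real induction head implements this with two cooperating components: a previous-token head that moves positional information, and a head that uses it to attend one position past the earlier match.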
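Near-orthogonality is easy to demonstrate directly: random directions in a few hundred dimensions interfere only weakly with each other, which is what makes superposition viable. The sizes below (1024 features in 256 dimensions) are illustrative choices, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d = 1024, 256  # pack 1024 feature directions into 256 dims

# Random unit vectors serve as feature directions.
dirs = rng.normal(size=(n_features, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# Off-diagonal entries of the Gram matrix measure pairwise interference.
gram = dirs @ dirs.T
interference = np.abs(gram - np.eye(n_features)).max()
print(f"max |cos| between distinct feature directions: {interference:.3f}")
```

Typical pairwise cosines concentrate around 1/sqrt(d), so even the worst-case overlap stays small; a model can tolerate this interference in exchange for representing far more features than it has dimensions.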