Linear Attention vs Softmax Attention

Linear: sim(qᵢ, kⱼ) = φ(qᵢ)ᵀφ(kⱼ) — because the similarity factorizes through the feature map φ, the attention sum becomes associative: Σⱼ φ(qᵢ)ᵀφ(kⱼ)vⱼ = φ(qᵢ)ᵀ(Σⱼ φ(kⱼ)vⱼᵀ). The Σⱼ φ(kⱼ)vⱼᵀ term is computed once and shared across all queries, so the kernel decomposition turns the O(N²) pairwise cost into O(N·m), where m is the dimension of φ's feature space. A sketch of both variants follows.
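
Below is a minimal NumPy sketch contrasting the two. The feature map φ(x) = elu(x) + 1 is an assumption (one common choice; the text above does not fix a specific φ), and the shapes N = 512, d = m = 64 are illustrative values, not prescribed by the source.

```python
import numpy as np

def phi(x):
    # Assumed feature map: elu(x) + 1. Positivity keeps the
    # normalizer phi(q) . sum_j phi(k_j) strictly greater than zero.
    return np.where(x > 0, x + 1.0, np.exp(x))

def softmax_attention(Q, K, V):
    # Standard attention: materializes an N x N score matrix,
    # so both time and memory are O(N^2).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    # Kernel decomposition: sim(q_i, k_j) = phi(q_i)^T phi(k_j).
    # Associativity lets us precompute S = sum_j phi(k_j) v_j^T
    # (an m x d matrix) and z = sum_j phi(k_j) once, then reuse
    # them for every query: O(N * m) similarity work, no N x N matrix.
    Qf, Kf = phi(Q), phi(K)            # (N, m) feature-mapped Q and K
    S = Kf.T @ V                       # (m, d), shared across all queries
    z = Kf.sum(axis=0)                 # (m,), normalizer statistics
    return (Qf @ S) / (Qf @ z)[:, None]

rng = np.random.default_rng(0)
N, d = 512, 64                         # illustrative sizes
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape) # (512, 64)
```

The two functions return different values in general (φ(q)ᵀφ(k) only approximates the exponential kernel of softmax), but the linear variant never builds the N×N matrix, which is the whole point of the associative trick.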