Linear Attention vs Softmax Attention
[Interactive demo: Max N = 512, feature dim d = 64]
Linear: sim(qᵢ,kⱼ) = φ(qᵢ)ᵀφ(kⱼ) — replacing the softmax with a kernel feature map φ of dimension m makes the attention product associative, so (φ(Q)φ(K)ᵀ)V can be regrouped as φ(Q)(φ(K)ᵀV), turning the O(N²·d) cost into O(N·m·d)
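A minimal sketch of the associativity trick, using the demo's N = 512 and d = 64. The feature map here is elu(x)+1 (an assumption borrowed from common linear-attention formulations; any positive φ works); the key point is that φ(K)ᵀV is an m×d matrix, so the N×N attention matrix is never materialized:

```python
import numpy as np

def phi(x):
    # elu(x) + 1: a simple positive feature map (an assumption, not
    # the only choice) so the normalizer stays positive.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # O(N*m*d): regroup (phi(Q) phi(K)^T) V as phi(Q) (phi(K)^T V).
    Qf, Kf = phi(Q), phi(K)          # (N, m) feature-mapped queries/keys
    KV = Kf.T @ V                    # (m, d) summary, independent of N^2
    Z = Kf.sum(axis=0)               # (m,) normalizer term
    return (Qf @ KV) / (Qf @ Z)[:, None]

rng = np.random.default_rng(0)
N, d = 512, 64
Q, K, V = rng.standard_normal((3, N, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (512, 64)
```

Computing the same quantity left-to-right, φ(Q)φ(K)ᵀ first, gives an identical result but costs O(N²·d); the regrouping changes only the order of operations, not the output.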