Transformer — Scaled Dot-Product Attention

Attention(Q,K,V) = softmax(QKᵀ / √dₖ) · V

Definitions

Q — query vectors (each of dimension dₖ)
K — key vectors (each of dimension dₖ)
V — value vectors

Scores = QKᵀ
Scaled = scores / √dₖ
Weights = softmax(scaled)
Output = weights · V
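The four steps above map directly onto a few lines of NumPy. This is a minimal single-head sketch (no masking, no batching); the shapes and variable names are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T                # Scores = QKᵀ
    scaled = scores / np.sqrt(d_k)  # Scaled = scores / √dₖ
    weights = softmax(scaled)       # Weights = softmax(scaled), rows sum to 1
    return weights @ V, weights     # Output = weights · V

# Hypothetical example: 4 tokens, dₖ = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per query token
```

Each row of `w` is the attention pattern the visualization highlights when you click a token row: a probability distribution over the key positions.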

Click a token row to highlight its attention pattern.