Transformer — Scaled Dot-Product Attention
Attention(Q,K,V) = softmax(QKᵀ / √dₖ) · V
Settings
Sequence length:
6
Key dimension dₖ:
8
Temperature scale:
1/√dₖ (standard)
1/1 (no scaling)
1/(dₖ/2) (hot)
Randomize Q,K,V
Sparse pattern
Uniform attention
Q
— query vectors
K
— key vectors
V
— value vectors
Scores = QKᵀ
Scaled = scores / √dₖ
Weights = softmax(scaled)
Output = weights · V
Click a token row to highlight its attention pattern.
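The four pipeline steps above (scores, scaling, softmax, weighted sum) can be sketched in NumPy. Shapes follow the demo's defaults (6 tokens, dₖ = 8); the function and variable names are illustrative, not part of the demo:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, scale=None):
    """Attention(Q, K, V) = softmax(QK^T * scale) @ V."""
    d_k = K.shape[-1]
    if scale is None:
        scale = 1.0 / np.sqrt(d_k)   # standard 1/sqrt(d_k) temperature

    scores = Q @ K.T                 # Scores = QK^T, shape (n, n)
    scaled = scores * scale          # Scaled = scores / sqrt(d_k)

    # Row-wise softmax (subtract the row max for numerical stability)
    exp = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True)   # Weights = softmax(scaled)

    return weights @ V               # Output = weights . V

rng = np.random.default_rng(0)
n, d_k = 6, 8                        # demo defaults: sequence length 6, key dim 8
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_k))
out = scaled_dot_product_attention(Q, K, V)
```

Passing `scale=1.0` reproduces the "no scaling" setting, and `scale=2.0 / d_k` the "hot" setting: a smaller scale flattens the softmax toward uniform attention, a larger one sharpens it. Because each softmax row sums to 1, every output row is a convex combination of the value vectors.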