Pipeline: QKᵀ/√d_k → softmax → weighted sum of V, repeated across H heads (multi-head attention).
Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V. Each query vector in Q asks "what information do I need?", each key vector in K advertises "what do I offer?", and the value vectors in V carry the actual content that gets mixed. The √d_k scaling plays the role of a softmax temperature: a high temperature flattens the weights toward uniform attention, while a low temperature peaks them on the best-matching keys. Multi-head attention runs H such attention functions in parallel, each with its own learned projections, so different heads can capture different relationships.
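The formula above can be sketched directly in NumPy. This is a minimal illustration of scaled dot-product attention for a single head, not an optimized or batched implementation; the shapes and variable names are chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) @ V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

# Tiny example: 2 queries, 3 keys/values, d_k = 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out = attention(Q, K, V)  # shape (2, 4)
```

Dividing `scores` by a larger constant (a higher temperature) flattens each softmax row toward uniform weights; dividing by a smaller one sharpens it, which is the temperature behavior described above.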