Pipeline: QKᵀ/√d_k → softmax → weighted sum of V, repeated across H heads (multi-head attention).
Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V. Each query vector in Q asks "what information do I need?", each key vector in K advertises "what do I offer?", and the value vectors in V carry the actual content that gets mixed. The √d_k scaling plays the role of a softmax temperature: a high temperature flattens the weights toward uniform attention, while a low temperature peaks them on the best-matching keys. Multi-head attention runs H such attention functions in parallel, each with its own learned projections, so different heads can capture different relationships.
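The formula above can be sketched directly in NumPy. This is a minimal illustration of scaled dot-product attention for a single head, not an optimized or batched implementation; the shapes and variable names are chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) @ V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

# Tiny example: 2 queries, 3 keys/values, d_k = 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out = attention(Q, K, V)  # shape (2, 4)
```

Dividing `scores` by a larger constant (a higher temperature) flattens each softmax row toward uniform weights; dividing by a smaller one sharpens it, which is the temperature behavior described above.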