Scaled dot-product attention computes Attention(Q,K,V) = softmax(QK⊤/√d_k)V. Each token produces a query q, a key k, and a value v via learned linear projections of its input embedding. The attention score between positions i and j is q_i·k_j/√d_k, measuring how "relevant" token j is to token i. Softmax normalizes each row of scores into a probability distribution (the attention weights), and the output is the corresponding weighted sum of values. The √d_k scaling keeps the dot products from growing with dimension: if the components of q and k have roughly unit variance, q_i·k_j has variance d_k, so unscaled scores would push softmax into a saturated regime with vanishing gradients. Multi-head attention runs h attention functions in parallel on lower-dimensional projected subspaces (typically d_k = d_model/h) and concatenates the results, allowing the model to jointly attend to information from different representation subspaces.
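The two mechanisms above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function names (`attention`, `multi_head_attention`) and the random projection matrices are assumptions for the example, and masking, dropout, and batching are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # scores[i, j] = q_i . k_j / sqrt(d_k): relevance of token j to token i.
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of value vectors

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """Run h attention heads on projected subspaces and concatenate."""
    n, d_model = X.shape
    d_head = d_model // h
    def project(W):
        # Project X, then reshape to (h, n, d_head): one subspace per head.
        return (X @ W).reshape(n, h, d_head).transpose(1, 0, 2)
    Q, K, V = project(Wq), project(Wk), project(Wv)
    heads = attention(Q, K, V)                        # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ Wo                                # final output projection

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h)      # shape (5, 16)
```

Because `attention` broadcasts over leading axes, the same function serves both the single-head case (2-D inputs) and the per-head case (3-D inputs stacked along the head axis).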