GRAPH ATTENTION

GAT — Learned Attention Weights on Graph Edges

Graph Attention Networks (Veličković et al., 2018) compute a weighted aggregation of neighbor features: h'_i = σ(Σ_{j∈N(i)} α_{ij} W h_j), where the attention weight α_{ij} = softmax_j(e_{ij}) is a softmax over node i's neighborhood of the raw scores e_{ij} = LeakyReLU(aᵀ[Wh_i ‖ Wh_j]). Unlike GCN, which uses fixed degree-normalized weights, GAT learns which neighbors matter most. Multi-head attention (K heads) stabilizes training: the heads' outputs are concatenated (hidden layers) or averaged (output layer). Edge brightness shows the learned attention weight — thicker, brighter edges carry more information flow to the selected node (gold).
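The attention computation above can be sketched for a single head in plain NumPy. This is a minimal illustration, not a trained model: the function name `gat_attention` and the random weights are assumptions for demonstration, and the softmax is computed only over the selected node's neighbors, as in the formula.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    # LeakyReLU as used in the GAT scoring function
    return np.where(x > 0, x, slope * x)

def gat_attention(h, W, a, adj, i):
    """Single-head GAT attention for one target node i.
    h:   (N, F)  node features
    W:   (F, F') shared linear transform
    a:   (2F',)  attention vector
    adj: (N, N)  binary adjacency, self-loops included
    Returns neighbor indices, attention weights alpha_ij, and
    the aggregated (pre-activation) output h'_i."""
    Wh = h @ W                          # transform all nodes: (N, F')
    neighbors = np.flatnonzero(adj[i])  # N(i), including i itself
    # Raw scores e_ij = LeakyReLU(a^T [Wh_i || Wh_j])
    e = np.array([leaky_relu(a @ np.concatenate([Wh[i], Wh[j]]))
                  for j in neighbors])
    alpha = np.exp(e - e.max())         # softmax over the neighborhood
    alpha /= alpha.sum()
    h_new = alpha @ Wh[neighbors]       # weighted aggregation of neighbors
    return neighbors, alpha, h_new

# Toy 4-node graph with random parameters
rng = np.random.default_rng(0)
h = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 2))
a = rng.standard_normal(4)             # 2 * F' = 4
adj = np.array([[1, 1, 0, 1],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [1, 0, 1, 1]])
neighbors, alpha, h_new = gat_attention(h, W, a, adj, i=0)
```

Because the softmax normalizes per neighborhood, the weights for any node sum to 1; in a multi-head layer this routine would run K times with independent W and a, and the K outputs would be concatenated or averaged.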