Attention allows the model to focus on the most relevant parts of the input when producing each output token.
Problems Solved:
• Information bottleneck - the whole input no longer has to be compressed into a single fixed-length vector
• Long sequences - direct connections to every input position shorten the paths information must travel
• Alignment - the model learns which input tokens matter for each output
• Interpretability - attention weights can be visualized to see what the model attends to
🧠 Human Analogy
Like a translator focusing on specific words of the source sentence while producing each word of the translation
How It Works
🎯 Dynamic Context Vector
Different for each output position
🔍 Soft Alignment
Learn where to look automatically
🔄 End-to-End Learning
Attention weights learned during training
📊 Interpretable
Visualize what the model focuses on
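The "dynamic context vector" idea above can be sketched in a few lines: each decoder step softmax-weights the encoder's hidden states with its own scores, so every output position gets its own context vector. This is a toy sketch with hypothetical hand-picked states and scores, assuming NumPy.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a score vector
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical encoder hidden states, one per source word
encoder_states = np.array([[1.0, 0.0],   # "The"
                           [0.0, 1.0],   # "cat"
                           [0.5, 0.5]])  # "sat"

# Different (hand-picked) scores at two decoder steps
scores_step1 = np.array([2.0, 0.1, 0.1])  # attends mostly to "The"
scores_step2 = np.array([0.1, 2.0, 0.1])  # attends mostly to "cat"

# Each step's context vector is its own weighted sum of the states
ctx1 = softmax(scores_step1) @ encoder_states
ctx2 = softmax(scores_step2) @ encoder_states
print(ctx1, ctx2)  # two distinct context vectors
```

Because the scores differ per step, the two context vectors differ, which is exactly what a single fixed encoding cannot provide.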
Attention in Action
Input tokens: The · cat · sat · on · mat
Generating: "Le chat"
attention(Q, K, V) = softmax(QK^T / √d_k)V
Note: Attention weights show the model focusing on "cat" when generating "chat"
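The formula above can be implemented directly. This is a minimal NumPy sketch of scaled dot-product attention (the shapes and random inputs are illustrative, not from the original):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    # Row-wise softmax (subtract max for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights      # context vectors and attention weights

# Toy example: 2 queries attend over 3 key/value pairs, d_k = 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

out, w = attention(Q, K, V)
print(out.shape)       # (2, 4): one context vector per query
print(w.sum(axis=-1))  # each row of attention weights sums to 1
```

Returning the weights alongside the output is what makes the mechanism interpretable: visualizing `w` shows which keys each query attended to.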