Attention Mechanism Explorer
Deep dive into transformer attention with interactive visualizations
Interactive Controls
- Tokens: number of input tokens in the sequence (default 6)
- Heads: number of parallel attention mechanisms
- Head selector: view the pattern of a specific head
- d_model: embedding size
- d_k: per-head dimension, d_k = d_model / num_heads
- Temperature: controls attention sharpness
Mathematical Formulation
Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QK^T / √d_k)V
where Q ∈ ℝ^(n×d_k), K ∈ ℝ^(m×d_k), V ∈ ℝ^(m×d_v)
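A minimal NumPy sketch of this formula, with a row-wise softmax over the keys; the function and variable names are illustrative, not part of any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> output: (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, m) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted average of values

# Tiny example: 4 queries attending over 6 key/value pairs.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 16)
```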
Multi-Head Attention
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
W_i^Q ∈ ℝ^(d_model×d_k), W_i^K ∈ ℝ^(d_model×d_k), W_i^V ∈ ℝ^(d_model×d_v), W^O ∈ ℝ^(hd_v×d_model)
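A compact sketch of the multi-head computation, folding the per-head projections W_i^Q, W_i^K, W_i^V into single d_model×d_model matrices and splitting heads by reshaping; the random initialization and function names are assumptions for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, Wq, Wk, Wv, Wo):
    """X: (n, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)."""
    n, d_model = X.shape
    d_k = d_model // num_heads
    # Project, then split the last dimension into (num_heads, d_k).
    def split(W):
        return (X @ W).reshape(n, num_heads, d_k).transpose(1, 0, 2)  # (h, n, d_k)
    Q, K, V = split(Wq), split(Wk), split(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (h, n, n)
    heads = softmax(scores) @ V                           # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model) # Concat(head_1..head_h)
    return concat @ Wo                                    # output projection W^O

rng = np.random.default_rng(1)
d_model, h, n = 32, 4, 6
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * d_model**-0.5 for _ in range(4))
print(multi_head_attention(X, h, Wq, Wk, Wv, Wo).shape)   # (6, 32)
```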
Positional Encoding
PE_(pos,2i) = sin(pos / 10000^(2i/d_model))
PE_(pos,2i+1) = cos(pos / 10000^(2i/d_model))
where pos is position and i is dimension
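The encoding table can be built directly from these two formulas; the helper below is an illustrative NumPy sketch:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]             # positions 0..max_len-1
    i = np.arange(d_model // 2)[None, :]          # dimension pair index i
    angles = pos / (10000 ** (2 * i / d_model))   # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16); added to token embeddings before the first layer
```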
Complexity Analysis
Time Complexity: O(n²·d) for sequence length n, dimension d
Memory Complexity: O(n²) for storing attention matrix
Self-attention scores every pair of tokens, which is what gives it the quadratic cost above
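A back-of-the-envelope calculation makes the quadratic memory growth concrete; the sequence lengths, fp16 storage, and 8-head count below are illustrative assumptions:

```python
# Rough memory for the attention score matrices of one layer.
for n in (1_024, 8_192, 65_536):
    heads, bytes_per_elem = 8, 2               # 8 heads, fp16 scores (assumed)
    gib = heads * n * n * bytes_per_elem / 2**30
    print(f"n={n:>6}: attention matrices ≈ {gib:8.2f} GiB per layer")
# Quadratic growth: a 64x longer sequence needs ~4096x more score memory.
```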
Key Concepts
Query, Key, Value
The attention mechanism uses three learned projections of the input:
- Query (Q): What information am I looking for?
- Key (K): What information do I contain?
- Value (V): What information should I pass on?
The dot product of Q and K determines attention weights, which are applied to V.
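A toy sketch of the three projections (random matrices stand in for learned weights), showing that each row of softmax(QK^T / √d_k) is a probability distribution over the value vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_k = 8, 8
X = rng.normal(size=(6, d_model))                 # 6 token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # queries, keys, values
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # row i: how token i attends
print(weights[0].round(3), weights[0].sum().round(3))  # distribution over 6 tokens, sums to 1
```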
Scaling Factor
Attention scores are divided by √d_k to prevent the softmax from saturating:
- Without scaling, dot-product magnitudes grow with d_k, pushing the softmax into saturated regions with vanishing gradients
- Scaling maintains stable gradients during training
- Temperature can further control attention sharpness
Lower temperature → sharper attention (more focused)
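A small numeric illustration of the scaling and temperature effects; the score vector and temperature values below are arbitrary:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_k = 64
raw = np.array([16.0, 8.0, 2.0, 0.0])              # unscaled dot products
print(softmax(raw).round(3))                        # nearly one-hot: saturated
print(softmax(raw / np.sqrt(d_k)).round(3))         # scaled: softer distribution
for temp in (0.5, 1.0, 2.0):
    # Lower temperature -> sharper (more focused) attention distribution.
    print(temp, softmax(raw / np.sqrt(d_k) / temp).round(3))
```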
Multi-Head Benefits
Multiple attention heads learn different relationships:
- Syntactic heads: Grammar and structure
- Semantic heads: Meaning and context
- Positional heads: Local vs. global patterns
- Task heads: Problem-specific patterns
Heads run in parallel; their outputs are concatenated and projected back to d_model.
Positional Information
Since self-attention has no built-in notion of token order (permuting the inputs merely permutes the outputs), position must be injected:
- Sinusoidal encodings allow extrapolation
- Different frequencies for different dimensions
- Relative positions can be computed from absolute
- Learned embeddings are an alternative
Enables the model to leverage sequence order.
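A numeric check of the relative-position claim: for each sinusoidal frequency pair, the encoding at position pos + k is a fixed 2×2 rotation of the encoding at position pos (the offset and dimensions below are illustrative):

```python
import numpy as np

d_model, pos, k = 16, 7, 3
i = np.arange(d_model // 2)
omega = 1.0 / (10000 ** (2 * i / d_model))          # per-pair angular frequency

def pe(p):
    # One (sin, cos) pair per frequency: shape (d_model/2, 2).
    return np.stack([np.sin(p * omega), np.cos(p * omega)], axis=-1)

# Rotation by k*omega for each frequency, independent of pos.
M = np.empty((d_model // 2, 2, 2))
M[:, 0, 0] = np.cos(k * omega);  M[:, 0, 1] = np.sin(k * omega)
M[:, 1, 0] = -np.sin(k * omega); M[:, 1, 1] = np.cos(k * omega)

shifted = np.einsum('fij,fj->fi', M, pe(pos))
print(np.allclose(shifted, pe(pos + k)))            # True
```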
Attention Variants
Self-Attention
Q, K, V all come from the same sequence. Used in encoder and decoder layers.
Cross-Attention
Q from decoder, K & V from encoder. Enables sequence-to-sequence tasks.
Masked Attention
Prevents attending to future positions. Essential for autoregressive generation.
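A sketch of causal masking, applying -inf to future positions before the softmax so their weights become exactly zero; the helper names are illustrative:

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    n, d_k = X.shape[0], Wq.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d_k)
    # Upper-triangular mask marks future positions (j > i).
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(causal_self_attention(X, Wq, Wk, Wv).shape)
# (5, 8); row 0 depends only on token 0, row 1 on tokens 0-1, and so on.
```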
Computational Efficiency
Standard attention has O(n²) complexity, which can be prohibitive for long sequences:
- Sparse Attention: Attend to fixed patterns (local, strided, random)
- Linear Attention: Kernel methods to reduce to O(n)
- Flash Attention: IO-aware implementation for hardware efficiency
- Sliding Window: Fixed context window (e.g., Mistral)
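As a sketch of the sliding-window idea, the mask below limits each query to its previous few positions, cutting the number of scored pairs from n² to roughly n·window (the window size and sequence length are illustrative):

```python
import numpy as np

def sliding_window_mask(n, window):
    idx = np.arange(n)
    dist = idx[:, None] - idx[None, :]        # query index minus key index
    # Allowed if the key is within the last `window` positions (causal).
    return (dist >= 0) & (dist < window)      # True where attention is permitted

mask = sliding_window_mask(n=8, window=3)
print(mask.astype(int))
print("scored pairs:", mask.sum(), "vs full:", 8 * 8)
```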