Attention Mechanism Explorer
Deep dive into transformer attention with interactive visualizations
Interactive Controls
- Tokens: number of input tokens in the sequence (default 6)
- Heads: number of parallel attention mechanisms
- Head selector: view the pattern of a specific head
- d_model: embedding size
- d_k: per-head dimension, d_k = d_model / num_heads
- Temperature: controls attention sharpness
Mathematical Formulation
Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QK^T / √d_k)V
where Q ∈ ℝ^(n×d_k), K ∈ ℝ^(m×d_k), V ∈ ℝ^(m×d_v)
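A minimal NumPy sketch of this formula, with a row-wise softmax over the keys; the function and variable names are illustrative, not part of any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> output: (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, m) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted average of values

# Tiny example: 4 queries attending over 6 key/value pairs.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 16)
```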
Multi-Head Attention
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
W_i^Q ∈ ℝ^(d_model×d_k), W_i^K ∈ ℝ^(d_model×d_k), W_i^V ∈ ℝ^(d_model×d_v), W^O ∈ ℝ^(hd_v×d_model)
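A compact sketch of the multi-head computation, folding the per-head projections W_i^Q, W_i^K, W_i^V into single d_model×d_model matrices and splitting heads by reshaping; the random initialization and function names are assumptions for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, Wq, Wk, Wv, Wo):
    """X: (n, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)."""
    n, d_model = X.shape
    d_k = d_model // num_heads
    # Project, then split the last dimension into (num_heads, d_k).
    def split(W):
        return (X @ W).reshape(n, num_heads, d_k).transpose(1, 0, 2)  # (h, n, d_k)
    Q, K, V = split(Wq), split(Wk), split(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (h, n, n)
    heads = softmax(scores) @ V                           # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model) # Concat(head_1..head_h)
    return concat @ Wo                                    # output projection W^O

rng = np.random.default_rng(1)
d_model, h, n = 32, 4, 6
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * d_model**-0.5 for _ in range(4))
print(multi_head_attention(X, h, Wq, Wk, Wv, Wo).shape)   # (6, 32)
```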
Positional Encoding
PE_(pos,2i) = sin(pos / 10000^(2i/d_model))
PE_(pos,2i+1) = cos(pos / 10000^(2i/d_model))
where pos is position and i is dimension
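The encoding table can be built directly from these two formulas; the helper below is an illustrative NumPy sketch:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]             # positions 0..max_len-1
    i = np.arange(d_model // 2)[None, :]          # dimension pair index i
    angles = pos / (10000 ** (2 * i / d_model))   # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16); added to token embeddings before the first layer
```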
Complexity Analysis
Time Complexity: O(n²·d) for sequence length n, dimension d
Memory Complexity: O(n²) for storing attention matrix
Self-attention scores every pair of tokens, which is what gives it the quadratic cost above
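A back-of-the-envelope calculation makes the quadratic memory growth concrete; the sequence lengths, fp16 storage, and 8-head count below are illustrative assumptions:

```python
# Rough memory for the attention score matrices of one layer.
for n in (1_024, 8_192, 65_536):
    heads, bytes_per_elem = 8, 2               # 8 heads, fp16 scores (assumed)
    gib = heads * n * n * bytes_per_elem / 2**30
    print(f"n={n:>6}: attention matrices ≈ {gib:8.2f} GiB per layer")
# Quadratic growth: a 64x longer sequence needs ~4096x more score memory.
```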
Key Concepts
Query, Key, Value
The attention mechanism uses three learned projections of the input:
- Query (Q): What information am I looking for?
- Key (K): What information do I contain?
- Value (V): What information should I pass on?
The dot product of Q and K determines attention weights, which are applied to V.
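A toy sketch of the three projections (random matrices stand in for learned weights), showing that each row of softmax(QK^T / √d_k) is a probability distribution over the value vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_k = 8, 8
X = rng.normal(size=(6, d_model))                 # 6 token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # queries, keys, values
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # row i: how token i attends
print(weights[0].round(3), weights[0].sum().round(3))  # distribution over 6 tokens, sums to 1
```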
Scaling Factor
Attention scores are divided by √d_k to prevent the softmax from saturating:
- Without scaling, dot-product magnitudes grow with d_k, pushing the softmax into saturated regions with vanishing gradients
- Scaling maintains stable gradients during training
- Temperature can further control attention sharpness
Lower temperature → sharper attention (more focused)
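A small numeric illustration of the scaling and temperature effects; the score vector and temperature values below are arbitrary:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_k = 64
raw = np.array([16.0, 8.0, 2.0, 0.0])              # unscaled dot products
print(softmax(raw).round(3))                        # nearly one-hot: saturated
print(softmax(raw / np.sqrt(d_k)).round(3))         # scaled: softer distribution
for temp in (0.5, 1.0, 2.0):
    # Lower temperature -> sharper (more focused) attention distribution.
    print(temp, softmax(raw / np.sqrt(d_k) / temp).round(3))
```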
Multi-Head Benefits
Multiple attention heads learn different relationships:
- Syntactic heads: Grammar and structure
- Semantic heads: Meaning and context
- Positional heads: Local vs. global patterns
- Task heads: Problem-specific patterns
Heads run in parallel; their outputs are concatenated and projected back to d_model.
Positional Information
Since self-attention has no built-in notion of token order (permuting the inputs merely permutes the outputs), position must be injected:
- Sinusoidal encodings allow extrapolation
- Different frequencies for different dimensions
- Relative positions can be computed from absolute
- Learned embeddings are an alternative
Enables the model to leverage sequence order.
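A numeric check of the relative-position claim: for each sinusoidal frequency pair, the encoding at position pos + k is a fixed 2×2 rotation of the encoding at position pos (the offset and dimensions below are illustrative):

```python
import numpy as np

d_model, pos, k = 16, 7, 3
i = np.arange(d_model // 2)
omega = 1.0 / (10000 ** (2 * i / d_model))          # per-pair angular frequency

def pe(p):
    # One (sin, cos) pair per frequency: shape (d_model/2, 2).
    return np.stack([np.sin(p * omega), np.cos(p * omega)], axis=-1)

# Rotation by k*omega for each frequency, independent of pos.
M = np.empty((d_model // 2, 2, 2))
M[:, 0, 0] = np.cos(k * omega);  M[:, 0, 1] = np.sin(k * omega)
M[:, 1, 0] = -np.sin(k * omega); M[:, 1, 1] = np.cos(k * omega)

shifted = np.einsum('fij,fj->fi', M, pe(pos))
print(np.allclose(shifted, pe(pos + k)))            # True
```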
Attention Variants
Self-Attention
Q, K, V all come from the same sequence. Used in encoder and decoder layers.
Cross-Attention
Q from decoder, K & V from encoder. Enables sequence-to-sequence tasks.
Masked Attention
Prevents attending to future positions. Essential for autoregressive generation.
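A sketch of causal masking, applying -inf to future positions before the softmax so their weights become exactly zero; the helper names are illustrative:

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    n, d_k = X.shape[0], Wq.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d_k)
    # Upper-triangular mask marks future positions (j > i).
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(causal_self_attention(X, Wq, Wk, Wv).shape)
# (5, 8); row 0 depends only on token 0, row 1 on tokens 0-1, and so on.
```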
Computational Efficiency
Standard attention has O(n²) complexity, which can be prohibitive for long sequences:
- Sparse Attention: Attend to fixed patterns (local, strided, random)
- Linear Attention: Kernel methods to reduce to O(n)
- Flash Attention: IO-aware implementation for hardware efficiency
- Sliding Window: Fixed context window (e.g., Mistral)
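As a sketch of the sliding-window idea, the mask below limits each query to its previous few positions, cutting the number of scored pairs from n² to roughly n·window (the window size and sequence length are illustrative):

```python
import numpy as np

def sliding_window_mask(n, window):
    idx = np.arange(n)
    dist = idx[:, None] - idx[None, :]        # query index minus key index
    # Allowed if the key is within the last `window` positions (causal).
    return (dist >= 0) & (dist < window)      # True where attention is permitted

mask = sliding_window_mask(n=8, window=3)
print(mask.astype(int))
print("scored pairs:", mask.sum(), "vs full:", 8 * 8)
```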