Attention Mechanism Explorer

Deep dive into transformer attention with interactive visualizations


Interactive Controls

Adjustable parameters in the visualization:

  • Tokens: number of input tokens (default 6)
  • Heads: number of parallel attention mechanisms, with a selector to view a specific head's pattern
  • d_model: embedding size, with d_k = d_model / num_heads per head
  • Temperature: controls attention sharpness

Mathematical Formulation

Scaled Dot-Product Attention

Attention(Q, K, V) = softmax(QK^T / √d_k)V

where Q ∈ ℝ^(n×d_k), K ∈ ℝ^(m×d_k), V ∈ ℝ^(m×d_v)
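A minimal NumPy sketch of this formula; the function name and the row-wise softmax implementation are illustrative, not taken from any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> output of shape (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (n, m) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # weighted sum of value rows

# Example: 6 tokens attending to themselves, d_k = d_v = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
out = scaled_dot_product_attention(Q, K, V)             # shape (6, 8)
```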

Multi-Head Attention

MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O

where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

W_i^Q ∈ ℝ^(d_model×d_k), W_i^K ∈ ℝ^(d_model×d_k), W_i^V ∈ ℝ^(d_model×d_v), W^O ∈ ℝ^(hd_v×d_model)
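A NumPy sketch of the same computation. It uses one combined projection per role and splits it across heads, which is equivalent to the per-head W_i matrices above; the random weights are stand-ins for learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, Wq, Wk, Wv, Wo):
    """X: (n, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)."""
    n, d_model = X.shape
    d_k = d_model // num_heads
    # Project, then split the last dimension into heads: (num_heads, n, d_k)
    def project(W):
        return (X @ W).reshape(n, num_heads, d_k).transpose(1, 0, 2)
    Q, K, V = project(Wq), project(Wk), project(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)    # (num_heads, n, n)
    heads = softmax(scores) @ V                         # (num_heads, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ Wo                                  # final output projection

# Example with 6 tokens, d_model = 32, 4 heads (so d_k = 8)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 32))
W = [rng.normal(size=(32, 32)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, 4, *W)                    # shape (6, 32)
```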

Positional Encoding

PE_(pos,2i) = sin(pos / 10000^(2i/d_model))

PE_(pos,2i+1) = cos(pos / 10000^(2i/d_model))

where pos is position and i is dimension
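A short sketch of these two formulas (assumes d_model is even):

```python
import numpy as np

def positional_encoding(num_positions, d_model):
    pos = np.arange(num_positions)[:, None]            # (num_positions, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model / 2)
    angle = pos / np.power(10000.0, 2 * i / d_model)   # one frequency per dimension pair
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angle)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)   # odd dimensions: cosine
    return pe

pe = positional_encoding(6, 32)   # one row per position, added to the token embeddings
```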

Complexity Analysis

Time Complexity: O(n²·d) for sequence length n, dimension d

Memory Complexity: O(n²) for storing attention matrix

Self-attention captures all pairwise token interactions, which is what drives the quadratic cost
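A back-of-the-envelope check of the quadratic memory claim, assuming a single fp32 attention matrix for one head:

```python
# Memory for one n x n attention matrix at 4 bytes per float
for n in (512, 2048, 8192, 32768):
    bytes_per_matrix = n * n * 4
    print(f"n={n:>6}: {bytes_per_matrix / 2**20:,.0f} MiB per attention matrix")
# Prints 1, 16, 256 and 4,096 MiB: quadratic growth dominates quickly.
```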

Key Concepts

Query, Key, Value

The attention mechanism uses three learned projections of the input:

  • Query (Q): What information am I looking for?
  • Key (K): What information do I contain?
  • Value (V): What information should I pass on?

The dot product of Q and K determines attention weights, which are applied to V.
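A sketch of the three projections; the weight matrices here are random placeholders for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 6, 32, 8
X = rng.normal(size=(n, d_model))        # token embeddings
Wq = rng.normal(size=(d_model, d_k))     # "what am I looking for?"
Wk = rng.normal(size=(d_model, d_k))     # "what do I contain?"
Wv = rng.normal(size=(d_model, d_k))     # "what should I pass on?"

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d_k)          # (n, n): how well each query matches each key
```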

Scaling Factor

Attention scores are divided by √d_k to prevent the softmax from saturating:

  • The variance of a dot product of d_k-dimensional vectors grows with d_k
  • Large scores push the softmax into saturated regions with near-zero gradients
  • Scaling by √d_k keeps scores near unit variance, so gradients stay stable during training
  • Temperature can further control attention sharpness

Lower temperature → sharper attention (more focused)
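A small experiment illustrating why the scaling matters; the specific numbers are illustrative, but the effect holds for any large d_k:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512
q, keys = rng.normal(size=d_k), rng.normal(size=(6, d_k))

raw = keys @ q                   # unscaled dot products, std ~ sqrt(d_k)
scaled = raw / np.sqrt(d_k)      # rescaled to roughly unit variance

print(softmax(raw))              # nearly one-hot: gradients through it vanish
print(softmax(scaled))           # smoother distribution over the 6 keys

# Temperature works the same way: softmax(scores / T), smaller T -> sharper weights.
```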

Multi-Head Benefits

Multiple attention heads learn different relationships:

  • Syntactic heads: Grammar and structure
  • Semantic heads: Meaning and context
  • Positional heads: Local vs. global patterns
  • Task heads: Problem-specific patterns

Heads operate in parallel, then concatenate and project.

Positional Information

Since attention is permutation-invariant, position must be injected:

  • Sinusoidal encodings allow extrapolation
  • Different frequencies for different dimensions
  • Relative offsets can be expressed as linear functions of the absolute encodings
  • Learned embeddings are an alternative

Enables the model to leverage sequence order.

Attention Variants

Self-Attention

Q, K, V all come from the same sequence. Used in encoder and decoder layers.

Cross-Attention

Q from decoder, K & V from encoder. Enables sequence-to-sequence tasks.

Masked Attention

Prevents attending to future positions. Essential for autoregressive generation.
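A sketch of causal masking, following the same NumPy conventions as the earlier examples; cross-attention differs only in where Q, K, and V come from:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Q, K, V: (n, d_k); row i of the output depends only on positions 0..i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(mask, -np.inf, scores)           # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Cross-attention would instead take Q from the decoder and K, V from the encoder,
# with no causal mask over the encoder positions.
```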

Computational Efficiency

Standard attention has O(n²) complexity, which can be prohibitive for long sequences:

  • Sparse Attention: Attend to fixed patterns (local, strided, random)
  • Linear Attention: Kernel methods to reduce to O(n)
  • Flash Attention: IO-aware implementation for hardware efficiency
  • Sliding Window: Fixed context window (e.g., Mistral)
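A sketch of a sliding-window mask as used by local-attention variants; the window size and helper name are illustrative:

```python
import numpy as np

def sliding_window_mask(n, window):
    """Each token attends only to itself and the previous window-1 tokens,
    reducing work from O(n^2) to O(n * window)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)   # causal and within the window

mask = sliding_window_mask(8, window=3)
print(mask.astype(int))                  # banded lower-triangular pattern
```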