Transformers & Attention Mechanisms

The Attention Revolution

Transformers eliminated the need for recurrence and convolutions in sequence modeling, relying entirely on attention mechanisms. This architecture powers modern language models and has transformed NLP, computer vision, and beyond.

Why Attention?

Traditional RNNs process sequences step-by-step, which limits parallelism and makes long-range dependencies hard to preserve. Attention lets the model look at all positions simultaneously, capturing relationships regardless of distance.

Self-Attention Visualization

In the interactive demo, selecting a word (for example "sat") shows its attention pattern: the model learns to attend to semantically related words regardless of their position.

Self-Attention Mechanism

The core innovation: each position attends to all positions in the input sequence, learning which parts are relevant for understanding the current position.

The Process:

  1. Create Q, K, V: Linear projections of the input
  2. Calculate Scores: Dot product of Q with all K
  3. Apply Softmax: Normalize scores to get attention weights
  4. Weighted Sum: Take the weighted sum of V using the attention weights
Attention(Q,K,V) = softmax(QK^T/√d_k)V
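
To make the four steps concrete, here is a minimal NumPy sketch of scaled dot-product attention. The toy dimensions and the randomly initialized projection matrices Wq, Wk, Wv are illustrative placeholders for parameters a real model would learn.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    Returns the attended output and the attention weights.
    """
    d_k = Q.shape[-1]
    # 1. Scores: dot product of each query with every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # 2. Softmax over the key dimension gives one weight distribution per query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # 3. Weighted sum of the value vectors
    return weights @ V, weights

# Toy example: 4 tokens with 8-dimensional embeddings (sizes chosen arbitrarily)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))  # learned in a real model
out, attn = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(attn.shape)  # (4, 4): each position attends to all 4 positions
```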

Multi-Head Attention Visualization

Each attention head learns different patterns; the example heads below each specialize in a different kind of relationship.

  • Head 0 (syntactic): focuses on grammatical relationships
  • Head 1 (positional): attends based on token distance
  • Head 2 (semantic): captures meaning relationships
  • Head 3 (global): broad attention across the sequence

Multi-Head Attention

Instead of one attention function, transformers use multiple "heads" that can attend to different types of relationships. This allows the model to capture various aspects of the data simultaneously.

  • Each head learns different attention patterns
  • Heads are computed in parallel
  • Results are concatenated and projected
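
The following NumPy sketch shows the split-attend-concatenate-project pattern. The joint W_qkv projection, the output projection W_out, and the toy shapes are illustrative choices, not a reference implementation.

```python
import numpy as np

def multi_head_attention(x, W_qkv, W_out, num_heads):
    """Project to Q, K, V, split into heads, attend within each head in parallel,
    then concatenate the heads and apply the output projection."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Joint Q/K/V projection, then reshape each to (num_heads, seq_len, d_head)
    q, k, v = np.split(x @ W_qkv, 3, axis=-1)
    split = lambda t: t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    # Scaled dot-product attention, computed independently for every head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    heads = weights @ v                                   # (num_heads, seq_len, d_head)
    # Concatenate heads back to (seq_len, d_model) and project
    return heads.transpose(1, 0, 2).reshape(seq_len, d_model) @ W_out

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                              # 5 tokens, d_model = 16
out = multi_head_attention(x, rng.normal(size=(16, 48)),
                           rng.normal(size=(16, 16)), num_heads=4)
print(out.shape)  # (5, 16)
```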

Transformer Architecture

Encoder Stack

  • Multi-head self-attention
  • Position-wise feed-forward network
  • Residual connections and layer normalization

Decoder Stack

  • Masked multi-head self-attention
  • Encoder-decoder attention
  • Position-wise feed-forward network
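
Below is a minimal sketch of how these pieces compose into one encoder layer, using post-layer-norm as in the original transformer. The self_attention argument stands in for a multi-head attention function like the one sketched earlier, and the FFN weights W1, b1, W2, b2 are placeholders; the causal_mask helper illustrates the masking idea behind the decoder's masked self-attention.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's feature vector to zero mean and unit variance."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def encoder_layer(x, self_attention, W1, b1, W2, b2):
    """One encoder layer: self-attention, then a position-wise feed-forward network,
    each wrapped in a residual connection followed by layer normalization."""
    x = layer_norm(x + self_attention(x))            # sub-layer 1: multi-head self-attention
    ffn = np.maximum(0.0, x @ W1 + b1) @ W2 + b2     # sub-layer 2: two linear maps with a ReLU
    return layer_norm(x + ffn)

def causal_mask(seq_len):
    """Masking used by the decoder's self-attention: position i may only
    attend to positions <= i (True marks allowed connections)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))
```

A decoder layer has the same overall shape, but its self-attention applies the causal mask, and an encoder-decoder attention sub-layer sits between the masked self-attention and the feed-forward network.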

Positional Encoding Patterns

Sinusoidal positional encodings create a unique pattern for each position: lower dimensions oscillate rapidly, while higher dimensions change slowly.

Key Properties:

  • Each position has a unique encoding
  • Relative positions have consistent patterns
  • Model can extrapolate to longer sequences
  • No learned parameters needed

Positional Encoding

Since attention has no inherent notion of position, transformers add positional encodings to give the model information about the order of tokens in the sequence.

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
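
A short NumPy sketch of the formula above; the max_len and d_model values are arbitrary examples, and the implementation assumes an even d_model.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    positions = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                    # (1, d_model/2)
    angles = positions / np.power(10000, 2 * i / d_model)   # one frequency per sin/cos pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16): added to the token embeddings before the first layer
```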

Key Advantages

  • Parallelization: All positions processed simultaneously
  • Long-range Dependencies: Direct connections between distant positions
  • Interpretability: Attention weights show what the model focuses on
  • Flexibility: Works for various sequence lengths and modalities

Impact & Applications

  • NLP: BERT, GPT, T5 - revolutionized language understanding
  • Computer Vision: Vision Transformer (ViT) - images as sequences
  • Multimodal: CLIP, DALL-E - connecting text and images
  • Time Series: Forecasting and anomaly detection
  • Biology: Protein structure prediction (AlphaFold)