Transformers & Attention Mechanisms
The Attention Revolution
Transformers eliminated the need for recurrence and convolutions in sequence modeling, relying entirely on attention mechanisms. This architecture powers modern language models and has transformed NLP, computer vision, and beyond.
Why Attention?
Traditional RNNs process sequences step by step, which creates a sequential bottleneck and makes long-range dependencies hard to preserve. Attention lets the model look at all positions simultaneously, capturing relationships regardless of distance.
Self-Attention Visualization
Interactive demo: selecting a word such as "sat" shows its attention weights, and the model attends to semantically related words regardless of their position.
Self-Attention Mechanism
The core innovation: each position attends to all positions in the input sequence, learning which parts are relevant for understanding the current position.
The Process:
- Create Q, K, V: Linear projections of the input
- Calculate Scores: Dot product of Q with all K, scaled by √d_k
- Apply Softmax: Normalize scores to get attention weights
- Weighted Sum: Multiply V by attention weights
Attention(Q,K,V) = softmax(QK^T/√d_k)V
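To make these four steps concrete, here is a minimal NumPy sketch of scaled dot-product self-attention; the projection matrices, sequence length, and dimensions are illustrative placeholders, not values from any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q = X @ W_q                           # 1. Create Q, K, V via linear projections
    K = X @ W_k
    V = X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # 2. Dot-product scores, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)    # 3. Softmax over each row -> attention weights
    return weights @ V, weights           # 4. Weighted sum of the values

# Toy example: 5 tokens, model dimension 8, head dimension 4 (illustrative sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape, weights.shape)        # (5, 4) (5, 5); each row of weights sums to 1
```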
Multi-Head Attention
Visualization: attention matrices for four example heads, one focusing on grammatical relationships, one attending based on token distance, one capturing semantic relationships, and one spreading attention broadly across the sequence.
Instead of one attention function, transformers use multiple "heads" that can attend to different types of relationships. This allows the model to capture various aspects of the data simultaneously.
- Each head learns different attention patterns
- Heads are computed in parallel
- Results are concatenated and projected back to the model dimension (sketched below)
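A minimal NumPy sketch of this split-attend-concatenate-project pattern follows; the head count, dimensions, and function names are illustrative assumptions rather than any library's API.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention for a single head
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Split projections into heads, attend in parallel, concatenate, then project."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # (seq_len, d_model) each
    heads = []
    for h in range(n_heads):                          # each head gets its own d_head-wide slice
        sl = slice(h * d_head, (h + 1) * d_head)
        heads.append(attention(Q[:, sl], K[:, sl], V[:, sl]))
    concat = np.concatenate(heads, axis=-1)           # (seq_len, d_model)
    return concat @ W_o                               # final output projection

# Toy example with assumed sizes: 6 tokens, d_model = 16, 4 heads
rng = np.random.default_rng(1)
d_model, n_heads = 16, 4
X = rng.normal(size=(6, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads).shape)  # (6, 16)
```

Because each head operates on a lower-dimensional slice, the total cost is comparable to a single full-width attention.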
Transformer Architecture
Encoder Stack
- Multi-head self-attention
- Position-wise feed-forward network
- Residual connections and layer normalization around each sub-layer (encoder block sketched below)
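The sketch below shows how one encoder block composes these three pieces; single-head attention stands in for the full multi-head version to keep it short, the dimensions are illustrative, and details such as dropout are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean and unit variance
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def feed_forward(X, W1, W2):
    # Position-wise feed-forward network: the same two-layer MLP applied at every position
    return np.maximum(0, X @ W1) @ W2     # ReLU activation

def encoder_block(X, p):
    # Sub-layer 1: self-attention, then residual connection and layer normalization
    X = layer_norm(X + self_attention(X, p["W_q"], p["W_k"], p["W_v"]))
    # Sub-layer 2: feed-forward network, then residual connection and layer normalization
    X = layer_norm(X + feed_forward(X, p["W1"], p["W2"]))
    return X

# Toy sizes (assumed): 6 tokens, d_model = 16, feed-forward hidden size 32
rng = np.random.default_rng(2)
d_model, d_ff = 16, 32
p = {"W_q": rng.normal(size=(d_model, d_model)),
     "W_k": rng.normal(size=(d_model, d_model)),
     "W_v": rng.normal(size=(d_model, d_model)),
     "W1": rng.normal(size=(d_model, d_ff)),
     "W2": rng.normal(size=(d_ff, d_model))}
X = rng.normal(size=(6, d_model))
print(encoder_block(X, p).shape)          # (6, 16): shape is preserved, so blocks can be stacked
```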
Decoder Stack
- Masked multi-head self-attention (causal mask sketched below)
- Encoder-decoder attention
- Position-wise feed-forward network
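The "masked" self-attention in the decoder only changes the scoring step: a causal mask prevents each position from attending to later positions, so generation cannot peek at future tokens. A minimal sketch, with illustrative sizes:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(Q, K, V):
    """Masked (causal) attention: position i may attend only to positions <= i."""
    seq_len = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(mask, -1e9, scores)   # masked scores become ~0 weight after softmax
    weights = softmax(scores)
    return weights @ V, weights

# Toy example (assumed sizes): 4 tokens, head dimension 8
rng = np.random.default_rng(3)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
_, weights = causal_self_attention(Q, K, V)
print(np.round(weights, 2))  # upper-triangular entries are 0: no attention to future tokens
```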
Positional Encoding Patterns
Sinusoidal positional encodings create unique patterns for each position. Lower dimensions change rapidly, higher dimensions change slowly.
Key Properties:
- Each position has a unique encoding
- Relative positions have consistent patterns
- May extrapolate to sequences longer than those seen in training
- No learned parameters needed
Positional Encoding
Since attention has no inherent notion of position, transformers add positional encodings to give the model information about the order of tokens in the sequence.
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
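A minimal NumPy sketch of these formulas, building the full positional-encoding matrix; the sequence length and model dimension are illustrative choices.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: sine on even dimensions, cosine on odd dimensions."""
    pos = np.arange(seq_len)[:, None]               # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model/2) dimension pairs
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                    # PE(pos, 2i+1)
    return pe

# Toy sizes (assumed): 50 positions, d_model = 16
pe = positional_encoding(50, 16)
print(pe.shape)             # (50, 16)
print(np.round(pe[0], 2))   # position 0: alternating 0 (sin 0) and 1 (cos 0)
# Lower dimensions oscillate quickly across positions; higher dimensions change slowly.
```

These encodings are simply added to the token embeddings before the first layer.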
Key Advantages
- Parallelization: All positions processed simultaneously
- Long-range Dependencies: Direct connections between distant positions
- Interpretability: Attention weights show what the model focuses on
- Flexibility: Works for various sequence lengths and modalities
Impact & Applications
- NLP: BERT, GPT, T5 - revolutionized language understanding
- Computer Vision: Vision Transformer (ViT) - images as sequences
- Multimodal: CLIP, DALL-E - connecting text and images
- Time Series: Forecasting and anomaly detection
- Biology: Protein structure prediction (AlphaFold)