Transformers & Attention Mechanisms

The Attention Revolution

Transformers eliminated the need for recurrence and convolutions in sequence modeling, relying entirely on attention mechanisms. This architecture powers modern language models and has transformed NLP, computer vision, and beyond.

Why Attention?

Traditional RNNs process sequences step-by-step, which limits parallelism and makes long-range dependencies hard to preserve. Attention lets the model look at all positions simultaneously, capturing relationships regardless of distance.

Self-Attention Visualization

In the interactive demo, selecting a word (for example "sat") shows its attention pattern: the model learns to attend to semantically related words regardless of their position.

Self-Attention Mechanism

The core innovation: each position attends to all positions in the input sequence, learning which parts are relevant for understanding the current position.

The Process:

  1. Create Q, K, V: Linear projections of the input
  2. Calculate Scores: Dot product of Q with all K
  3. Apply Softmax: Normalize scores to get attention weights
  4. Weighted Sum: Take the weighted sum of V using the attention weights
Attention(Q,K,V) = softmax(QK^T/√d_k)V
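
To make the four steps concrete, here is a minimal NumPy sketch of scaled dot-product attention. The toy dimensions and the randomly initialized projection matrices Wq, Wk, Wv are illustrative placeholders for parameters a real model would learn.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    Returns the attended output and the attention weights.
    """
    d_k = Q.shape[-1]
    # 1. Scores: dot product of each query with every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # 2. Softmax over the key dimension gives one weight distribution per query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # 3. Weighted sum of the value vectors
    return weights @ V, weights

# Toy example: 4 tokens with 8-dimensional embeddings (sizes chosen arbitrarily)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))  # learned in a real model
out, attn = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(attn.shape)  # (4, 4): each position attends to all 4 positions
```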

Multi-Head Attention Visualization

Each attention head learns different patterns; the example heads below each specialize in a different kind of relationship.

  • Head 0 (syntactic): focuses on grammatical relationships
  • Head 1 (positional): attends based on token distance
  • Head 2 (semantic): captures meaning relationships
  • Head 3 (global): broad attention across the sequence

Multi-Head Attention

Instead of one attention function, transformers use multiple "heads" that can attend to different types of relationships. This allows the model to capture various aspects of the data simultaneously.

  • Each head learns different attention patterns
  • Heads are computed in parallel
  • Results are concatenated and projected
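
The following NumPy sketch shows the split-attend-concatenate-project pattern. The joint W_qkv projection, the output projection W_out, and the toy shapes are illustrative choices, not a reference implementation.

```python
import numpy as np

def multi_head_attention(x, W_qkv, W_out, num_heads):
    """Project to Q, K, V, split into heads, attend within each head in parallel,
    then concatenate the heads and apply the output projection."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Joint Q/K/V projection, then reshape each to (num_heads, seq_len, d_head)
    q, k, v = np.split(x @ W_qkv, 3, axis=-1)
    split = lambda t: t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    # Scaled dot-product attention, computed independently for every head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    heads = weights @ v                                   # (num_heads, seq_len, d_head)
    # Concatenate heads back to (seq_len, d_model) and project
    return heads.transpose(1, 0, 2).reshape(seq_len, d_model) @ W_out

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                              # 5 tokens, d_model = 16
out = multi_head_attention(x, rng.normal(size=(16, 48)),
                           rng.normal(size=(16, 16)), num_heads=4)
print(out.shape)  # (5, 16)
```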

Transformer Architecture

Encoder Stack

  • Multi-head self-attention
  • Position-wise feed-forward network
  • Residual connections and layer normalization

Decoder Stack

  • Masked multi-head self-attention
  • Encoder-decoder attention
  • Position-wise feed-forward network
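
Below is a minimal sketch of how these pieces compose into one encoder layer, using post-layer-norm as in the original transformer. The self_attention argument stands in for a multi-head attention function like the one sketched earlier, and the FFN weights W1, b1, W2, b2 are placeholders; the causal_mask helper illustrates the masking idea behind the decoder's masked self-attention.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's feature vector to zero mean and unit variance."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def encoder_layer(x, self_attention, W1, b1, W2, b2):
    """One encoder layer: self-attention, then a position-wise feed-forward network,
    each wrapped in a residual connection followed by layer normalization."""
    x = layer_norm(x + self_attention(x))            # sub-layer 1: multi-head self-attention
    ffn = np.maximum(0.0, x @ W1 + b1) @ W2 + b2     # sub-layer 2: two linear maps with a ReLU
    return layer_norm(x + ffn)

def causal_mask(seq_len):
    """Masking used by the decoder's self-attention: position i may only
    attend to positions <= i (True marks allowed connections)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))
```

A decoder layer has the same overall shape, but its self-attention applies the causal mask, and an encoder-decoder attention sub-layer sits between the masked self-attention and the feed-forward network.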

Positional Encoding Patterns

Sinusoidal positional encodings create a unique pattern for each position: lower dimensions oscillate rapidly, while higher dimensions change slowly.

Key Properties:

  • Each position has a unique encoding
  • Relative positions have consistent patterns
  • Model can extrapolate to longer sequences
  • No learned parameters needed

Positional Encoding

Since attention has no inherent notion of position, transformers add positional encodings to give the model information about the order of tokens in the sequence.

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
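
A short NumPy sketch of the formula above; the max_len and d_model values are arbitrary examples, and the implementation assumes an even d_model.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    positions = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                    # (1, d_model/2)
    angles = positions / np.power(10000, 2 * i / d_model)   # one frequency per sin/cos pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16): added to the token embeddings before the first layer
```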

Key Advantages

  • Parallelization: All positions processed simultaneously
  • Long-range Dependencies: Direct connections between distant positions
  • Interpretability: Attention weights show what the model focuses on
  • Flexibility: Works for various sequence lengths and modalities

Impact & Applications

  • NLP: BERT, GPT, T5 - revolutionized language understanding
  • Computer Vision: Vision Transformer (ViT) - images as sequences
  • Multimodal: CLIP, DALL-E - connecting text and images
  • Time Series: Forecasting and anomaly detection
  • Biology: Protein structure prediction (AlphaFold)