Attention Mechanisms in NLP

Introduction

Attention mechanisms revolutionized NLP by allowing models to focus on relevant parts of the input when making predictions. This breakthrough enabled handling long-range dependencies and became the foundation of transformer models.

← Previous: Sequence Models → Next: Transformers ← Related: Contextual Embeddings

What is Attention?

Attention mechanisms allow models to dynamically focus on different parts of the input sequence when processing each element. Instead of encoding the entire sequence into a fixed-size vector, attention computes weighted combinations of all positions.

Core Concept:

For each output position, attention assigns weights to all input positions, indicating how much to "attend" to each input when generating that output.

Interactive Attention Visualization

Select token to query:

Attention Type:

Show details

Attention Weights

Self-Attention Matrix

Multi-Head Attention

Number of attention heads: 4

Multi-head attention runs multiple attention mechanisms in parallel, each potentially learning different relationships (syntactic, semantic, positional).

Types of Attention

Scaled Dot-Product

score = Q·K^T / √d_k

• Used in Transformers
• Efficient matrix operations
• Scaling prevents gradient issues

Additive (Bahdanau)

score = v^T·tanh(W_q·q + W_k·k)

• Original attention mechanism
• Used in early seq2seq models
• More parameters to learn

Self-Attention

Q = K = V = input

• Attention within same sequence
• Core of Transformer models
• Captures internal dependencies

Attention in Different Architectures

Seq2Seq with Attention

Original application in neural machine translation. Decoder attends to all encoder states instead of just the final hidden state.

Encoder → [h1, h2, ..., hn] → Attention → Decoder

Transformer Architecture

Relies entirely on attention mechanisms, no recurrence or convolution. Uses self-attention in both encoder and decoder.

Input → Self-Attention → Feed-Forward → Output

Cross-Attention

Used in encoder-decoder models where decoder queries attend to encoder keys and values. Essential for tasks like translation and summarization.

Q (decoder) × K,V (encoder) → Context

Why Attention Works

Long-range Dependencies

Direct connections between any two positions, regardless of distance. No vanishing gradient problem like in RNNs.

Parallelization

All positions can be processed simultaneously, unlike sequential RNN processing. Enables efficient training on GPUs.

Interpretability

Attention weights provide insight into what the model is "looking at" when making predictions.

Dynamic Computation

Different attention patterns for different inputs. Model learns what to focus on based on context.

Key Takeaways

Attention allows models to focus on relevant parts of the input dynamically
Scaled dot-product attention is the most common variant in modern models
Multi-head attention enables learning different types of relationships
Self-attention is the core mechanism in transformer architectures
Attention weights provide interpretability into model decisions
Enables parallelization and handles long-range dependencies effectively