
Attention Mechanisms: From Seq2Seq to Multi-Head

By ML Team · 18 min read
Attention · NLP · Deep Learning

Introduction

Attention mechanisms allow neural networks to focus on relevant parts of the input when producing each part of the output. From the original seq2seq attention to modern multi-head attention in transformers, let's explore how these mechanisms evolved.


Self-Attention

Each word attends to all words in the same sequence, capturing internal relationships.

Evolution of Attention

2014: Seq2Seq Attention

Bahdanau et al. introduce attention for neural machine translation

context = Σ(α_i * h_i)
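
As a rough NumPy sketch of one decoder step (the parameter names W_s, W_h, and v for the additive scoring network are illustrative, not from the original post):

    import numpy as np

    def bahdanau_context(s_prev, H, W_s, W_h, v):
        # s_prev: previous decoder state, shape (d_dec,)
        # H: encoder states h_1..h_n, shape (n, d_enc)
        # W_s, W_h, v: parameters of the additive scoring network
        e = np.tanh(s_prev @ W_s + H @ W_h) @ v   # scores e_i, shape (n,)
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                      # attention weights α_i
        return alpha @ H                          # context = Σ α_i * h_i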

2015: Luong Attention

Luong et al. propose simpler, more efficient scoring functions for attention

score = h_t · h_s (dot product)
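
A matching sketch of the dot-product score, assuming the decoder state h_t and the encoder states share a dimension:

    import numpy as np

    def luong_dot_attention(h_t, H_s):
        # h_t: current decoder state, shape (d,)
        # H_s: encoder states, shape (n, d)
        scores = H_s @ h_t                     # score_i = h_t · h_s_i
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()
        return alpha @ H_s                     # same weighted-sum context as above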

2017: Transformer Self-Attention

"Attention is All You Need" - revolutionary parallelizable architecture

Attention(Q,K,V) = softmax(QK^T/√d_k)V

2020: Efficient Attention

Linformer, Performer, and other O(n) attention variants

Reduced complexity from O(n²) to O(n)
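
The efficient variants differ in how they approximate the softmax (Linformer projects keys and values to a lower rank, Performer uses random features), but the common idea can be sketched with a simple kernelized linear attention in the style of Katharopoulos et al. (2020); the feature map below is just one illustrative choice:

    import numpy as np

    def elu_plus_one(x):
        # A positive feature map standing in for the softmax kernel
        return np.where(x > 0, x + 1.0, np.exp(x))

    def linear_attention(Q, K, V):
        # Q, K: (n, d); V: (n, d_v)
        Qp, Kp = elu_plus_one(Q), elu_plus_one(K)
        KV = Kp.T @ V                        # (d, d_v), computed once: O(n·d·d_v)
        Z = Qp @ Kp.sum(axis=0)              # per-query normalizer, shape (n,)
        return (Qp @ KV) / Z[:, None]        # linear in sequence length n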

Types of Attention Patterns

Global Attention

Every position attends to all positions

Local/Window Attention

Attends only to nearby positions

Dilated Attention

Skips positions for efficiency

Causal Attention

Only attends to previous positions
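
One way to picture these patterns is as a boolean mask over the score matrix: position i may only attend to position j where the mask is True. A rough sketch (the dilated rule here is a simplified illustration):

    import numpy as np

    def attention_mask(n, pattern="global", window=2, dilation=2):
        i = np.arange(n)[:, None]   # query positions
        j = np.arange(n)[None, :]   # key positions
        if pattern == "global":
            return np.ones((n, n), dtype=bool)      # everyone sees everyone
        if pattern == "local":
            return np.abs(i - j) <= window          # only nearby positions
        if pattern == "dilated":
            return (np.abs(i - j) % dilation) == 0  # skip positions at a stride
        if pattern == "causal":
            return j <= i                           # only previous positions
        raise ValueError(f"unknown pattern: {pattern}")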

Key Concepts

Query, Key, Value

The fundamental components of attention mechanisms:

  • Query (Q): What information am I looking for?
  • Key (K): What information is available?
  • Value (V): The actual information content
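
In self-attention, all three are produced from the same input sequence by learned projection matrices; a minimal sketch, with random weights standing in for trained ones:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d_model = 5, 8                            # 5 tokens, model width 8
    X = rng.normal(size=(n, d_model))            # token embeddings

    W_q = rng.normal(size=(d_model, d_model))    # learned in practice
    W_k = rng.normal(size=(d_model, d_model))
    W_v = rng.normal(size=(d_model, d_model))

    Q = X @ W_q   # queries: what each token is looking for
    K = X @ W_k   # keys: what each token offers for matching
    V = X @ W_v   # values: the content that actually gets mixed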

Attention Score Calculation

1. Compute scores: scores = Q · K^T
2. Scale: scores = scores / √d_k
3. Apply softmax: weights = softmax(scores)
4. Weight values: output = weights · V
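
Put together as a NumPy sketch (single sequence, no batching; the optional mask follows the patterns described above):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V, mask=None):
        d_k = Q.shape[-1]
        scores = Q @ K.T                               # 1. compute scores
        scores = scores / np.sqrt(d_k)                 # 2. scale by √d_k
        if mask is not None:
            scores = np.where(mask, scores, -1e9)      #    block masked positions
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True) # 3. softmax over keys
        return weights @ V                             # 4. weighted sum of values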

Why Multi-Head?

Different heads can capture different types of relationships:

  • Syntactic relationships (grammar)
  • Semantic relationships (meaning)
  • Positional relationships (distance)
  • Topic relationships (context)
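
A compact sketch of how heads are split, attended over independently, and recombined (NumPy, single sequence; the projection matrices are illustrative stand-ins for learned weights):

    import numpy as np

    def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
        # X: (n, d_model); all W matrices: (d_model, d_model)
        n, d_model = X.shape
        d_head = d_model // num_heads

        def split_heads(M):                  # (n, d_model) -> (heads, n, d_head)
            return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

        Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

        scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (heads, n, n)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        heads = weights @ V                                    # (heads, n, d_head)

        concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads
        return concat @ W_o                                    # output projection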

Conclusion

Attention mechanisms have transformed how we build neural networks, enabling models to dynamically focus on relevant information. From machine translation to language models to computer vision, attention has become a fundamental building block of modern AI systems.
