Contextual Embeddings

Introduction

Unlike static word embeddings (Word2Vec, GloVe), contextual embeddings generate different representations for the same word based on its surrounding context. This revolutionary approach powers modern NLP models like BERT and GPT.

Static vs Contextual Embeddings

Static embeddings assign the same vector regardless of context, while contextual embeddings adapt based on surrounding words.
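
For intuition, here is a minimal sketch contrasting the two, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (any BERT-style encoder behaves similarly). The static row of the input embedding table for "bank" is the same in every sentence, while the contextual vectors produced by the encoder change with the surrounding words.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    model.eval()

    def bank_vector(sentence):
        """Contextual embedding of the token 'bank' in the given sentence."""
        enc = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]           # (seq_len, hidden_dim)
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        return hidden[tokens.index("bank")]

    v_river = bank_vector("We sat on the bank of the river.")
    v_money = bank_vector("She deposited the cash at the bank.")

    # The static embedding is a single row of the input lookup table,
    # identical no matter which sentence the word appears in.
    static_vec = model.get_input_embeddings().weight[tokenizer.convert_tokens_to_ids("bank")]

    cos = torch.nn.functional.cosine_similarity
    print("contextual 'bank' (river vs. money):", cos(v_river, v_money, dim=0).item())
    print("static 'bank' vector norm:", static_vec.norm().item())

The two contextual "bank" vectors are not identical (their cosine similarity is below 1), even though both sentences share the same static embedding row.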

Interactive Contextual Embedding Visualizer

[Interactive visualization: a per-token heatmap of embedding values (blue = positive, red = negative), alongside the self-attention weights showing how much the target word attends to the other words in the sentence.]

Evolution of Contextual Embeddings

ELMo (2018)

  • Architecture: Bidirectional LSTMs
  • Key Innovation: Character-level CNN + bi-LSTM layers
  • Context: Combines forward and backward language models
  • Usage: Feature extraction, concatenated with task-specific models

BERT (2018)

  • Architecture: Transformer encoder (self-attention)
  • Key Innovation: Masked language modeling + next sentence prediction
  • Context: Bidirectional attention over entire sequence
  • Usage: Fine-tuning entire model for downstream tasks

GPT Series (2018-2023)

  • Architecture: Transformer decoder (causal attention)
  • Key Innovation: Autoregressive language modeling at scale
  • Context: Left-to-right attention only
  • Usage: Few-shot learning, prompt engineering

How Contextual Embeddings Work

Position-aware Processing

Unlike static embeddings, contextual models process the entire sequence, as sketched in code after the steps below:

  1. Tokenize input text
  2. Add positional encodings/embeddings
  3. Process through multiple layers
  4. Each layer refines representations based on context
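
A toy sketch of these four steps, using PyTorch building blocks with a made-up vocabulary and randomly initialized weights (illustrative only, not a trained model):

    import torch
    import torch.nn as nn

    vocab = {"[PAD]": 0, "the": 1, "bank": 2, "of": 3, "river": 4, "was": 5, "steep": 6}
    d_model, n_layers = 32, 3

    # 1. Tokenize input text (toy whitespace tokenizer).
    tokens = "the bank of the river was steep".split()
    ids = torch.tensor([[vocab[t] for t in tokens]])              # (1, seq_len)

    # 2. Add positional embeddings to the token embeddings.
    tok_emb = nn.Embedding(len(vocab), d_model)
    pos_emb = nn.Embedding(64, d_model)                           # learned positions, max length 64
    positions = torch.arange(ids.size(1)).unsqueeze(0)
    x = tok_emb(ids) + pos_emb(positions)                         # (1, seq_len, d_model)

    # 3. Process through multiple self-attention layers.
    layers = nn.ModuleList(
        [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True) for _ in range(n_layers)]
    )

    # 4. Each layer refines the representations based on context.
    for i, layer in enumerate(layers):
        x = layer(x)
        print(f"after layer {i}: 'bank' vector norm = {x[0, 1].norm():.3f}")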

Attention Mechanisms

Self-attention allows each token to "look at" all other tokens:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where Q, K, and V stack the query, key, and value vectors for every token, and d_k is the dimensionality of the keys.

This creates context-aware representations at each position.
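
The formula can be implemented in a few lines. The sketch below uses NumPy with randomly generated Q, K, and V matrices purely for illustration:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)            # similarity of every query to every key
        weights = softmax(scores, axis=-1)         # each row sums to 1: how a token spreads its attention
        return weights @ V, weights                # one context-aware output vector per position

    rng = np.random.default_rng(0)
    seq_len, d_k = 5, 8
    Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))

    out, weights = attention(Q, K, V)
    print(out.shape)                               # (5, 8): a context-aware vector for each token
    print(weights[0].round(2))                     # how token 0 attends to the five positions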

Applications and Impact

Disambiguation

Correctly understanding polysemous words (bank, bat, bear) based on context.

Transfer Learning

Pre-trained models can be fine-tuned for specific tasks with relatively little labeled data.
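
As a rough sketch of that recipe, the snippet below puts a fresh classification head on a pretrained encoder and runs a few gradient steps on a made-up two-example dataset (assuming the transformers library and the bert-base-uncased checkpoint):

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2          # new classification head on the pretrained encoder
    )

    texts = ["this movie was wonderful", "what a waste of two hours"]
    labels = torch.tensor([1, 0])                  # toy labels: 1 = positive, 0 = negative
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for step in range(3):                          # a few steps, just to show the loop
        outputs = model(**batch, labels=labels)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        print(f"step {step}: loss = {outputs.loss.item():.4f}")

In practice the same loop runs over a real labeled dataset for a few epochs; little task-specific data is needed because the encoder already carries general language knowledge from pre-training.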

Few-shot Learning

Models like GPT can adapt to new tasks with just a few examples.
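
The pattern can be sketched with the transformers text-generation pipeline. The small gpt2 checkpoint is used here only to illustrate the prompt format; reliable few-shot behaviour emerges only in much larger models:

    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")

    # The task is described entirely in the prompt, with two worked examples.
    prompt = (
        "Classify the sentiment of each review.\n"
        "Review: The plot was gripping from start to finish. Sentiment: positive\n"
        "Review: I nearly fell asleep halfway through. Sentiment: negative\n"
        "Review: The soundtrack alone was worth the ticket. Sentiment:"
    )

    result = generator(prompt, max_new_tokens=3, do_sample=False)
    print(result[0]["generated_text"])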

Key Takeaways

  • Contextual embeddings revolutionized NLP by solving the polysemy problem
  • Each word gets a unique embedding based on its specific context
  • ELMo used LSTMs, while BERT and GPT use transformer architectures
  • Self-attention mechanisms are key to capturing long-range dependencies
  • Pre-training on large corpora enables effective transfer learning
  • These embeddings form the foundation of modern LLMs and NLP systems

Next Steps