Contextual Embeddings

Introduction

Unlike static word embeddings (Word2Vec, GloVe), contextual embeddings generate different representations for the same word based on its surrounding context. This revolutionary approach powers modern NLP models like BERT and GPT.

Static vs Contextual Embeddings

Static embeddings assign the same vector regardless of context, while contextual embeddings adapt based on surrounding words.
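
For intuition, here is a minimal sketch contrasting the two, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (any BERT-style encoder behaves similarly). The static row of the input embedding table for "bank" is the same in every sentence, while the contextual vectors produced by the encoder change with the surrounding words.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    model.eval()

    def bank_vector(sentence):
        """Contextual embedding of the token 'bank' in the given sentence."""
        enc = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]           # (seq_len, hidden_dim)
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        return hidden[tokens.index("bank")]

    v_river = bank_vector("We sat on the bank of the river.")
    v_money = bank_vector("She deposited the cash at the bank.")

    # The static embedding is a single row of the input lookup table,
    # identical no matter which sentence the word appears in.
    static_vec = model.get_input_embeddings().weight[tokenizer.convert_tokens_to_ids("bank")]

    cos = torch.nn.functional.cosine_similarity
    print("contextual 'bank' (river vs. money):", cos(v_river, v_money, dim=0).item())
    print("static 'bank' vector norm:", static_vec.norm().item())

The two contextual "bank" vectors are not identical (their cosine similarity is below 1), even though both sentences share the same static embedding row.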

Interactive Contextual Embedding Visualizer

[Interactive visualization: a per-token heatmap of embedding values (blue = positive, red = negative), alongside the self-attention weights showing how much the target word attends to the other words in the sentence.]

Evolution of Contextual Embeddings

ELMo (2018)

  • Architecture: Bidirectional LSTMs
  • Key Innovation: Character-level CNN + bi-LSTM layers
  • Context: Combines forward and backward language models
  • Usage: Feature extraction, concatenated with task-specific models

BERT (2018)

  • Architecture: Transformer encoder (self-attention)
  • Key Innovation: Masked language modeling + next sentence prediction
  • Context: Bidirectional attention over entire sequence
  • Usage: Fine-tuning entire model for downstream tasks

GPT Series (2018-2023)

  • Architecture: Transformer decoder (causal attention)
  • Key Innovation: Autoregressive language modeling at scale
  • Context: Left-to-right attention only
  • Usage: Few-shot learning, prompt engineering

How Contextual Embeddings Work

Position-aware Processing

Unlike static embeddings, contextual models process the entire sequence, as sketched in code after the steps below:

  1. Tokenize input text
  2. Add positional encodings/embeddings
  3. Process through multiple layers
  4. Each layer refines representations based on context
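
A toy sketch of these four steps, using PyTorch building blocks with a made-up vocabulary and randomly initialized weights (illustrative only, not a trained model):

    import torch
    import torch.nn as nn

    vocab = {"[PAD]": 0, "the": 1, "bank": 2, "of": 3, "river": 4, "was": 5, "steep": 6}
    d_model, n_layers = 32, 3

    # 1. Tokenize input text (toy whitespace tokenizer).
    tokens = "the bank of the river was steep".split()
    ids = torch.tensor([[vocab[t] for t in tokens]])              # (1, seq_len)

    # 2. Add positional embeddings to the token embeddings.
    tok_emb = nn.Embedding(len(vocab), d_model)
    pos_emb = nn.Embedding(64, d_model)                           # learned positions, max length 64
    positions = torch.arange(ids.size(1)).unsqueeze(0)
    x = tok_emb(ids) + pos_emb(positions)                         # (1, seq_len, d_model)

    # 3. Process through multiple self-attention layers.
    layers = nn.ModuleList(
        [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True) for _ in range(n_layers)]
    )

    # 4. Each layer refines the representations based on context.
    for i, layer in enumerate(layers):
        x = layer(x)
        print(f"after layer {i}: 'bank' vector norm = {x[0, 1].norm():.3f}")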

Attention Mechanisms

Self-attention allows each token to "look at" all other tokens:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where Q, K, and V stack the query, key, and value vectors for every token, and d_k is the dimensionality of the keys.

This creates context-aware representations at each position.
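
The formula can be implemented in a few lines. The sketch below uses NumPy with randomly generated Q, K, and V matrices purely for illustration:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)            # similarity of every query to every key
        weights = softmax(scores, axis=-1)         # each row sums to 1: how a token spreads its attention
        return weights @ V, weights                # one context-aware output vector per position

    rng = np.random.default_rng(0)
    seq_len, d_k = 5, 8
    Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))

    out, weights = attention(Q, K, V)
    print(out.shape)                               # (5, 8): a context-aware vector for each token
    print(weights[0].round(2))                     # how token 0 attends to the five positions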

Applications and Impact

Disambiguation

Correctly understanding polysemous words (bank, bat, bear) based on context.

Transfer Learning

Pre-trained models can be fine-tuned for specific tasks with relatively little labeled data.
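
As a rough sketch of that recipe, the snippet below puts a fresh classification head on a pretrained encoder and runs a few gradient steps on a made-up two-example dataset (assuming the transformers library and the bert-base-uncased checkpoint):

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2          # new classification head on the pretrained encoder
    )

    texts = ["this movie was wonderful", "what a waste of two hours"]
    labels = torch.tensor([1, 0])                  # toy labels: 1 = positive, 0 = negative
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for step in range(3):                          # a few steps, just to show the loop
        outputs = model(**batch, labels=labels)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        print(f"step {step}: loss = {outputs.loss.item():.4f}")

In practice the same loop runs over a real labeled dataset for a few epochs; little task-specific data is needed because the encoder already carries general language knowledge from pre-training.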

Few-shot Learning

Models like GPT can adapt to new tasks with just a few examples.
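
The pattern can be sketched with the transformers text-generation pipeline. The small gpt2 checkpoint is used here only to illustrate the prompt format; reliable few-shot behaviour emerges only in much larger models:

    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")

    # The task is described entirely in the prompt, with two worked examples.
    prompt = (
        "Classify the sentiment of each review.\n"
        "Review: The plot was gripping from start to finish. Sentiment: positive\n"
        "Review: I nearly fell asleep halfway through. Sentiment: negative\n"
        "Review: The soundtrack alone was worth the ticket. Sentiment:"
    )

    result = generator(prompt, max_new_tokens=3, do_sample=False)
    print(result[0]["generated_text"])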

Key Takeaways

  • Contextual embeddings revolutionized NLP by solving the polysemy problem
  • Each word gets a unique embedding based on its specific context
  • ELMo used LSTMs, while BERT and GPT use transformer architectures
  • Self-attention mechanisms are key to capturing long-range dependencies
  • Pre-training on large corpora enables effective transfer learning
  • These embeddings form the foundation of modern LLMs and NLP systems

Next Steps