Word Embeddings

Introduction

Word embeddings are dense vector representations of words that capture semantic meaning. Unlike one-hot encoding, embeddings place similar words close together in vector space, enabling mathematical operations on meaning.
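
A toy sketch of the difference (the dense vectors below are hand-picked, not learned): one-hot vectors of distinct words are always orthogonal, while dense vectors let related words score high under cosine similarity.

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # One-hot: any two distinct words are orthogonal, so similarity is always 0
    cat_onehot = np.array([1.0, 0.0, 0.0, 0.0])
    dog_onehot = np.array([0.0, 1.0, 0.0, 0.0])
    print(cosine(cat_onehot, dog_onehot))   # 0.0

    # Dense toy vectors (hand-picked, not learned): related words end up close
    cat = np.array([0.7, 0.3, 0.1, 0.9])
    dog = np.array([0.6, 0.4, 0.2, 0.8])
    car = np.array([-0.5, 0.9, -0.7, 0.1])
    print(cosine(cat, dog))   # ~0.99, "cat" and "dog" are similar
    print(cosine(cat, car))   # near 0, "cat" and "car" are not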

How Word Embeddings Work

Key Concepts

  • Distributional Hypothesis: Words with similar contexts have similar meanings
  • Dense Representations: Typically 50-300 dimensions vs. vocabulary size
  • Semantic Arithmetic: king - man + woman ≈ queen
  • Transfer Learning: Pre-trained embeddings can be fine-tuned
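
As a small sketch of the last point (transfer learning), a pretrained matrix can be dropped into a PyTorch nn.Embedding layer and fine-tuned. The matrix here is random only to keep the snippet self-contained; in practice it would be filled from downloaded GloVe or Word2Vec vectors.

    import torch
    import torch.nn as nn

    # Stand-in for a pretrained matrix (one row per vocabulary word); in practice
    # this tensor would be loaded from GloVe/Word2Vec files.
    vocab_size, embed_dim = 10_000, 100
    pretrained = torch.randn(vocab_size, embed_dim)

    # freeze=False keeps the vectors trainable, so they are fine-tuned on the task;
    # freeze=True would use them as fixed features instead.
    embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)

    token_ids = torch.tensor([[12, 7, 431]])   # a batch with one 3-token sequence
    vectors = embedding(token_ids)             # shape: (1, 3, 100)
    print(vectors.shape)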

Training Methods

  • Word2Vec: Skip-gram and CBOW models
  • GloVe: Global Vectors using co-occurrence statistics
  • FastText: Subword information for morphology
  • Contextual: ELMo, BERT (position-dependent)
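
As a minimal sketch of the first method (assuming gensim is installed and using a toy in-memory corpus), the sg flag switches between Skip-gram and CBOW:

    from gensim.models import Word2Vec

    # Toy corpus of pre-tokenized sentences; a real corpus would be far larger.
    sentences = [
        ["the", "king", "rules", "the", "kingdom"],
        ["the", "queen", "rules", "the", "kingdom"],
        ["the", "dog", "chases", "the", "cat"],
    ]

    # sg=1 selects Skip-gram, sg=0 selects CBOW; negative=5 uses negative sampling.
    model = Word2Vec(sentences, vector_size=50, window=2, sg=1,
                     negative=5, min_count=1, epochs=50)

    print(model.wv.most_similar("king", topn=3))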

Interactive Word Embedding Visualizer

Embedding Space

Click words to select them. Selected words show connections.

Training Architecture

Visualization of the current training method architecture.

Training Word2Vec

Skip-gram Model

Predicts context words given center word

  • Input: One-hot encoded center word
  • Hidden: Word embedding (no activation)
  • Output: Softmax over vocabulary
  • Objective: Maximize P(context|center)

P(w_c | w_t) = exp(v_c · v_t) / Σ_i exp(v_i · v_t)
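
A deliberately naive numpy rendering of that probability with toy matrices; computing the full softmax over the vocabulary like this is exactly the cost that the optimization tricks below avoid.

    import numpy as np

    rng = np.random.default_rng(0)
    V, d = 1000, 50                      # vocabulary size, embedding dimension
    W_in  = rng.normal(size=(V, d))      # center-word ("input") vectors v_t
    W_out = rng.normal(size=(V, d))      # context-word ("output") vectors v_c

    def skipgram_prob(center_id, context_id):
        """P(w_c | w_t) = exp(v_c . v_t) / sum_i exp(v_i . v_t)"""
        v_t = W_in[center_id]
        scores = W_out @ v_t                          # one score per vocab word
        scores -= scores.max()                        # numerical stability
        probs = np.exp(scores) / np.exp(scores).sum()
        return probs[context_id]

    print(skipgram_prob(center_id=3, context_id=17))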

CBOW Model

Predicts center word given context

  • Input: Sum/average of context word vectors
  • Hidden: Combined embedding representation
  • Output: Softmax over vocabulary
  • Objective: Maximize P(center|context)

P(w_t | context) = exp(v_t · h) / Σ_i exp(v_i · h)
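
The CBOW version, again as a toy numpy sketch, differs only in forming the hidden vector h as the average of the context vectors before the same softmax.

    import numpy as np

    rng = np.random.default_rng(0)
    V, d = 1000, 50
    W_in  = rng.normal(size=(V, d))      # context ("input") vectors
    W_out = rng.normal(size=(V, d))      # center ("output") vectors

    def cbow_prob(context_ids, center_id):
        """P(w_t | context) = exp(v_t . h) / sum_i exp(v_i . h)"""
        h = W_in[context_ids].mean(axis=0)            # average the context vectors
        scores = W_out @ h
        scores -= scores.max()
        probs = np.exp(scores) / np.exp(scores).sum()
        return probs[center_id]

    print(cbow_prob(context_ids=[5, 9, 21, 40], center_id=3))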

Optimization Techniques

  • Negative Sampling: Instead of computing the full softmax, update only the true (center, context) pair and k randomly sampled negative examples (sketched after this list)
  • Hierarchical Softmax: A binary tree over the vocabulary cuts the per-word cost from O(|V|) to O(log |V|)
  • Subsampling: Down-sample very frequent words such as "the" and "a", which carry little signal
  • Dynamic Window: Randomly shrink the context window each step, which implicitly gives nearby words more weight
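
A sketch of the negative-sampling objective for a single (center, context) pair, as referenced above; only the true context vector and k sampled negatives are touched, instead of all |V| output vectors.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_loss(v_t, v_c, negative_vecs):
        """Skip-gram negative-sampling loss for one (center, context) pair.
        v_t: center vector, v_c: true context vector,
        negative_vecs: (k, d) array of sampled negative context vectors."""
        pos = np.log(sigmoid(v_c @ v_t))                    # pull the true pair together
        neg = np.log(sigmoid(-negative_vecs @ v_t)).sum()   # push sampled negatives apart
        return -(pos + neg)                                 # minimize this

    rng = np.random.default_rng(0)
    d, k = 50, 5
    print(sgns_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d))))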

Where Word Embeddings Are Used

Traditional NLP

  • Text classification
  • Named entity recognition
  • Sentiment analysis
  • Machine translation

Modern Applications

  • Input to transformer models
  • RAG system embeddings
  • Semantic search (see the sketch after this list)
  • Recommendation systems
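
One hedged illustration of the semantic-search item above: with any word-vector table (random stand-ins here, so this particular ranking is meaningless), queries and documents can be embedded by averaging their word vectors and ranked by cosine similarity.

    import numpy as np

    rng = np.random.default_rng(0)
    # Stand-in embedding table; real usage would load trained vectors instead.
    vocab = ["cheap", "flights", "to", "rome", "pasta", "recipes",
             "budget", "airfare", "italy"]
    E = {w: rng.normal(size=50) for w in vocab}

    def embed(text):
        """Average the vectors of the tokens that are in the vocabulary."""
        vecs = [E[w] for w in text.lower().split() if w in E]
        return np.mean(vecs, axis=0)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    docs = ["cheap flights to rome", "pasta recipes", "budget airfare to italy"]
    q = embed("cheap airfare")
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    print(ranked)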

Related Concepts

  • Sentence embeddings
  • Document embeddings
  • Cross-lingual embeddings
  • Multimodal embeddings

Word Algebra

Word embeddings enable fascinating arithmetic operations. The classic example "king - man + woman = queen" demonstrates how semantic relationships are encoded in vector space.

Classic Examples

  • king - man + woman → queen
  • paris - france + italy → rome
  • good - better + bad → worse

Click to try these examples!

Why It Works

Word embeddings encode semantic relationships as consistent vector offsets. The offset from "man" to "king" roughly captures royalty/leadership; adding that same offset to "woman" lands near "queen". In practice the answer is the nearest neighbor of the resulting vector, with the query words themselves excluded, which is why the relation is approximate rather than exact.
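
A hedged sketch using gensim's downloadable pretrained GloVe vectors (the model name below is one of gensim's standard downloads; the first call fetches it over the network, and most_similar excludes the query words from the returned neighbors):

    import gensim.downloader as api

    # "glove-wiki-gigaword-50": 50-dimensional GloVe vectors, downloaded on first use
    glove = api.load("glove-wiki-gigaword-50")

    # king - man + woman: sum the "positive" vectors, subtract the "negative" ones,
    # then report the nearest neighbors (the query words themselves are excluded)
    print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    print(glove.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))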

Add Custom Words

Expand the vocabulary by adding custom words with example contexts. The system will learn embeddings for your words based on their usage.

💡 Tips for adding words:

  • Provide multiple example sentences for better embeddings
  • Use the word in different contexts
  • Include related words in the context
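
The page's own update mechanism isn't shown here, but as a rough gensim analogue, an existing Word2Vec model can pick up a new word from a handful of custom sentences:

    from gensim.models import Word2Vec

    base = [["the", "cat", "sat", "on", "the", "mat"],
            ["the", "dog", "sat", "on", "the", "rug"]]
    model = Word2Vec(base, vector_size=50, window=2, min_count=1, epochs=50)

    # New word "axolotl" used in several different contexts, per the tips above.
    extra = [["the", "axolotl", "swims", "in", "the", "tank"],
             ["my", "pet", "axolotl", "eats", "worms"],
             ["the", "axolotl", "is", "an", "aquatic", "salamander"]]

    model.build_vocab(extra, update=True)                       # extend the vocabulary
    model.train(extra, total_examples=len(extra), epochs=50)    # learn the new vectors

    print(model.wv.most_similar("axolotl", topn=3))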

Key Takeaways

  • Word embeddings capture semantic relationships in dense vectors
  • Training uses the distributional hypothesis: words that appear in similar contexts have similar meanings
  • Word2Vec offers two architectures: Skip-gram (better for rare words) and CBOW (faster)
  • Embeddings enable semantic arithmetic and analogy tasks
  • Pre-trained embeddings are foundational for modern NLP and transformers
  • Quality depends on corpus size, dimension choice, and training parameters

Next Steps