Word Embeddings

Introduction

Word embeddings are dense vector representations of words that capture semantic meaning. Unlike one-hot encoding, embeddings place similar words close together in vector space, enabling mathematical operations on meaning.
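
A toy sketch of the difference (the dense vectors below are hand-picked, not learned): one-hot vectors of distinct words are always orthogonal, while dense vectors let related words score high under cosine similarity.

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # One-hot: any two distinct words are orthogonal, so similarity is always 0
    cat_onehot = np.array([1.0, 0.0, 0.0, 0.0])
    dog_onehot = np.array([0.0, 1.0, 0.0, 0.0])
    print(cosine(cat_onehot, dog_onehot))   # 0.0

    # Dense toy vectors (hand-picked, not learned): related words end up close
    cat = np.array([0.7, 0.3, 0.1, 0.9])
    dog = np.array([0.6, 0.4, 0.2, 0.8])
    car = np.array([-0.5, 0.9, -0.7, 0.1])
    print(cosine(cat, dog))   # ~0.99, "cat" and "dog" are similar
    print(cosine(cat, car))   # near 0, "cat" and "car" are not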

How Word Embeddings Work

Key Concepts

  • Distributional Hypothesis: Words with similar contexts have similar meanings
  • Dense Representations: Typically 50-300 dimensions vs. vocabulary size
  • Semantic Arithmetic: king - man + woman ≈ queen
  • Transfer Learning: Pre-trained embeddings can be fine-tuned
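
As a small sketch of the last point (transfer learning), a pretrained matrix can be dropped into a PyTorch nn.Embedding layer and fine-tuned. The matrix here is random only to keep the snippet self-contained; in practice it would be filled from downloaded GloVe or Word2Vec vectors.

    import torch
    import torch.nn as nn

    # Stand-in for a pretrained matrix (one row per vocabulary word); in practice
    # this tensor would be loaded from GloVe/Word2Vec files.
    vocab_size, embed_dim = 10_000, 100
    pretrained = torch.randn(vocab_size, embed_dim)

    # freeze=False keeps the vectors trainable, so they are fine-tuned on the task;
    # freeze=True would use them as fixed features instead.
    embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)

    token_ids = torch.tensor([[12, 7, 431]])   # a batch with one 3-token sequence
    vectors = embedding(token_ids)             # shape: (1, 3, 100)
    print(vectors.shape)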

Training Methods

  • Word2Vec: Skip-gram and CBOW models
  • GloVe: Global Vectors using co-occurrence statistics
  • FastText: Subword information for morphology
  • Contextual: ELMo, BERT (position-dependent)
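
As a minimal sketch of the first method (assuming gensim is installed and using a toy in-memory corpus), the sg flag switches between Skip-gram and CBOW:

    from gensim.models import Word2Vec

    # Toy corpus of pre-tokenized sentences; a real corpus would be far larger.
    sentences = [
        ["the", "king", "rules", "the", "kingdom"],
        ["the", "queen", "rules", "the", "kingdom"],
        ["the", "dog", "chases", "the", "cat"],
    ]

    # sg=1 selects Skip-gram, sg=0 selects CBOW; negative=5 uses negative sampling.
    model = Word2Vec(sentences, vector_size=50, window=2, sg=1,
                     negative=5, min_count=1, epochs=50)

    print(model.wv.most_similar("king", topn=3))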

Interactive Word Embedding Visualizer

Embedding Space

Click words to select them. Selected words show connections.

Training Architecture

Visualization of the current training method architecture.

Training Word2Vec

Skip-gram Model

Predicts context words given center word

  • Input: One-hot encoded center word
  • Hidden: Word embedding (no activation)
  • Output: Softmax over vocabulary
  • Objective: Maximize P(context|center)

P(w_c | w_t) = exp(v_c · v_t) / Σ_i exp(v_i · v_t)
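
A deliberately naive numpy rendering of that probability with toy matrices; computing the full softmax over the vocabulary like this is exactly the cost that the optimization tricks below avoid.

    import numpy as np

    rng = np.random.default_rng(0)
    V, d = 1000, 50                      # vocabulary size, embedding dimension
    W_in  = rng.normal(size=(V, d))      # center-word ("input") vectors v_t
    W_out = rng.normal(size=(V, d))      # context-word ("output") vectors v_c

    def skipgram_prob(center_id, context_id):
        """P(w_c | w_t) = exp(v_c . v_t) / sum_i exp(v_i . v_t)"""
        v_t = W_in[center_id]
        scores = W_out @ v_t                          # one score per vocab word
        scores -= scores.max()                        # numerical stability
        probs = np.exp(scores) / np.exp(scores).sum()
        return probs[context_id]

    print(skipgram_prob(center_id=3, context_id=17))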

CBOW Model

Predicts center word given context

  • Input: Sum/average of context word vectors
  • Hidden: Combined embedding representation
  • Output: Softmax over vocabulary
  • Objective: Maximize P(center|context)

P(w_t | context) = exp(v_t · h) / Σ_i exp(v_i · h)
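
The CBOW version, again as a toy numpy sketch, differs only in forming the hidden vector h as the average of the context vectors before the same softmax.

    import numpy as np

    rng = np.random.default_rng(0)
    V, d = 1000, 50
    W_in  = rng.normal(size=(V, d))      # context ("input") vectors
    W_out = rng.normal(size=(V, d))      # center ("output") vectors

    def cbow_prob(context_ids, center_id):
        """P(w_t | context) = exp(v_t . h) / sum_i exp(v_i . h)"""
        h = W_in[context_ids].mean(axis=0)            # average the context vectors
        scores = W_out @ h
        scores -= scores.max()
        probs = np.exp(scores) / np.exp(scores).sum()
        return probs[center_id]

    print(cbow_prob(context_ids=[5, 9, 21, 40], center_id=3))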

Optimization Techniques

  • Negative Sampling: Instead of computing the full softmax, update only the true (center, context) pair and k randomly sampled negative examples (sketched after this list)
  • Hierarchical Softmax: A binary tree over the vocabulary cuts the per-word cost from O(|V|) to O(log |V|)
  • Subsampling: Down-sample very frequent words such as "the" and "a", which carry little signal
  • Dynamic Window: Randomly shrink the context window each step, which implicitly gives nearby words more weight
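
A sketch of the negative-sampling objective for a single (center, context) pair, as referenced above; only the true context vector and k sampled negatives are touched, instead of all |V| output vectors.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_loss(v_t, v_c, negative_vecs):
        """Skip-gram negative-sampling loss for one (center, context) pair.
        v_t: center vector, v_c: true context vector,
        negative_vecs: (k, d) array of sampled negative context vectors."""
        pos = np.log(sigmoid(v_c @ v_t))                    # pull the true pair together
        neg = np.log(sigmoid(-negative_vecs @ v_t)).sum()   # push sampled negatives apart
        return -(pos + neg)                                 # minimize this

    rng = np.random.default_rng(0)
    d, k = 50, 5
    print(sgns_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d))))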

Where Word Embeddings Are Used

Traditional NLP

  • Text classification
  • Named entity recognition
  • Sentiment analysis
  • Machine translation

Modern Applications

  • Input to transformer models
  • RAG system embeddings
  • Semantic search (see the sketch after this list)
  • Recommendation systems
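
One hedged illustration of the semantic-search item above: with any word-vector table (random stand-ins here, so this particular ranking is meaningless), queries and documents can be embedded by averaging their word vectors and ranked by cosine similarity.

    import numpy as np

    rng = np.random.default_rng(0)
    # Stand-in embedding table; real usage would load trained vectors instead.
    vocab = ["cheap", "flights", "to", "rome", "pasta", "recipes",
             "budget", "airfare", "italy"]
    E = {w: rng.normal(size=50) for w in vocab}

    def embed(text):
        """Average the vectors of the tokens that are in the vocabulary."""
        vecs = [E[w] for w in text.lower().split() if w in E]
        return np.mean(vecs, axis=0)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    docs = ["cheap flights to rome", "pasta recipes", "budget airfare to italy"]
    q = embed("cheap airfare")
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    print(ranked)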

Related Concepts

  • Sentence embeddings
  • Document embeddings
  • Cross-lingual embeddings
  • Multimodal embeddings

Word Algebra

Word embeddings enable fascinating arithmetic operations. The classic example "king - man + woman = queen" demonstrates how semantic relationships are encoded in vector space.

Classic Examples

  • king - man + woman → queen
  • paris - france + italy → rome
  • good - better + bad → worse

Click to try these examples!

Why It Works

Word embeddings encode semantic relationships as consistent vector offsets. The offset from "man" to "king" roughly captures royalty/leadership; adding that same offset to "woman" lands near "queen". In practice the answer is the nearest neighbor of the resulting vector, with the query words themselves excluded, which is why the relation is approximate rather than exact.
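
A hedged sketch using gensim's downloadable pretrained GloVe vectors (the model name below is one of gensim's standard downloads; the first call fetches it over the network, and most_similar excludes the query words from the returned neighbors):

    import gensim.downloader as api

    # "glove-wiki-gigaword-50": 50-dimensional GloVe vectors, downloaded on first use
    glove = api.load("glove-wiki-gigaword-50")

    # king - man + woman: sum the "positive" vectors, subtract the "negative" ones,
    # then report the nearest neighbors (the query words themselves are excluded)
    print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    print(glove.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))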

Add Custom Words

Expand the vocabulary by adding custom words with example contexts. The system will learn embeddings for your words based on their usage.

💡 Tips for adding words:

  • Provide multiple example sentences for better embeddings
  • Use the word in different contexts
  • Include related words in the context
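
The page's own update mechanism isn't shown here, but as a rough gensim analogue, an existing Word2Vec model can pick up a new word from a handful of custom sentences:

    from gensim.models import Word2Vec

    base = [["the", "cat", "sat", "on", "the", "mat"],
            ["the", "dog", "sat", "on", "the", "rug"]]
    model = Word2Vec(base, vector_size=50, window=2, min_count=1, epochs=50)

    # New word "axolotl" used in several different contexts, per the tips above.
    extra = [["the", "axolotl", "swims", "in", "the", "tank"],
             ["my", "pet", "axolotl", "eats", "worms"],
             ["the", "axolotl", "is", "an", "aquatic", "salamander"]]

    model.build_vocab(extra, update=True)                       # extend the vocabulary
    model.train(extra, total_examples=len(extra), epochs=50)    # learn the new vectors

    print(model.wv.most_similar("axolotl", topn=3))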

Key Takeaways

  • Word embeddings capture semantic relationships in dense vectors
  • Training uses the distributional hypothesis: words that appear in similar contexts have similar meanings
  • Word2Vec offers two architectures: Skip-gram (better for rare words) and CBOW (faster)
  • Embeddings enable semantic arithmetic and analogy tasks
  • Pre-trained embeddings are foundational for modern NLP and transformers
  • Quality depends on corpus size, dimension choice, and training parameters

Next Steps