Vector Embeddings Fundamentals
From Words to Vectors
Vector embeddings are the foundation of modern NLP and semantic search. They transform discrete symbols (words, sentences, documents) into continuous vector representations where semantic similarity corresponds to geometric proximity.
What Makes a Good Embedding?
- Semantic Preservation: Similar meanings → nearby vectors
- Compositionality: Vector operations reflect semantic operations
- Dimensionality: Balance between expressiveness and efficiency
- Generalization: Useful representations for downstream tasks
Word2Vec Vector Space
Words are mapped to vectors where semantic relationships become geometric relationships.
Key Observations:
- Gender-related word pairs are connected by roughly parallel offset vectors
- Country-capital relationships are consistent across pairs
- Similar words cluster together
- Vector arithmetic captures analogies
Evolution of Embeddings
1. One-Hot Encoding
Sparse vectors with a single 1 and zeros elsewhere; no semantic information is captured.
2. Word2Vec (2013)
Learns vectors from context windows using skip-gram or CBOW; captures analogies.
3. GloVe (2014)
Combines global co-occurrence matrix factorization with local context; often better on word-similarity benchmarks.
4. Contextual Embeddings (2018+)
BERT, GPT: Same word gets different vectors based on context.
5. Sentence/Document Embeddings
SBERT, Universal Sentence Encoder: Embed full texts, not just words.
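To make this last step concrete, here is a minimal sketch of producing sentence embeddings with the sentence-transformers library; the model name all-MiniLM-L6-v2 and the example sentences are illustrative assumptions, not prescribed choices.

```python
# Minimal sketch using sentence-transformers; "all-MiniLM-L6-v2" is one
# common (illustrative) choice of model, producing 384-dimensional vectors.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Neural networks learn hierarchical features.",
    "Deep learning models extract layered representations.",
    "The recipe calls for two cups of flour.",
]
embeddings = model.encode(sentences)  # shape: (3, 384)

# The two related sentences end up close together; the unrelated one does not.
print(embeddings.shape)
```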
Semantic Search in Embedding Space
A query is embedded into the same space as the documents, and documents within a similarity threshold of the query are retrieved.
How it works:
- Embed the query into the same space as the documents
- Compute similarity between the query and every document
- Retrieve documents above the similarity threshold
- Rank results by similarity score (see the sketch below)
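Putting these steps together, a minimal NumPy sketch; the `embed` function is a stand-in for any embedding model and is assumed here purely for illustration.

```python
import numpy as np

def semantic_search(query, doc_texts, doc_embeddings, embed, threshold=0.5):
    """Embed the query, score every document, filter by threshold, rank by score.

    `embed` maps text to a vector in the same space as `doc_embeddings`
    (an (n_docs, dim) array); it stands in for a real embedding model.
    """
    q = embed(query)                                   # 1. embed the query
    q = q / np.linalg.norm(q)
    d = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    scores = d @ q                                     # 2. cosine similarity to all documents
    hits = [(s, t) for s, t in zip(scores, doc_texts) if s >= threshold]  # 3. threshold
    return sorted(hits, key=lambda h: h[0], reverse=True)                 # 4. rank by score
```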
Example queries:
- "neural networks"
- "text analysis"
- "image recognition"
- "database optimization"
Key Properties
Vector Arithmetic
king - man + woman ≈ queen
Paris - France + Italy ≈ Rome
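A hedged sketch of this analogy arithmetic with gensim's pre-trained vectors; the model name glove-wiki-gigaword-100 is an illustrative choice, and any word-vector model containing these words would work.

```python
# Assumes gensim is installed and can download the (illustrative) model below.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # pre-trained word vectors

# king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# paris - france + italy ≈ rome  (this GloVe vocabulary is lowercased)
print(vectors.most_similar(positive=["paris", "italy"], negative=["france"], topn=1))
```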
Similarity Measures
- Cosine Similarity: cos(θ) = (a·b)/(||a|| ||b||)
- Euclidean Distance: ||a - b||₂
- Dot Product: a·b (for normalized vectors)
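The three measures above translate directly into NumPy; this is a small sketch with toy vectors, not a library API.

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(θ) = (a·b) / (||a|| ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    # ||a - b||₂
    return np.linalg.norm(a - b)

def dot_product(a, b):
    # a·b; equals cosine similarity when both vectors are unit-normalized
    return np.dot(a, b)

a = np.array([0.2, 0.8, 0.1])
b = np.array([0.3, 0.7, 0.0])
print(cosine_similarity(a, b), euclidean_distance(a, b), dot_product(a, b))
```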
Dimensionality
- Word2Vec: 50-300 dimensions
- BERT: 768 dimensions
- GPT-3: 12,288 dimensions
- Modern models: 384-1536 typical
Similarity Metrics Comparison
Different similarity metrics capture different notions of "closeness" in embedding space.
When to use each metric:
- Cosine: When magnitude doesn't matter (e.g., document similarity)
- Euclidean: When absolute position matters (e.g., clustering)
- Dot Product: Fast; combines magnitude and angle (equivalent to cosine for normalized embeddings)
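A quick illustration of how the metrics can disagree when magnitudes differ; the toy vectors are chosen purely for demonstration.

```python
import numpy as np

def cos(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([1.0, 0.0])
b = np.array([10.0, 0.0])  # same direction as a, much larger magnitude
c = np.array([1.0, 1.0])   # different direction, magnitude closer to a

# Cosine treats a and b as identical (same direction); Euclidean puts them far apart.
print(cos(a, b), np.linalg.norm(a - b))  # 1.0, 9.0
print(cos(a, c), np.linalg.norm(a - c))  # ~0.707, 1.0
```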
Creating Embeddings
Training Methods
- Predictive: Predict context (Word2Vec) or next token (GPT)
- Contrastive: Pull similar items together, push dissimilar items apart
- Masked: Predict masked tokens (BERT)
- Autoencoding: Compress and reconstruct
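As an example of the contrastive approach, here is a minimal InfoNCE-style loss with in-batch negatives in PyTorch; the function name and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, positive_emb, temperature=0.07):
    """Pull each query toward its positive; push it away from other items in the batch."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = q @ p.T / temperature                       # similarity of every query to every candidate
    labels = torch.arange(q.size(0), device=q.device)    # the i-th positive matches the i-th query
    return F.cross_entropy(logits, labels)
```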
Fine-tuning
Start with pre-trained embeddings, then fine-tune on domain-specific data for better performance on specialized tasks.
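A hedged sketch of fine-tuning a pre-trained sentence embedding model on (query, relevant passage) pairs using the classic sentence-transformers fit API; the model name, training pairs, and hyperparameters are placeholders.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Domain-specific (query, relevant passage) pairs; placeholder examples.
train_examples = [
    InputExample(texts=["latency spike in api gateway", "Troubleshooting API gateway timeouts"]),
    InputExample(texts=["reset user password", "Account recovery and password reset guide"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# In-batch negatives: other passages in the batch act as negatives for each query.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```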
Applications
- Semantic Search: Find documents by meaning, not keywords
- Recommendation: Find similar items in embedding space
- Clustering: Group similar documents automatically
- Classification: Use embeddings as features
- RAG Systems: Retrieve relevant context for LLMs
- Deduplication: Find near-duplicate content
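Several of these applications, clustering and deduplication for instance, reduce to simple operations on the embedding matrix. A sketch with scikit-learn and NumPy, assuming the embeddings are already computed (the array here is random placeholder data):

```python
import numpy as np
from sklearn.cluster import KMeans

# `embeddings` is an (n_docs, dim) array produced by any embedding model;
# random placeholder data is used here for illustration only.
embeddings = np.random.rand(100, 384)

# Clustering: group similar documents automatically.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(embeddings)

# Deduplication: flag pairs whose cosine similarity exceeds a high threshold.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
sim = normed @ normed.T
dupes = np.argwhere(np.triu(sim, k=1) > 0.95)  # candidate near-duplicate pairs
```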
Why Embeddings Matter
- Semantic Understanding: Capture meaning, not just surface forms
- Efficiency: Dense representations enable fast similarity search
- Transfer Learning: Pre-trained embeddings work across tasks
- Multimodal: Text, images, and audio can share an embedding space (e.g., CLIP for text and images)
- Foundation: Essential for modern NLP and search systems