Vector Embeddings Fundamentals

From Words to Vectors

Vector embeddings are the foundation of modern NLP and semantic search. They transform discrete symbols (words, sentences, documents) into continuous vector representations where semantic similarity corresponds to geometric proximity.

What Makes a Good Embedding?

  • Semantic Preservation: Similar meanings → nearby vectors
  • Compositionality: Vector operations reflect semantic operations
  • Dimensionality: Balance between expressiveness and efficiency
  • Generalization: Useful representations for downstream tasks

Word2Vec Vector Space

Words are mapped to vectors where semantic relationships become geometric relationships.


Key Observations:

  • Gender relationships are parallel vectors
  • Country-capital relationships are consistent
  • Similar words cluster together
  • Vector arithmetic captures analogies
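To make the analogy idea concrete, here is a minimal sketch using NumPy. The 3-dimensional vectors are invented for illustration; real Word2Vec vectors are learned from data and typically have 50-300 dimensions (with a trained model you would load vectors from disk instead, e.g. via gensim's KeyedVectors).

```python
# Toy illustration of analogy arithmetic in an embedding space.
# The vectors below are made up for demonstration, not trained values.
import numpy as np

embeddings = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.8, 0.1, 0.6]),
    "man":   np.array([0.2, 0.7, 0.0]),
    "woman": np.array([0.2, 0.2, 0.5]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land near queen
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]

best = max(
    (w for w in embeddings if w not in {"king", "man", "woman"}),
    key=lambda w: cosine(target, embeddings[w]),
)
print(best)  # -> "queen" with these toy vectors
```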

Evolution of Embeddings

1. One-Hot Encoding

Sparse vectors with a single 1 and 0s elsewhere. No semantic information: every pair of words is equally distant.

2. Word2Vec (2013)

Learns from context windows using skip-gram or CBOW. Captures analogies through vector arithmetic.

3. GloVe (2014)

Combines global co-occurrence statistics (matrix factorization) with local context windows. Performs better on word-similarity benchmarks.

4. Contextual Embeddings (2018+)

BERT, GPT: Same word gets different vectors based on context.
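A quick way to see this is to embed the same word in two different sentences and compare the vectors. The sketch below assumes the Hugging Face transformers and torch packages are installed; bert-base-uncased is just one common model choice.

```python
# Sketch: the same word gets different vectors in different contexts.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual vector of `word`'s token in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

a = embed_word("she sat by the river bank", "bank")
b = embed_word("he deposited cash at the bank", "bank")
print(torch.cosine_similarity(a, b, dim=0).item())  # < 1.0: context changes the vector
```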

5. Sentence/Document Embeddings

SBERT, Universal Sentence Encoder: Embed full texts, not just words.

Semantic Search in Embedding Space

Embeddings enable semantic search: the query is embedded into the same space as the documents, and documents within the similarity threshold are retrieved.

How it works:

  1. Query is embedded to same space
  2. Calculate similarity to all documents
  3. Retrieve those above threshold
  4. Rank by similarity score
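The four steps above can be sketched in a few lines. The document titles, the 4-dimensional vectors, and the query vector are stand-ins; a real system would compute embeddings with a model (e.g. a sentence-transformers model) rather than hard-coding them.

```python
# Minimal semantic-search sketch over pre-computed document embeddings.
import numpy as np

doc_vectors = {
    "Intro to neural networks": np.array([0.9, 0.1, 0.0, 0.1]),
    "Tokenizing text for NLP":  np.array([0.2, 0.9, 0.1, 0.0]),
    "Tuning database indexes":  np.array([0.0, 0.1, 0.9, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_vec, threshold=0.5):
    # Steps 1-2: compare the query vector with every document vector
    scored = [(title, cosine(query_vec, v)) for title, v in doc_vectors.items()]
    # Step 3: keep documents above the similarity threshold
    hits = [(t, s) for t, s in scored if s >= threshold]
    # Step 4: rank by similarity score, best first
    return sorted(hits, key=lambda x: x[1], reverse=True)

query = np.array([0.8, 0.2, 0.0, 0.1])   # pretend this is embed("neural networks")
print(search(query))  # only the neural-networks document clears the threshold
```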

Try these queries:

  • • "neural networks"
  • • "text analysis"
  • • "image recognition"
  • • "database optimization"

Key Properties

Vector Arithmetic

king - man + woman ≈ queen
Paris - France + Italy ≈ Rome

Similarity Measures

  • Cosine Similarity: cos(θ) = (a·b)/(||a|| ||b||)
  • Euclidean Distance: ||a - b||₂
  • Dot Product: a·b (equals cosine similarity when the vectors are normalized)
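These measures are short to write out. The sketch below uses made-up vectors purely to show the definitions side by side.

```python
# The three similarity measures above, written out with NumPy.
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

def dot_product(a, b):
    return float(a @ b)

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))   # 1.0 -- same direction
print(euclidean_distance(a, b))  # ~3.74 -- magnitudes differ
print(dot_product(a / np.linalg.norm(a), b / np.linalg.norm(b)))  # 1.0 on normalized vectors
```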

Dimensionality

  • Word2Vec: 50-300 dimensions
  • BERT: 768 dimensions
  • GPT-3: 12,288 dimensions
  • Modern models: 384-1536 typical

Similarity Metrics Comparison

Different similarity metrics capture different notions of "closeness" in embedding space.


When to use each metric:

  • Cosine: When magnitude doesn't matter (e.g., document similarity)
  • Euclidean: When absolute position matters (e.g., clustering)
  • Dot Product: Fast; combines magnitude and angle (equivalent to cosine similarity on normalized embeddings)
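A small check (again with invented vectors) shows the difference: scaling a vector changes Euclidean distance and the raw dot product, but leaves cosine similarity untouched, which is why cosine is the default when only direction (meaning) should matter.

```python
# Doubling a vector's magnitude leaves cosine similarity unchanged but
# changes Euclidean distance and the raw dot product.
import numpy as np

a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.9, 0.1])

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

for scale in (1.0, 2.0):
    scaled = scale * b
    print(scale,
          round(cosine(a, scaled), 3),                  # unchanged
          round(float(np.linalg.norm(a - scaled)), 3),  # grows with scale
          round(float(a @ scaled), 3))                  # grows with scale
```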

Creating Embeddings

Training Methods

  • Predictive: Predict context (Word2Vec) or next token (GPT)
  • Contrastive: Pull similar items together, push different apart
  • Masked: Predict masked tokens (BERT)
  • Autoencoding: Compress and reconstruct
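As one example of the contrastive idea, here is a minimal InfoNCE-style loss written with NumPy on random stand-in embeddings (not a full training loop): each query's paired positive should score higher than the other in-batch negatives.

```python
# InfoNCE-style contrastive loss sketch.
# Embeddings are random stand-ins; in training they come from the model.
import numpy as np

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))    # batch of 4 query embeddings
docs = rng.normal(size=(4, 8))       # positives: row i pairs with row i

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(q, d, temperature=0.1):
    q, d = normalize(q), normalize(d)
    logits = q @ d.T / temperature    # scaled cosine similarities
    # softmax cross-entropy with the diagonal (true pairs) as targets
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

print(info_nce(queries, docs))  # lower loss = positives pulled apart from negatives
```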

Fine-tuning

Start with pre-trained embeddings, then fine-tune on domain-specific data for better performance on specialized tasks.
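A hedged sketch of what this can look like with the sentence-transformers library, assuming it is installed: the model name, hyperparameters, and the tiny domain_pairs dataset are illustrative placeholders, not a recommended recipe.

```python
# Fine-tuning a pre-trained sentence embedding model on domain pairs.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")   # pre-trained starting point

domain_pairs = [                                  # placeholder domain data
    InputExample(texts=["myocardial infarction", "heart attack"]),
    InputExample(texts=["renal failure", "kidney failure"]),
]
loader = DataLoader(domain_pairs, shuffle=True, batch_size=2)

# Contrastive objective: paired texts are pulled together,
# in-batch negatives are pushed apart.
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("my-domain-embedding-model")
```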

Applications

  • Semantic Search: Find documents by meaning, not keywords
  • Recommendation: Find similar items in embedding space
  • Clustering: Group similar documents automatically
  • Classification: Use embeddings as features
  • RAG Systems: Retrieve relevant context for LLMs
  • Deduplication: Find near-duplicate content

Why Embeddings Matter

  • Semantic Understanding: Capture meaning, not just surface forms
  • Efficiency: Dense representations enable fast similarity search
  • Transfer Learning: Pre-trained embeddings work across tasks
  • Multimodal: Same space for text, images, audio (CLIP)
  • Foundation: Essential for modern NLP and search systems