Vector Embeddings Fundamentals

From Words to Vectors

Vector embeddings are the foundation of modern NLP and semantic search. They transform discrete symbols (words, sentences, documents) into continuous vector representations where semantic similarity corresponds to geometric proximity.

What Makes a Good Embedding?

  • Semantic Preservation: Similar meanings → nearby vectors
  • Compositionality: Vector operations reflect semantic operations
  • Dimensionality: Balance between expressiveness and efficiency
  • Generalization: Useful representations for downstream tasks

Word2Vec Vector Space

Words are mapped to vectors where semantic relationships become geometric relationships.


Key Observations:

  • Gender relationships are parallel vectors
  • Country-capital relationships are consistent
  • Similar words cluster together
  • Vector arithmetic captures analogies
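To make the analogy idea concrete, here is a minimal sketch using NumPy. The 3-dimensional vectors are invented for illustration; real Word2Vec vectors are learned from data and typically have 50-300 dimensions (with a trained model you would load vectors from disk instead, e.g. via gensim's KeyedVectors).

```python
# Toy illustration of analogy arithmetic in an embedding space.
# The vectors below are made up for demonstration, not trained values.
import numpy as np

embeddings = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.8, 0.1, 0.6]),
    "man":   np.array([0.2, 0.7, 0.0]),
    "woman": np.array([0.2, 0.2, 0.5]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land near queen
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]

best = max(
    (w for w in embeddings if w not in {"king", "man", "woman"}),
    key=lambda w: cosine(target, embeddings[w]),
)
print(best)  # -> "queen" with these toy vectors
```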

Evolution of Embeddings

1. One-Hot Encoding

Sparse vectors with a single 1 and 0s elsewhere. No semantic information: every pair of words is equally distant.

2. Word2Vec (2013)

Learns from context windows using skip-gram or CBOW. Captures analogies through vector arithmetic.

3. GloVe (2014)

Combines global co-occurrence statistics (matrix factorization) with local context windows. Performs better on word-similarity benchmarks.

4. Contextual Embeddings (2018+)

BERT, GPT: Same word gets different vectors based on context.
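A quick way to see this is to embed the same word in two different sentences and compare the vectors. The sketch below assumes the Hugging Face transformers and torch packages are installed; bert-base-uncased is just one common model choice.

```python
# Sketch: the same word gets different vectors in different contexts.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_word(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual vector of `word`'s token in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

a = embed_word("she sat by the river bank", "bank")
b = embed_word("he deposited cash at the bank", "bank")
print(torch.cosine_similarity(a, b, dim=0).item())  # < 1.0: context changes the vector
```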

5. Sentence/Document Embeddings

SBERT, Universal Sentence Encoder: Embed full texts, not just words.

Semantic Search in Embedding Space

Embeddings enable semantic search: the query is embedded into the same space as the documents, and documents within the similarity threshold are retrieved.

How it works:

  1. Query is embedded to same space
  2. Calculate similarity to all documents
  3. Retrieve those above threshold
  4. Rank by similarity score
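The four steps above can be sketched in a few lines. The document titles, the 4-dimensional vectors, and the query vector are stand-ins; a real system would compute embeddings with a model (e.g. a sentence-transformers model) rather than hard-coding them.

```python
# Minimal semantic-search sketch over pre-computed document embeddings.
import numpy as np

doc_vectors = {
    "Intro to neural networks": np.array([0.9, 0.1, 0.0, 0.1]),
    "Tokenizing text for NLP":  np.array([0.2, 0.9, 0.1, 0.0]),
    "Tuning database indexes":  np.array([0.0, 0.1, 0.9, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_vec, threshold=0.5):
    # Steps 1-2: compare the query vector with every document vector
    scored = [(title, cosine(query_vec, v)) for title, v in doc_vectors.items()]
    # Step 3: keep documents above the similarity threshold
    hits = [(t, s) for t, s in scored if s >= threshold]
    # Step 4: rank by similarity score, best first
    return sorted(hits, key=lambda x: x[1], reverse=True)

query = np.array([0.8, 0.2, 0.0, 0.1])   # pretend this is embed("neural networks")
print(search(query))  # only the neural-networks document clears the threshold
```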

Try these queries:

  • • "neural networks"
  • • "text analysis"
  • • "image recognition"
  • • "database optimization"

Key Properties

Vector Arithmetic

king - man + woman ≈ queen
Paris - France + Italy ≈ Rome

Similarity Measures

  • Cosine Similarity: cos(θ) = (a·b)/(||a|| ||b||)
  • Euclidean Distance: ||a - b||₂
  • Dot Product: a·b (equals cosine similarity when the vectors are normalized)
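These measures are short to write out. The sketch below uses made-up vectors purely to show the definitions side by side.

```python
# The three similarity measures above, written out with NumPy.
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

def dot_product(a, b):
    return float(a @ b)

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))   # 1.0 -- same direction
print(euclidean_distance(a, b))  # ~3.74 -- magnitudes differ
print(dot_product(a / np.linalg.norm(a), b / np.linalg.norm(b)))  # 1.0 on normalized vectors
```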

Dimensionality

  • Word2Vec: 50-300 dimensions
  • BERT: 768 dimensions
  • GPT-3: 12,288 dimensions
  • Modern models: 384-1536 typical

Similarity Metrics Comparison

Different similarity metrics capture different notions of "closeness" in embedding space.


When to use each metric:

  • Cosine: When magnitude doesn't matter (e.g., document similarity)
  • Euclidean: When absolute position matters (e.g., clustering)
  • Dot Product: Fast; combines magnitude and angle (equivalent to cosine similarity on normalized embeddings)
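A small check (again with invented vectors) shows the difference: scaling a vector changes Euclidean distance and the raw dot product, but leaves cosine similarity untouched, which is why cosine is the default when only direction (meaning) should matter.

```python
# Doubling a vector's magnitude leaves cosine similarity unchanged but
# changes Euclidean distance and the raw dot product.
import numpy as np

a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.9, 0.1])

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

for scale in (1.0, 2.0):
    scaled = scale * b
    print(scale,
          round(cosine(a, scaled), 3),                  # unchanged
          round(float(np.linalg.norm(a - scaled)), 3),  # grows with scale
          round(float(a @ scaled), 3))                  # grows with scale
```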

Creating Embeddings

Training Methods

  • Predictive: Predict context (Word2Vec) or next token (GPT)
  • Contrastive: Pull similar items together, push different apart
  • Masked: Predict masked tokens (BERT)
  • Autoencoding: Compress and reconstruct
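As one example of the contrastive idea, here is a minimal InfoNCE-style loss written with NumPy on random stand-in embeddings (not a full training loop): each query's paired positive should score higher than the other in-batch negatives.

```python
# InfoNCE-style contrastive loss sketch.
# Embeddings are random stand-ins; in training they come from the model.
import numpy as np

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))    # batch of 4 query embeddings
docs = rng.normal(size=(4, 8))       # positives: row i pairs with row i

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(q, d, temperature=0.1):
    q, d = normalize(q), normalize(d)
    logits = q @ d.T / temperature    # scaled cosine similarities
    # softmax cross-entropy with the diagonal (true pairs) as targets
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

print(info_nce(queries, docs))  # lower loss = positives pulled apart from negatives
```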

Fine-tuning

Start with pre-trained embeddings, then fine-tune on domain-specific data for better performance on specialized tasks.
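A hedged sketch of what this can look like with the sentence-transformers library, assuming it is installed: the model name, hyperparameters, and the tiny domain_pairs dataset are illustrative placeholders, not a recommended recipe.

```python
# Fine-tuning a pre-trained sentence embedding model on domain pairs.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")   # pre-trained starting point

domain_pairs = [                                  # placeholder domain data
    InputExample(texts=["myocardial infarction", "heart attack"]),
    InputExample(texts=["renal failure", "kidney failure"]),
]
loader = DataLoader(domain_pairs, shuffle=True, batch_size=2)

# Contrastive objective: paired texts are pulled together,
# in-batch negatives are pushed apart.
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("my-domain-embedding-model")
```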

Applications

  • Semantic Search: Find documents by meaning, not keywords
  • Recommendation: Find similar items in embedding space
  • Clustering: Group similar documents automatically
  • Classification: Use embeddings as features
  • RAG Systems: Retrieve relevant context for LLMs
  • Deduplication: Find near-duplicate content

Why Embeddings Matter

  • Semantic Understanding: Capture meaning, not just surface forms
  • Efficiency: Dense representations enable fast similarity search
  • Transfer Learning: Pre-trained embeddings work across tasks
  • Multimodal: Same space for text, images, audio (CLIP)
  • Foundation: Essential for modern NLP and search systems