Bag of Words (BoW) Embeddings

Introduction

Bag of Words is one of the simplest text representation methods. It represents text as a vector where each dimension corresponds to a word in the vocabulary, and the value is the word's frequency (or TF-IDF score) in the document.

How Bag of Words Works

1. Build Vocabulary

Extract all unique words from the corpus and assign each an index.

vocab = {'the': 0, 'cat': 1, 'sat': 2, ...}
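A minimal sketch of this step in plain Python (the toy corpus and whitespace tokenization are illustrative assumptions):

corpus = ['the cat sat on the mat', 'the dog ran']   # toy corpus (assumption)

vocab = {}
for doc in corpus:
    for word in doc.lower().split():    # simple whitespace tokenization
        if word not in vocab:
            vocab[word] = len(vocab)    # assign the next free index

# vocab == {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4, 'dog': 5, 'ran': 6}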

2. Count Words

For each document, count how many times each vocabulary word appears.

'the cat sat on the mat' → [2, 1, 1, 1, 1, 0, ...]
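The counting itself can be done with a standard Counter; a minimal sketch:

from collections import Counter

counts = Counter('the cat sat on the mat'.lower().split())
# Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})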

3. Create Vector

The resulting vector represents the document in high-dimensional space.

doc_vector = [tf1, tf2, ..., tfn]
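Putting the two steps together, a hypothetical bow_vector helper (reusing the vocab built above) maps a document to its vector:

def bow_vector(doc, vocab):
    vec = [0] * len(vocab)              # one dimension per vocabulary word
    for word in doc.lower().split():
        if word in vocab:               # out-of-vocabulary words are dropped
            vec[vocab[word]] += 1       # count occurrences
    return vec

bow_vector('the cat sat on the mat', vocab)   # [2, 1, 1, 1, 1, 0, 0]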

Interactive BoW Visualizer

[Interactive demo: add a custom document, inspect its BoW vector representation, and compare documents in a cosine-similarity matrix (brighter = more similar).]
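Cosine similarity compares two vectors by the angle between them, ignoring their lengths. A minimal sketch, reusing the hypothetical bow_vector helper and vocab from above:

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

v1 = bow_vector('the cat sat on the mat', vocab)
v2 = bow_vector('the dog ran', vocab)
cosine_similarity(v1, v2)   # ≈ 0.41: the documents share only 'the'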

Advantages and Limitations

✓ Advantages

  • Simple and intuitive to understand
  • Fast to compute and implement
  • Works well for document classification
  • No training required
  • Interpretable features (word counts)

✗ Limitations

  • Ignores word order and grammar
  • High-dimensional, sparse vectors
  • No semantic understanding
  • Vocabulary can grow very large
  • Treats "not good" the same as "good not"

Variations and Extensions

Binary BoW

Uses 1/0 to indicate word presence instead of counts. Useful when frequency isn't important.
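As a sketch, scikit-learn's CountVectorizer supports this directly through its binary flag (assuming scikit-learn is installed):

from sklearn.feature_extraction.text import CountVectorizer

binary_bow = CountVectorizer(binary=True)           # 1/0 presence instead of counts
X = binary_bow.fit_transform(['the cat sat on the mat'])
print(X.toarray())   # [[1 1 1 1 1]]: 'the' maps to 1, not 2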

TF-IDF

Weights words by how informative they are across the corpus: term frequency is multiplied by inverse document frequency, which down-weights words that appear in many documents.
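A minimal sketch with scikit-learn's TfidfVectorizer (the toy corpus is an assumption):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the cat sat on the mat', 'the dog ran']    # toy corpus (assumption)
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
# 'the' occurs in both documents, so its weight is scaled down
# relative to document-specific words like 'cat' or 'dog'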

N-grams

Captures sequences of N words. Bigrams (N=2) can capture phrases like "not good".
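A sketch using scikit-learn's ngram_range parameter to include bigrams alongside unigrams:

from sklearn.feature_extraction.text import CountVectorizer

bigram_bow = CountVectorizer(ngram_range=(1, 2))    # unigrams and bigrams
bigram_bow.fit(['this is not good'])
print(bigram_bow.get_feature_names_out())
# ['good' 'is' 'is not' 'not' 'not good' 'this' 'this is']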

Key Takeaways

  • BoW represents documents as word frequency vectors
  • Simple but effective for many text classification tasks
  • Loses word order and context information
  • TF-IDF weighting improves performance by considering word importance
  • Forms the basis for more advanced embedding techniques
  • Still used in production systems where interpretability matters

Next Steps