Bag of Words (BoW) Embeddings
Introduction
Bag of Words is one of the simplest text representation methods. It represents text as a vector where each dimension corresponds to a word in the vocabulary, and the value is the word's frequency (or TF-IDF score) in the document.
How Bag of Words Works
1. Build Vocabulary
Extract all unique words from the corpus and assign each an index.
vocab = {'the': 0, 'cat': 1, 'sat': 2, ...}
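A minimal Python sketch of this step, over an illustrative three-document corpus (the corpus is made up for demonstration):

corpus = ["the cat sat", "the dog sat", "the cat ran"]

vocab = {}
for doc in corpus:
    for word in doc.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)  # assign the next free index

print(vocab)  # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3, 'ran': 4}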
2. Count Words
For each document, count how many times each vocabulary word appears.
'the cat sat' → [1, 1, 1, 0, ...]
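Continuing the sketch above, counting reduces to a Counter lookup per vocabulary word:

from collections import Counter

def count_words(doc, vocab):
    # One entry per vocabulary word, in index order
    counts = Counter(doc.lower().split())
    return [counts.get(word, 0) for word in vocab]

print(count_words("the cat sat", vocab))  # [1, 1, 1, 0, 0]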
3. Create Vector
The resulting vector represents the document as a point in a high-dimensional space, with one dimension per vocabulary word.
doc_vector = [tf1, tf2, ..., tfn]
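Putting the three steps together: if scikit-learn is available, its CountVectorizer performs vocabulary building and counting in one call (note that it sorts the vocabulary alphabetically, so the indices differ from the hand-built sketch above):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat", "the dog sat", "the cat ran"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # ['cat' 'dog' 'ran' 'sat' 'the']
print(X.toarray())  # each row is one document's count vector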
Interactive BoW Visualizer
[Interactive demo: add custom documents, view their BoW vector representations, and compare documents with a cosine-similarity heatmap; brighter cells indicate more similar documents.]
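The heatmap's metric is ordinary cosine similarity between count vectors. A minimal NumPy sketch, reusing the hand-built vectors from the example above:

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a · b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = np.array([1, 1, 1, 0, 0])  # "the cat sat"
v2 = np.array([1, 0, 1, 1, 0])  # "the dog sat"
print(cosine_similarity(v1, v2))  # ≈ 0.667: two of three words shared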
Advantages and Limitations
✓ Advantages
- Simple and intuitive to understand
- Fast to compute and implement
- Works well for document classification
- No training required
- Interpretable features (word counts)
✗ Limitations
- Ignores word order and grammar
- High-dimensional, sparse vectors
- No semantic understanding
- Vocabulary can grow very large
- Treats "not good" the same as "good not"
Variations and Extensions
Binary BoW
Uses 1/0 to indicate word presence instead of counts. Useful when frequency isn't important.
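With scikit-learn this is a one-flag change; a sketch on a made-up document:

from sklearn.feature_extraction.text import CountVectorizer

# binary=True records presence/absence (1/0) instead of raw counts
binary_vectorizer = CountVectorizer(binary=True)
X = binary_vectorizer.fit_transform(["the the the cat"])
print(X.toarray())  # [[1 1]]: 'the' is clipped to 1 despite three occurrences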
TF-IDF
Weights words by their importance across documents. Reduces impact of common words.
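The classic weighting is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) counts documents containing t. A sketch with scikit-learn's TfidfVectorizer, which uses a smoothed variant of this formula:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the dog sat", "the cat ran"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)  # rows are L2-normalized tf-idf vectors

# 'the' appears in every document, so it gets the lowest weight
print(dict(zip(tfidf.get_feature_names_out(), X.toarray()[0].round(2))))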
N-grams
Captures sequences of N words. Bigrams (N=2) can capture phrases like "not good".
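CountVectorizer also supports this via its ngram_range parameter; a sketch:

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps unigrams plus bigrams such as 'not good'
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_vectorizer.fit(["this is not good"])
print(bigram_vectorizer.get_feature_names_out())
# ['good' 'is' 'is not' 'not' 'not good' 'this' 'this is']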
Key Takeaways
- BoW represents documents as word frequency vectors
- Simple but effective for many text classification tasks
- Loses word order and context information
- TF-IDF weighting improves performance by considering word importance
- Forms the basis for more advanced embedding techniques
- Still used in production systems where interpretability matters