Bag of Words (BoW) Embeddings

Introduction

Bag of Words is one of the simplest text representation methods. It represents text as a vector where each dimension corresponds to a word in the vocabulary, and the value is the word's frequency (or TF-IDF score) in the document.

How Bag of Words Works

1. Build Vocabulary

Extract all unique words from the corpus and assign each an index.

vocab = {'the': 0, 'cat': 1, 'sat': 2, ...}
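A minimal sketch of this step in plain Python (the toy corpus and whitespace tokenization are illustrative assumptions):

corpus = ['the cat sat on the mat', 'the dog ran']   # toy corpus (assumption)

vocab = {}
for doc in corpus:
    for word in doc.lower().split():    # simple whitespace tokenization
        if word not in vocab:
            vocab[word] = len(vocab)    # assign the next free index

# vocab == {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4, 'dog': 5, 'ran': 6}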

2. Count Words

For each document, count how many times each vocabulary word appears.

'the cat sat on the mat' → [2, 1, 1, 1, 1, 0, ...]
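The counting itself can be done with a standard Counter; a minimal sketch:

from collections import Counter

counts = Counter('the cat sat on the mat'.lower().split())
# Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})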

3. Create Vector

The resulting vector represents the document in high-dimensional space.

doc_vector = [tf1, tf2, ..., tfn]
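Putting the two steps together, a hypothetical bow_vector helper (reusing the vocab built above) maps a document to its vector:

def bow_vector(doc, vocab):
    vec = [0] * len(vocab)              # one dimension per vocabulary word
    for word in doc.lower().split():
        if word in vocab:               # out-of-vocabulary words are dropped
            vec[vocab[word]] += 1       # count occurrences
    return vec

bow_vector('the cat sat on the mat', vocab)   # [2, 1, 1, 1, 1, 0, 0]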

Interactive BoW Visualizer

[Interactive demo: add a custom document, inspect its BoW vector representation, and compare documents in a cosine-similarity matrix (brighter = more similar).]
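Cosine similarity compares two vectors by the angle between them, ignoring their lengths. A minimal sketch, reusing the hypothetical bow_vector helper and vocab from above:

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

v1 = bow_vector('the cat sat on the mat', vocab)
v2 = bow_vector('the dog ran', vocab)
cosine_similarity(v1, v2)   # ≈ 0.41: the documents share only 'the'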

Advantages and Limitations

✓ Advantages

  • Simple and intuitive to understand
  • Fast to compute and implement
  • Works well for document classification
  • No training required
  • Interpretable features (word counts)

✗ Limitations

  • Ignores word order and grammar
  • High-dimensional, sparse vectors
  • No semantic understanding
  • Vocabulary can grow very large
  • Treats "not good" the same as "good not"

Variations and Extensions

Binary BoW

Uses 1/0 to indicate word presence instead of counts. Useful when frequency isn't important.
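As a sketch, scikit-learn's CountVectorizer supports this directly through its binary flag (assuming scikit-learn is installed):

from sklearn.feature_extraction.text import CountVectorizer

binary_bow = CountVectorizer(binary=True)           # 1/0 presence instead of counts
X = binary_bow.fit_transform(['the cat sat on the mat'])
print(X.toarray())   # [[1 1 1 1 1]]: 'the' maps to 1, not 2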

TF-IDF

Weights words by how informative they are across the corpus: term frequency is multiplied by inverse document frequency, which down-weights words that appear in many documents.
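A minimal sketch with scikit-learn's TfidfVectorizer (the toy corpus is an assumption):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the cat sat on the mat', 'the dog ran']    # toy corpus (assumption)
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
# 'the' occurs in both documents, so its weight is scaled down
# relative to document-specific words like 'cat' or 'dog'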

N-grams

Captures sequences of N words. Bigrams (N=2) can capture phrases like "not good".
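A sketch using scikit-learn's ngram_range parameter to include bigrams alongside unigrams:

from sklearn.feature_extraction.text import CountVectorizer

bigram_bow = CountVectorizer(ngram_range=(1, 2))    # unigrams and bigrams
bigram_bow.fit(['this is not good'])
print(bigram_bow.get_feature_names_out())
# ['good' 'is' 'is not' 'not' 'not good' 'this' 'this is']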

Key Takeaways

  • BoW represents documents as word frequency vectors
  • Simple but effective for many text classification tasks
  • Loses word order and context information
  • TF-IDF weighting improves performance by considering word importance
  • Forms the basis for more advanced embedding techniques
  • Still used in production systems where interpretability matters

Next Steps