Bag of Words (BoW) · Suman Bhadra Notes

Just count the words

Bag of Words throws each document's words into a "bag", forgets the order, and represents the document by how many times each vocabulary word appears.

The name says it all: a bag has no order. Once case is normalized, "The dog bit the man" and "the man bit the dog" produce the same vector. You lose sequence, but you keep a simple, surprisingly effective signal: which words, and how often.
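A quick way to see the order loss, as a minimal sketch. The helper name `bow` is hypothetical, and lowercasing plus whitespace splitting is an assumed tokenization that these notes do not specify.

```python
from collections import Counter

def bow(text):
    # Assumed tokenization: lowercase, then split on whitespace.
    return Counter(text.lower().split())

a = bow("The dog bit the man")
b = bow("the man bit the dog")

# Both sentences contain the same words the same number of times,
# so their bags compare equal even though the meaning is reversed.
print(a == b)
print(a["the"])
```

Without the `.lower()` call, "The" and "the" would count as different words and the two bags would no longer match.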

Build the vectors

Two short reviews → a shared vocabulary → a count vector per document.

How it's built

1. Vocabulary: all unique words

Collect every distinct word across the whole corpus — that's the column set.

2. Count per document

For each document, tally how many times each vocab word appears.

3. Vector: one row each

Each document becomes a fixed-length count vector — ready for a model.
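The three steps above can be sketched in plain Python. The two reviews are made-up examples, the function names are hypothetical, and the tokenization (lowercase, whitespace split) and the sorted column order are assumptions made for illustration.

```python
from collections import Counter

def tokenize(text):
    # Assumed tokenization: lowercase and split on whitespace.
    return text.lower().split()

def build_vocabulary(docs):
    # Step 1: every distinct word across the corpus.
    # Sorting only gives the columns a stable, reproducible order.
    return sorted({word for doc in docs for word in tokenize(doc)})

def vectorize(doc, vocab):
    # Steps 2 and 3: count each vocabulary word in this document
    # and emit one fixed-length row (a missing word counts as 0).
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocab]

# Two made-up reviews standing in for a corpus.
docs = ["great movie great cast", "boring movie"]

vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]

print(vocab)    # ['boring', 'cast', 'great', 'movie']
print(vectors)  # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Every row has the same length as the vocabulary, which is what lets documents of different lengths share one feature space.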

Then model it

Those vectors feed straight into a classifier like Naive Bayes or Logistic Regression for spam or sentiment detection.
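To make the hand-off to a classifier concrete, here is a from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing over word counts. The four training messages, the labels, and the function names `train_nb` and `predict_nb` are all invented for illustration; in practice a library implementation would normally be used instead.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumed tokenization: lowercase and split on whitespace.
    return text.lower().split()

def train_nb(docs, labels):
    # Collect the vocabulary, documents per class, and word counts per class.
    vocab = {word for doc in docs for word in tokenize(doc)}
    class_docs = Counter(labels)
    word_counts = {label: Counter() for label in class_docs}
    for doc, label in zip(docs, labels):
        word_counts[label].update(tokenize(doc))
    return vocab, class_docs, word_counts

def predict_nb(model, text):
    vocab, class_docs, word_counts = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        # Log prior for the class.
        score = math.log(class_docs[label] / total_docs)
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (class, word) pairs above zero.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        scores[label] = score
    # Return the class with the highest log score.
    return max(scores, key=scores.get)

# Made-up toy training set.
train_docs = ["win free prize now", "free cash win",
              "meeting at noon", "lunch at noon tomorrow"]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_nb(train_docs, train_labels)
print(predict_nb(model, "free prize"))
print(predict_nb(model, "noon meeting"))
```

The classifier only ever sees counts, so "free prize" and "prize free" score identically, which is the Bag of Words assumption at work.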

Strengths and weaknesses

Strengths

Simple and fast to compute
A strong baseline for text classification
Easy to interpret — counts are transparent

Weaknesses

Ignores word order and grammar
Sparse, high-dimensional vectors
Common words (the, is, of) dominate the counts
No sense of word meaning or similarity

The fixes

N-grams bring back a little order; TF-IDF tames dominant words; embeddings add meaning.
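A minimal sketch of the first two fixes. The function names are hypothetical, the sample documents are made up, and the TF-IDF weighting used here (raw count times log(N / document frequency), with no smoothing) is one common variant assumed for illustration; libraries often use slightly different formulas. Embeddings need a trained model and are not shown.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumed tokenization: lowercase and split on whitespace.
    return text.lower().split()

def ngrams(tokens, n):
    # Slide a window of n consecutive tokens and join each window
    # into a single feature string.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Words that appear in every document get weight log(1) = 0.
        rows.append({w: c * math.log(n_docs / df[w]) for w, c in counts.items()})
    return rows

# Bigrams tell apart the two sentences that plain counts could not.
a = Counter(ngrams(tokenize("The dog bit the man"), 2))
b = Counter(ngrams(tokenize("the man bit the dog"), 2))
print(a == b)  # False: "dog bit" appears only in a, "man bit" only in b

# TF-IDF pushes words shared by every document down to zero.
weights = tf_idf(["the movie was great", "the movie was boring"])
print(weights[0]["the"], weights[0]["great"])
```

Bigrams recover only local order and enlarge the vocabulary, so they trade some of the sparsity problem for the ordering one.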