Bag of Words (BoW)
Just count the words
Bag of Words throws each document's words into a "bag", forgets the order, and represents the document by how many times each vocabulary word appears.
The name says it all: a bag has no order. Once case is normalized, "The dog bit the man" and "the man bit the dog" produce the same vector. You lose sequence, but you keep a simple, surprisingly effective signal: which words, and how often.
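A minimal sketch of that order-blindness, assuming the simplest possible tokenizer (lowercase, split on whitespace). The helper name `bag_of_words` is made up for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word."""
    return Counter(text.lower().split())

a = bag_of_words("The dog bit the man")
b = bag_of_words("the man bit the dog")

print(a == b)     # True: same words, same counts, order forgotten
print(a["the"])   # 2
```

Without the `lower()` call, "The" and "the" would be counted as different words and the two bags would no longer match.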
Build the vectors
Two short reviews → a shared vocabulary → a count vector per document.
How it's built
Collect every distinct word across the whole corpus — that's the column set.
For each document, tally how many times each vocab word appears.
Each document becomes a fixed-length count vector — ready for a model.
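The three steps above can be sketched in a few lines of plain Python. The two reviews are invented examples, and the helper names (`tokenize`, `build_vocabulary`, `vectorize`) are hypothetical. Sorting the vocabulary is just one convenient way to fix the column order.

```python
from collections import Counter

def tokenize(text):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def build_vocabulary(docs):
    """Step 1: every distinct word across the corpus, in a fixed order."""
    return sorted({word for doc in docs for word in tokenize(doc)})

def vectorize(doc, vocab):
    """Steps 2 and 3: one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocab]

reviews = ["great movie great cast", "boring movie"]
vocab = build_vocabulary(reviews)
vectors = [vectorize(review, vocab) for review in reviews]

print(vocab)    # ['boring', 'cast', 'great', 'movie']
print(vectors)  # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Every vector has the same length as the vocabulary, no matter how long the original review was.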
Those vectors feed straight into a classifier like Naive Bayes or Logistic Regression for spam or sentiment detection.
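To make that concrete, here is a rough sketch of a multinomial Naive Bayes classifier trained on count vectors, written in plain Python with add-one smoothing. The four training reviews, the labels "pos" and "neg", and all function names are assumptions for illustration. In practice you would more likely reach for a library implementation.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def vectorize(text, vocab):
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]  # words outside the vocab are ignored

def train(vectors, labels):
    """Fit multinomial Naive Bayes with add-one smoothing on count vectors."""
    n_features = len(vectors[0])
    model = {}
    for label in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == label]
        totals = [sum(column) for column in zip(*rows)]  # per-word counts in this class
        denom = sum(totals) + n_features
        model[label] = {
            "log_prior": math.log(len(rows) / len(vectors)),
            "log_likelihood": [math.log((t + 1) / denom) for t in totals],
        }
    return model

def predict(model, vector):
    """Return the label with the highest log posterior score."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            x * ll for x, ll in zip(vector, params["log_likelihood"])
        )
    return max(model, key=score)

# Hypothetical toy corpus
docs = [
    "great movie loved it",
    "wonderful cast great story",
    "boring movie hated it",
    "terrible plot boring cast",
]
labels = ["pos", "pos", "neg", "neg"]

vocab = sorted({word for doc in docs for word in tokenize(doc)})
model = train([vectorize(doc, vocab) for doc in docs], labels)

print(predict(model, vectorize("great story", vocab)))  # pos
print(predict(model, vectorize("boring plot", vocab)))  # neg
```

The classifier never sees word order. It only sees how often each vocabulary word occurs, which is exactly what the count vector carries.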
Strengths and weaknesses
Strengths:
- Simple and fast to compute
- A strong baseline for text classification
- Easy to interpret: counts are transparent
Weaknesses:
- Ignores word order and grammar
- Sparse, high-dimensional vectors
- Common words dominate the counts
- No sense of word meaning or similarity
N-grams bring back a little order; TF-IDF tames dominant words; embeddings add meaning.
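The first two fixes can be sketched briefly. The code below counts bigrams (adjacent word pairs) and applies a plain, unsmoothed TF-IDF weighting, count × log(N / document frequency). Libraries typically use smoothed variants, so treat the formula and the function names here as illustrative assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def bigrams(text):
    """Count adjacent word pairs instead of single words."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))

def tf_idf(docs):
    """Weight each word count by log(N / number of documents containing the word)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    return [
        {word: count * math.log(n_docs / doc_freq[word])
         for word, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

a = bigrams("The dog bit the man")
b = bigrams("the man bit the dog")
print(a == b)  # False: ("dog", "bit") appears only in the first sentence

weights = tf_idf(["the dog bit the man", "the cat sat"])
print(weights[0]["the"])  # 0.0: "the" is in every document, so it carries no weight
print(weights[0]["dog"])  # about 0.693, i.e. log(2)
```

With bigrams the two sentences finally get different representations, and with TF-IDF the ubiquitous "the" drops to zero while rarer words keep their weight.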