Bag of Words (BoW)

Tags: NLP, vectorization, word counts, BoW

Just count the words

Bag of Words throws each document's words into its own "bag", forgets their order, and represents the document by how many times each vocabulary word appears.

The name says it all: a bag has no order. Once case is normalized, "The dog bit the man" and "the man bit the dog" produce the same vector. You lose sequence, but you keep a simple, surprisingly effective signal: which words appear, and how often.
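A quick sketch makes the order-blindness concrete. It assumes the simplest possible tokenizer (lowercase, then split on whitespace), which is an illustrative choice rather than a required one, and `bag` is a hypothetical helper name.

```python
from collections import Counter

def bag(text):
    """Lowercase, split on whitespace, and count each word (order is discarded)."""
    return Counter(text.lower().split())

a = bag("The dog bit the man")
b = bag("the man bit the dog")

print(a == b)     # True: identical counts, so identical BoW vectors
print(a["the"])   # 2
```

Without the `lower()` call, "The" and "the" would count as different words and the two bags would no longer match.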

Build the vectors

Two short reviews → a shared vocabulary → a count vector per document.

How it's built

1. Vocabulary: all unique words

Collect every distinct word across the whole corpus — that's the column set.

2. Count: per document

For each document, tally how many times each vocab word appears.

3. Vector: one row each

Each document becomes a fixed-length count vector — ready for a model.
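The three steps can be sketched in a few lines of plain Python. The two reviews below are made-up examples, and the whitespace tokenizer and alphabetical column order are assumptions chosen to keep the sketch small.

```python
docs = [
    "great movie great cast",   # hypothetical review 1
    "boring movie",             # hypothetical review 2
]

# Tokenize: lowercase and split on whitespace (a deliberately naive tokenizer).
tokenized = [doc.lower().split() for doc in docs]

# Step 1: vocabulary = every distinct word in the corpus, sorted for a stable column order.
vocab = sorted({word for tokens in tokenized for word in tokens})

# Steps 2 and 3: count each vocab word per document, giving one fixed-length row per document.
vectors = [[tokens.count(word) for word in vocab] for tokens in tokenized]

print(vocab)    # ['boring', 'cast', 'great', 'movie']
print(vectors)  # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Every row has the same length as the vocabulary, so a word a document never uses simply gets a 0 in its column.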

Then model it

Those vectors feed straight into a classifier like Naive Bayes or Logistic Regression for spam or sentiment detection.
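As a rough illustration, here is a from-scratch multinomial Naive Bayes over word counts. The training messages, the labels, and the helper names (`tokenize`, `predict`) are all hypothetical, and add-one smoothing is one common choice among several. In practice a library such as scikit-learn would usually handle both the vectorizing and the classifier.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training data (not from the original text).
train = [
    ("win free money now", "spam"),
    ("free prize click now", "spam"),
    ("meeting moved to monday", "ham"),
    ("lunch on monday", "ham"),
]

def tokenize(text):
    return text.lower().split()

# Per-class word counts (the summed BoW vectors of each class) and document counts.
word_counts = defaultdict(Counter)
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(tokenize(text))
    doc_counts[label] += 1

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    """Return the label with the highest log posterior under multinomial Naive Bayes."""
    scores = {}
    for label in doc_counts:
        # Log prior: fraction of training documents with this label.
        score = math.log(doc_counts[label] / sum(doc_counts.values()))
        total = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing keeps unseen word/class pairs from zeroing the score.
            p = (word_counts[label][word] + 1) / (total + len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("free money"))      # spam
print(predict("monday meeting"))  # ham
```

The classifier never looks at word order; it only needs the counts, which is exactly what a BoW vector provides.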

Strengths and weaknesses

Strengths
  • Simple and fast to compute
  • A strong baseline for text classification
  • Easy to interpret: counts are transparent

Weaknesses
  • Ignores word order and grammar
  • Sparse, high-dimensional vectors
  • Common words dominate the counts
  • No sense of word meaning or similarity

The fixes

N-grams bring back a little order; TF-IDF tames dominant words; embeddings add meaning.
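The first two fixes are small enough to sketch in plain Python. The sample sentences and the helper names (`ngrams`, `tfidf`) are illustrative assumptions, and the unsmoothed `count * log(N / df)` weighting is only one TF-IDF variant; libraries often use smoothed forms. Embeddings are left out because they require a trained model.

```python
import math
from collections import Counter

def ngrams(text, n):
    """Return the n-word sequences in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Bigrams restore a little order: the two sentences no longer collide.
a = Counter(ngrams("The dog bit the man", 2))
b = Counter(ngrams("the man bit the dog", 2))
print(a == b)  # False

# TF-IDF: scale each count by log(N / df), where df is the number of
# documents that contain the word.
docs = ["the movie was great", "the movie was boring", "the cast was great"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)
df = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    counts = Counter(tokens)
    return {w: c * math.log(n_docs / df[w]) for w, c in counts.items()}

weights = tfidf(tokenized[1])
print(weights["the"])                        # 0.0: appears in every document
print(weights["boring"] > weights["movie"])  # True: rarer words score higher
```

Bigrams come at a cost: the vocabulary grows quickly, which makes the vectors even sparser.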