Stop Words · Suman Bhadra Notes

The glue words

Stop words are the most common words in a language — the, is, a, and, of — the grammatical glue that holds sentences together but often carries little topical meaning.

In a bag-of-words model that ignores order, these words appear everywhere and in every class, so they add bulk without helping the model tell documents apart. Removing them can shrink the vocabulary and sharpen the signal.

Watch them get filtered

Take a sentence, highlight its stop words, and strip them away: the content words that remain still tell you what it is about.
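A minimal sketch of that filtering step, using only the standard library. The tiny `STOP_WORDS` set, the example sentence, and the helper names are assumptions made for illustration; real stop lists are far longer.

```python
import re

# Tiny illustrative stop list (an assumption; real lists hold 100+ entries).
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "on", "in", "to", "it"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the tokens of `text` that are not in `stop_words`."""
    return [tok for tok in tokenize(text) if tok not in stop_words]

sentence = "The cat is sleeping on the roof of the old barn"
content_words = remove_stop_words(sentence)
print(content_words)  # ['cat', 'sleeping', 'roof', 'old', 'barn']
```

Six of the eleven tokens disappear, yet the five that remain are enough to guess the subject of the sentence.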

Why remove them?

Smaller vocabulary: less noise

Fewer dimensions in the feature space → faster, leaner models.

Focus on content: topic words

What remains is mostly content words (nouns, verbs, adjectives) that distinguish one document from another.

Better counts for BoW/TF-IDF

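Keeps very common words from dominating raw frequency counts. TF-IDF already down-weights words that occur in most documents, so the gain there is mostly a smaller vocabulary.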
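The effect on raw counts and vocabulary size can be sketched on a toy corpus. The three documents and the stop list below are made up for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# Illustrative stop list and corpus (both assumptions, not real data).
STOP_WORDS = {"the", "is", "a", "and", "of", "in", "to"}

docs = [
    "the price of the stock is rising",
    "the team is in the final and the crowd is loud",
    "the stock of the team store is low",
]

# Flatten the corpus into one token stream.
tokens = [tok for doc in docs for tok in doc.split()]

raw_counts = Counter(tokens)
filtered_counts = Counter(tok for tok in tokens if tok not in STOP_WORDS)

# Before filtering, the top of the frequency table is all glue words.
print(raw_counts.most_common(2))  # [('the', 7), ('is', 4)]

# Vocabulary size before and after filtering.
print(len(raw_counts), "->", len(filtered_counts))  # 14 -> 9
```

After filtering, the most frequent tokens are "stock" and "team", which is what a bag-of-words model needs in order to tell a finance document from a sports one.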

When NOT to remove them

The danger

Stop-word lists are blunt. "To be or not to be" is entirely stop words — strip them and the phrase vanishes. Removing "not" can flip sentiment: "not good" → "good".
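Both failure cases are easy to reproduce. In the sketch below, the stop list, the `NEGATIONS` set, and the helper `strip_stop_words` are hypothetical names chosen for illustration. Keeping negation words out of the stop list is one possible mitigation, not the only one.

```python
# Illustrative stop list (an assumption); "not" appears on many real lists.
STOP_WORDS = {"to", "be", "or", "not", "the", "is", "a", "this", "was"}

# Words we may want to protect for sentiment-style tasks (also an assumption).
NEGATIONS = {"not", "no", "never"}

def strip_stop_words(text, stop_words):
    """Drop every whitespace-separated token found in `stop_words`."""
    return " ".join(t for t in text.lower().split() if t not in stop_words)

# The whole phrase is made of stop words, so nothing survives.
print(repr(strip_stop_words("To be or not to be", STOP_WORDS)))  # ''

# Dropping "not" flips the apparent sentiment.
print(strip_stop_words("this was not good", STOP_WORDS))  # good

# Removing the negations from the stop list preserves the meaning.
safer = STOP_WORDS - NEGATIONS
print(strip_stop_words("this was not good", safer))  # not good
```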

Remove for

Bag-of-words / TF-IDF topic models
Keyword extraction & search indexing
Tasks where order doesn't matter

Keep for

Sentiment (negation words matter)
Translation & generation (grammar matters)
Transformers / LLMs: they are trained on full text and use every word's context

Note

There's no single official stop list — NLTK, spaCy, and scikit-learn each ship a slightly different one. Tune it to your task, and consider a custom list (e.g. domain words that appear in every document).
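One way to build such a custom list is to flag words that occur in every document, or nearly every one, of your own corpus. The sketch below assumes whitespace tokenization. The function `domain_stop_words`, its `min_doc_fraction` parameter, and the sample reports are all hypothetical.

```python
from collections import Counter

def domain_stop_words(docs, min_doc_fraction=1.0):
    """Return tokens that occur in at least `min_doc_fraction` of the documents."""
    doc_freq = Counter()
    for doc in docs:
        # Use a set so each token is counted once per document.
        doc_freq.update(set(doc.lower().split()))
    threshold = min_doc_fraction * len(docs)
    return {tok for tok, df in doc_freq.items() if df >= threshold}

# A small general-purpose base list (illustrative only).
BASE_STOP_WORDS = {"the", "a", "of", "is", "for"}

reports = [
    "patient reports chest pain",
    "patient denies fever",
    "patient has a history of asthma",
]

# Merge the general list with the corpus-specific words.
custom = BASE_STOP_WORDS | domain_stop_words(reports)
print(sorted(domain_stop_words(reports)))  # ['patient']
```

In practice the base set would likely come from one of the libraries mentioned above, and a threshold below 1.0 may suit larger corpora, where few words appear in literally every document.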