Stop Words
The glue words
Stop words are the most common words in a language — the, is, a, and, of — the grammatical glue that holds sentences together but often carries little topical meaning.
In a bag-of-words model that ignores order, these words appear everywhere and in every class, so they add bulk without helping the model tell documents apart. Removing them can shrink the vocabulary and sharpen the signal.
Watch them get filtered
When a sentence's stop words are highlighted and stripped away, the content words that remain still tell you what it is about.
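The filtering step can be sketched in a few lines of Python. The stop list here is a tiny hand-picked set for illustration, not any library's list, and `remove_stop_words` is a hypothetical helper name.

```python
import re

# Tiny illustrative stop list (an assumption; real lists hold hundreds of words).
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "on", "in", "to"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stop_words]

sentence = "The cat is sleeping on the roof of the old barn"
content_words = remove_stop_words(sentence)
print(content_words)  # ['cat', 'sleeping', 'roof', 'old', 'barn']
```

Six of the eleven tokens disappear, yet the five that remain still say what the sentence is about.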
Why remove them?
- A smaller vocabulary means fewer dimensions in the feature space, so models are faster and leaner.
- What remains is mostly content words (nouns, verbs, adjectives) that distinguish documents.
- Very common words no longer dominate raw frequency counts.
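A rough way to see these effects is to count tokens before and after filtering. This is a minimal sketch on a two-document toy corpus; both the corpus and the stop list are made up for illustration.

```python
from collections import Counter

# Illustrative stop list and toy corpus (both are assumptions for this sketch).
STOP_WORDS = {"the", "is", "a", "and", "of", "in", "to"}
docs = [
    "the striker scored in the final minute of the match",
    "the senate voted to pass the budget and the tax bill",
]

tokens = [word for doc in docs for word in doc.split()]
kept = [word for word in tokens if word not in STOP_WORDS]

raw_counts = Counter(tokens)
kept_counts = Counter(kept)

print(raw_counts.most_common(1))          # [('the', 6)]
print(len(raw_counts), len(kept_counts))  # 16 11
```

Before filtering, "the" is the top term in both the sports and the politics document, so it says nothing about the topic. After filtering, the vocabulary drops from 16 to 11 words and every remaining term belongs to one topic or the other.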
When NOT to remove them
Stop-word lists are blunt. "To be or not to be" is entirely stop words — strip them and the phrase vanishes. Removing "not" can flip sentiment: "not good" → "good".
Removal usually helps:
- Bag-of-words / TF-IDF topic models
- Keyword extraction & search indexing
- Tasks where order doesn't matter
Removal usually hurts:
- Sentiment (negation words matter)
- Translation & generation (grammar matters)
- Transformers / LLMs, which use every word's context
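The "not good" and "To be or not to be" failures are easy to reproduce. The sketch below uses a small illustrative stop list that contains "not", as many general-purpose lists do. It also shows one possible mitigation: subtracting an assumed set of negation words from the list before filtering. The names `strip_stop_words` and `NEGATIONS` are hypothetical.

```python
# Illustrative stop list that, like many general-purpose lists, includes "not".
STOP_WORDS = {"the", "is", "was", "to", "be", "or", "not"}
# Words to protect when polarity matters (an assumed, task-specific choice).
NEGATIONS = {"not", "no", "never"}

def strip_stop_words(text, stop_words):
    """Lowercase, split on whitespace, and drop words found in stop_words."""
    return [word for word in text.lower().split() if word not in stop_words]

review = "The film was not good"
naive = strip_stop_words(review, STOP_WORDS)
negation_aware = strip_stop_words(review, STOP_WORDS - NEGATIONS)
hamlet = strip_stop_words("To be or not to be", STOP_WORDS)

print(naive)           # ['film', 'good']
print(negation_aware)  # ['film', 'not', 'good']
print(hamlet)          # []
```

Keeping negations in the vocabulary preserves the token, but a pure bag-of-words model still cannot tell which word "not" modifies. Tasks that depend on that link typically need n-grams or a model that uses word order.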
There is no single official stop list: NLTK, spaCy, and scikit-learn each ship their own, and the lists differ in both size and content. Tune the list to your task, and consider a custom one (e.g. adding domain words that appear in every document).
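One way to build such a custom list is to compute document frequency and flag words that occur in every document. This is a minimal sketch on a made-up clinical-notes corpus; the strict "appears in every document" threshold and the tiny base list are assumptions to adjust per task.

```python
from collections import Counter

# Toy domain corpus (an assumption): every document mentions "patient".
docs = [
    "patient reports mild headache",
    "patient denies chest pain",
    "patient shows elevated blood pressure",
]

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for doc in docs for word in set(doc.split()))

# Treat words present in every document as domain stop words.
domain_stop_words = {word for word, df in doc_freq.items() if df == len(docs)}

# Merge with a general-purpose base list (tiny and illustrative here).
base_stop_words = {"the", "is", "a", "and", "of"}
custom_stop_words = base_stop_words | domain_stop_words

print(domain_stop_words)  # {'patient'}
```

On a real corpus a softer cutoff, such as words found in more than 90% of documents, is often more practical than requiring 100%. scikit-learn's vectorizers expose a similar idea through the `max_df` parameter.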