Text Cleaning & Noise Removal · Suman Bhadra Notes

Make the same word look the same

To a computer, "Movie", "movie", and "movie!!!" are three different strings. Cleaning normalizes this surface variation so that all three collapse to one token.

Cut the noise and the vocabulary shrinks, counts line up, and the model can focus on signal instead of formatting quirks.
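A minimal sketch of that effect, counting the three variants before and after normalization. The helper name `normalize_token` is an illustrative assumption, not something these notes define.

```python
import re
from collections import Counter

raw_tokens = ["Movie", "movie", "movie!!!"]

def normalize_token(token: str) -> str:
    """Lowercase a token and drop every character that is not a-z or 0-9."""
    return re.sub(r"[^a-z0-9]", "", token.lower())

raw_counts = Counter(raw_tokens)
clean_counts = Counter(normalize_token(t) for t in raw_tokens)

print(len(raw_counts))     # 3 distinct vocabulary entries
print(dict(clean_counts))  # {'movie': 3}
```

The vocabulary drops from three entries to one, and the single remaining entry carries the full count.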

Scrub a messy review

A raw, noisy review is lowercased, stripped of HTML and URLs, cleared of punctuation, and has its extra whitespace collapsed.
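The steps above can be sketched as one small function. This is an illustrative pipeline, not a reference implementation: the function name `clean_review`, the sample review, and the exact regular expressions are all assumptions.

```python
import re

def clean_review(text: str) -> str:
    """Apply the four cleaning steps in order and return the cleaned string."""
    text = text.lower()                                  # 1. lowercase
    text = re.sub(r"<[^>]+>", " ", text)                 # 2a. strip HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # 2b. strip URLs
    text = re.sub(r"[^\w\s]", " ", text)                 # 3. punctuation -> space
    text = re.sub(r"\s+", " ", text).strip()             # 4. collapse whitespace
    return text

raw = "  This MOVIE was <b>GREAT</b>!!!<br/>Full review: https://example.com/r?id=7   Loved it.  "
print(clean_review(raw))
# this movie was great full review loved it
```

Order matters here: URLs are removed before punctuation, because stripping punctuation first would break a link into leftover fragments such as "https example com".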

The usual operations

Lowercase Movie → movie

Collapse case so the same word isn't counted twice.

Strip markup <br>, URLs

Remove HTML tags, links, and other boilerplate that carries no meaning.
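A hedged sketch of markup removal using only the standard library. The patterns `TAG_RE` and `URL_RE` are simple assumptions that suit review-style text; a regex is not an HTML parser, so heavily nested or malformed markup is better handled by a real parser.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                  # anything that looks like <tag ...>
URL_RE = re.compile(r"https?://\S+|www\.\S+")    # http(s) links and bare www. links

def strip_markup(text: str) -> str:
    """Remove HTML tags, decode entities such as &amp;, and remove URLs."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)   # decode after tag removal so "&lt;" is not mistaken for a tag
    text = URL_RE.sub(" ", text)
    return text

sample = "Great plot<br>Tom &amp; Jerry vibes, see http://example.com/x now"
tidy = " ".join(strip_markup(sample).split())
print(tidy)
# Great plot Tom & Jerry vibes, see now
```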

Drop punctuation !!! , . ?

Remove punctuation and other symbols when the model only counts words, as bag-of-words models do. Drop numbers too if they carry no signal for the task.
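One way to do this without regular expressions is a translation table. The sketch below assumes ASCII punctuation only (`string.punctuation`), and the `drop_digits` flag is a hypothetical parameter added for illustration.

```python
import string

# Map every ASCII punctuation mark (and, optionally, every digit) to a space.
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})
DIGITS_TO_SPACE = str.maketrans({ch: " " for ch in string.digits})

def drop_punctuation(text: str, drop_digits: bool = False) -> str:
    """Replace punctuation (and optionally digits) with spaces, then tidy spacing."""
    text = text.translate(PUNCT_TO_SPACE)
    if drop_digits:
        text = text.translate(DIGITS_TO_SPACE)
    return " ".join(text.split())

print(drop_punctuation("Wow!!! 10/10, would watch again."))
# Wow 10 10 would watch again
print(drop_punctuation("Wow!!! 10/10, would watch again.", drop_digits=True))
# Wow would watch again
```

Replacing with a space rather than deleting keeps "well-made" as two words instead of fusing it into "wellmade". The trade-off is that contractions split too: "don't" becomes "don t".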

Normalize whitespace "   " → " "

Collapse runs of spaces, tabs, and newlines into single spaces.
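A minimal sketch of this step. The pattern `\s+` matches any run of spaces, tabs, or newlines; the function name is an assumption.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

print(normalize_whitespace("  so \t much\n\n space  "))
# so much space
```

For plain strings, `" ".join(text.split())` gives the same result without a regex.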

Handle accents/emoji café → cafe

Optionally fold accents or convert emojis to words, depending on the task.
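Both options can be sketched with the standard library. Accent folding below relies on Unicode canonical decomposition (NFD), which splits "é" into "e" plus a combining accent that is then dropped. The emoji table `EMOJI_WORDS` is a tiny hypothetical mapping made up for illustration; a real project would need a much fuller one.

```python
import unicodedata

def fold_accents(text: str) -> str:
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical mapping, for illustration only.
EMOJI_WORDS = {"😍": "heart_eyes", "👎": "thumbs_down"}

def emojis_to_words(text: str, mapping: dict = EMOJI_WORDS) -> str:
    """Replace each known emoji with a word token, padded so it stays separate."""
    for emoji_char, word in mapping.items():
        text = text.replace(emoji_char, f" {word} ")
    return " ".join(text.split())

print(fold_accents("café"))   # cafe
with_words = emojis_to_words("loved it 😍")
# with_words == "loved it heart_eyes"
```

Folding is lossy: in some languages the accent distinguishes two different words, so this is only worth doing when the task tolerates that.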

Regex power tools patterns

Most of this is done with regular expressions — see Regex for NLP.

Clean carefully — not everything is noise

Usually safe to remove

HTML tags and URLs
Extra whitespace
Case differences (for most tasks)

Think before removing

Punctuation — "!" can signal strong sentiment
Emojis — often carry the sentiment in social text
Negation — "not good" must keep the "not"
Almost anything, for transformers: their tokenizers are built to handle raw text, so heavy cleaning often hurts
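The cautions above can be made concrete with a tokenizer that keeps "!" runs and emojis as tokens and refuses to drop negation words. Everything here is an assumption for illustration: the tiny `STOPWORDS` set (which, like some stock stopword lists, contains "not"), the `KEEP` set, and the emoji code point range in `TOKEN_RE`.

```python
import re

# Hypothetical stopword list that includes "not", as some stock lists do.
STOPWORDS = {"this", "the", "a", "was", "is", "it", "not"}
KEEP = {"not", "no", "never"}   # negation words that should survive filtering

# Words, runs of "!", or one character from a common emoji code point range.
TOKEN_RE = re.compile(r"[a-z']+|!+|[\U0001F300-\U0001FAFF]")

def tokens(text: str, protect_negation: bool) -> list:
    """Tokenize, then drop stopwords, optionally sparing negation words."""
    found = TOKEN_RE.findall(text.lower())
    return [t for t in found
            if t not in STOPWORDS or (protect_negation and t in KEEP)]

review = "This movie was not good!!! 👎"
naive = tokens(review, protect_negation=False)
careful = tokens(review, protect_negation=True)
# naive   == ['movie', 'good', '!!!', '👎']
# careful == ['movie', 'not', 'good', '!!!', '👎']
```

The naive version turns a negative review into the tokens "movie good", which is the failure the list warns about.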