Regex for NLP · Suman Bhadra Notes

The Swiss-army knife of text

A regular expression is a compact pattern language for matching text. In NLP it is the workhorse of cleaning and extraction: stripping URLs, pulling out emails, finding hashtags, often in a single line.

You won't build a classifier with regex, but you'll reach for it constantly in the preprocessing stage. A few patterns cover most quick wins.

Watch patterns match

A messy social-media post is scanned by three patterns (URL, mention, hashtag), and each one highlights its own matches.
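The scan described above can be approximated in a few lines of Python. This is only a sketch: the sample post and the `highlight` helper are made up for illustration, and the three patterns are the ones listed in the recipes further down.

```python
import re

# Hypothetical sample post, invented for this sketch.
post = "Loving the update from @data_fan! Notes: https://example.com/blog?id=42 #NLP #regex"

patterns = {
    "url": r"https?://\S+",
    "mention": r"@\w+",
    "hashtag": r"#\w+",
}

def highlight(text, pattern, label):
    """Wrap every match of `pattern` in [label: ...] so it stands out."""
    return re.sub(pattern, lambda m: f"[{label}: {m.group(0)}]", text)

# One pass per pattern, mirroring the three separate scans.
for label, pattern in patterns.items():
    print(highlight(post, pattern, label))
```

Each pass marks only its own kind of match, which makes it easy to see what a pattern catches and what it misses.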

A starter toolkit

Character classes \w \d \s

\w = word character (letter, digit, or underscore), \d = digit, \s = whitespace. [a-z] = a custom set; . = any character except a newline (by default).
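A quick way to get a feel for the classes is to run each one over the same string with Python's `re.findall`. The sample text below is invented for this sketch.

```python
import re

text = "Order 66 shipped to Unit_7 at 9am"  # hypothetical sample

digits = re.findall(r"\d", text)        # each single digit
words = re.findall(r"\w+", text)        # runs of letters, digits, underscore
lowercase = re.findall(r"[a-z]+", text) # custom set: lowercase ASCII letters only
spaces = re.findall(r"\s", text)        # each whitespace character
any_char = re.findall(r"9.m", text)     # "." stands in for any one character

print(digits)     # ['6', '6', '7', '9']
print(words)      # ['Order', '66', 'shipped', 'to', 'Unit_7', 'at', '9am']
print(lowercase)  # ['rder', 'shipped', 'to', 'nit', 'at', 'am']
print(any_char)   # ['9am']
```

Note how `[a-z]+` drops the capital letters and the underscore, while `\w+` keeps `Unit_7` in one piece.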

Quantifiers * + ? {n}

* = zero or more, + = one or more, ? = optional (zero or one), {n} = exactly n. A quantifier applies to whatever comes right before it, so \d+ matches a whole run of digits.
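The four quantifiers are easiest to compare side by side. A small sketch with made-up sample strings:

```python
import re

laughs = "ll lol loool"             # hypothetical samples
spellings = "color colour colouur"
numbers = "2024 12 7"

zero_or_more = re.findall(r"lo*l", laughs)       # "o" may be absent
one_or_more = re.findall(r"lo+l", laughs)        # at least one "o"
optional = re.findall(r"colou?r", spellings)     # "u" zero or one time
exactly_four = re.findall(r"\d{4}", numbers)     # exactly four digits
digit_runs = re.findall(r"\d+", numbers)         # + repeats the \d class

print(zero_or_more)  # ['ll', 'lol', 'loool']
print(one_or_more)   # ['lol', 'loool']
print(optional)      # ['color', 'colour']
print(exactly_four)  # ['2024']
print(digit_runs)    # ['2024', '12', '7']
```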

Anchors & groups ^ $ ( )

^ and $ anchor a match to the start and end of the string (or of each line in multiline mode); parentheses capture the pieces you want to pull out.
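As an illustration of anchors and capture groups working together, the sketch below pulls fields out of a log-style line. The line format and the `LOG_LINE` name are assumptions made for this example.

```python
import re

line = "2024-03-15 ERROR disk full"  # hypothetical log line

# ^ and $ force the pattern to cover the whole string;
# each pair of parentheses captures one field.
LOG_LINE = r"^(\d{4})-(\d{2})-(\d{2}) (\w+) (.*)$"

match = re.search(LOG_LINE, line)
year, month, day, level, message = match.groups()
print(year, month, day, level, message)

# Anchors on their own: "ERROR" is present but not at the start.
starts_with_error = re.search(r"^ERROR", line) is not None
ends_with_full = re.search(r"full$", line) is not None
print(starts_with_error, ends_with_full)  # False True
```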

Common pattern recipes

URL: https?://\S+ · hashtag: #\w+ · mention: @\w+ · email: [\w.-]+@[\w.-]+\.\w+ · digits: \d+.
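The recipes can be collected into one lookup table and run over a piece of text. This is a minimal sketch: the `RECIPES` dict, the `extract_all` helper, and the sample sentence are hypothetical names and data, while the patterns are the ones listed above.

```python
import re

RECIPES = {
    "url": r"https?://\S+",
    "hashtag": r"#\w+",
    "mention": r"@\w+",
    "email": r"[\w.-]+@[\w.-]+\.\w+",
    "digits": r"\d+",
}

def extract_all(text):
    """Return every match for every recipe, keyed by recipe name."""
    return {name: re.findall(pattern, text) for name, pattern in RECIPES.items()}

# Hypothetical sample sentence.
sample = "Mail jane.doe@example.org or ping @jane about #launch, 3 seats left: http://example.org/rsvp"

found = extract_all(sample)
for name, matches in found.items():
    print(name, matches)
```

On this sample the mention recipe returns `['@example', '@jane']`: it also fires inside the email address. When patterns overlap like this, one option is to extract or strip the more specific pattern (email, URL) first.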

Use it well

Great for

Stripping URLs, HTML, emojis
Extracting emails, phone numbers, dates
Simple, rule-based tokenization
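A rough sketch of the cleaning and tokenizing uses listed above. The `clean` and `tokenize` helpers are hypothetical, the HTML pattern only handles simple tags, and the emoji range is a deliberately coarse approximation that does not cover every emoji.

```python
import re

# Assumption: a coarse emoji range (pictographs, misc symbols, dingbats).
EMOJI = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"

def clean(text):
    """Strip simple HTML tags, URLs, and common emojis, then tidy whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)       # tags first, so a URL does not swallow "</p>"
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(EMOJI, " ", text)            # emojis (approximate)
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

def tokenize(text):
    """Rule-based tokens: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(clean("<p>Great launch 🎉 details at https://example.com/x</p>"))
print(tokenize("Great launch, details soon!"))
```

The order of the substitutions matters here: because `\S+` keeps going until whitespace, removing URLs before tags would take a trailing `</p>` along with the link.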

Don't overreach

Regex can't parse real grammar or meaning
Complex patterns get unreadable fast
Edge cases (every email format!) are endless
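To make the edge-case point concrete, here is how the simple email recipe from these notes behaves on a few awkward inputs. The test strings are invented for this sketch.

```python
import re

EMAIL = r"[\w.-]+@[\w.-]+\.\w+"  # the simple recipe from these notes

# Works: the trailing sentence period is left out of the match.
ok = re.findall(EMAIL, "Write to jane@example.com.")

# Truncated: "+" is not in the character set, so the local part is cut.
truncated = re.findall(EMAIL, "user+tag@example.com")

# False positive: a price that happens to look like an address.
false_positive = re.findall(EMAIL, "two tickets@9.99 each")

print(ok)              # ['jane@example.com']
print(truncated)       # ['tag@example.com']
print(false_positive)  # ['tickets@9.99']
```

Each fix (adding `+` to the set, requiring letters in the domain) makes the pattern longer and harder to read, which is the trade-off the list above warns about.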

Where it fits

Regex lives in the cleaning step of the pipeline. For meaning-level tasks, hand off to NER and learned models.