Regex for NLP
The Swiss-army knife of text
A regular expression is a tiny pattern language for matching text. In NLP it's the workhorse of cleaning and extraction — strip URLs, pull out emails, find hashtags — in a single compact line.
You won't build a classifier with regex, but you'll reach for it constantly in the preprocessing stage. A few patterns cover most quick wins.
Watch patterns match
A messy social-media post gets scanned by three patterns — URL, mention, hashtag — each highlighting its matches.
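A rough Python equivalent of that scan might look like the sketch below. The post text and the `scan` helper are made up for illustration; the three patterns are the URL, mention, and hashtag recipes from the toolkit.

```python
import re

# Hypothetical example post, invented for this sketch.
post = "Loving the new release from @acme_dev! Details: https://example.com/blog #nlp #regex"

# The three patterns the demo scans with.
patterns = {
    "url": r"https?://\S+",
    "mention": r"@\w+",
    "hashtag": r"#\w+",
}

def scan(text):
    """Return every match for each named pattern, in order of appearance."""
    return {name: re.findall(pattern, text) for name, pattern in patterns.items()}

matches = scan(post)
for name, found in matches.items():
    print(f"{name}: {found}")
# url: ['https://example.com/blog']
# mention: ['@acme_dev']
# hashtag: ['#nlp', '#regex']
```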
A starter toolkit
Character classes
\w \d \s
Word character, digit, whitespace. [a-z] = a custom set; . = any character except a newline (by default).
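To see the classes side by side, here is a small sketch using Python's `re` module. The sample strings are invented for illustration.

```python
import re

sample = "Room 42, floor 7\tnext to cafe"  # hypothetical input

digits = re.findall(r"\d", sample)      # one match per digit
words = re.findall(r"\w+", sample)      # runs of letters, digits, underscore
lower = re.findall(r"[a-z]+", sample)   # custom set: lowercase ASCII only
pieces = re.split(r"\s+", sample)       # split on spaces and the tab

print(digits)  # ['4', '2', '7']
print(words)   # ['Room', '42', 'floor', '7', 'next', 'to', 'cafe']
print(lower)   # ['oom', 'floor', 'next', 'to', 'cafe']
print(pieces)  # ['Room', '42,', 'floor', '7', 'next', 'to', 'cafe']

# The dot skips newlines unless re.DOTALL is passed.
dotted = re.findall(r"c.t", "cat cut c\nt")
print(dotted)  # ['cat', 'cut']
```

Note how `[a-z]+` drops the capital R in "Room": a custom set matches only what you list.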
Quantifiers
* + ? {n}
Zero or more, one or more, optional (zero or one), exactly n. A quantifier applies to whatever comes right before it, so \w+ means one or more word characters.
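A quick sketch of each quantifier in Python, using invented test strings:

```python
import re

one_or_more = re.findall(r"lo+l", "lol loool ll")    # needs at least one "o"
zero_or_more = re.findall(r"lo*l", "lol loool ll")   # "ll" also qualifies
optional = re.findall(r"colou?r", "color colour")    # the "u" may be absent
exactly_four = re.findall(r"\b\d{4}\b", "In 2023 we shipped 12 models and 31415 tests")

print(one_or_more)   # ['lol', 'loool']
print(zero_or_more)  # ['lol', 'loool', 'll']
print(optional)      # ['color', 'colour']
print(exactly_four)  # ['2023']
```

The `\b` word boundaries in the last pattern are an addition for this example: without them, `\d{4}` would also match the first four digits of 31415.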
Anchors & groups
^ $ ( )
^ and $ anchor the start and end of the string; parentheses group part of a pattern and capture what it matched, so you can pull that piece out.
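As an illustration, the sketch below anchors a pattern to a whole line and captures five pieces from it. The log-line format, `LOG_LINE`, and `parse` are hypothetical, chosen only to show anchors and groups working together.

```python
import re

# ^ and $ force the pattern to cover the whole string;
# each pair of parentheses captures one field.
LOG_LINE = re.compile(r"^(\d{4})-(\d{2})-(\d{2}) (\w+) (.*)$")

def parse(line):
    """Return (year, month, day, level, message), or None if the line doesn't fit."""
    m = LOG_LINE.search(line)
    return m.groups() if m else None

parsed = parse("2024-03-15 ERROR disk full")
rejected = parse("note: 2024-03-15 ERROR disk full")

print(parsed)    # ('2024', '03', '15', 'ERROR', 'disk full')
print(rejected)  # None, because ^ requires the date at the very start
```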
Common patterns
recipes
- URL: https?://\S+
- hashtag: #\w+
- mention: @\w+
- email: [\w.-]+@[\w.-]+\.\w+
- digits: \d+
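One way to keep these recipes handy is a dictionary of compiled patterns, sketched below. The `RECIPES` name and the sample sentence are assumptions made for this example.

```python
import re

# The five recipes, compiled once so they can be reused.
RECIPES = {
    "url": re.compile(r"https?://\S+"),
    "hashtag": re.compile(r"#\w+"),
    "mention": re.compile(r"@\w+"),
    "email": re.compile(r"[\w.-]+@[\w.-]+\.\w+"),
    "digits": re.compile(r"\d+"),
}

text = "Email jane.doe@example.com or call 555 0199 before 5pm."  # hypothetical input

emails = RECIPES["email"].findall(text)
numbers = RECIPES["digits"].findall(text)
mentions = RECIPES["mention"].findall(text)

print(emails)    # ['jane.doe@example.com']
print(numbers)   # ['555', '0199', '5']
print(mentions)  # ['@example']
```

The last result is worth noticing: the mention recipe also fires inside an email address, so in practice it may help to remove emails before looking for mentions.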
Use it well
Great for
- Stripping URLs, HTML, emojis
- Extracting emails, phone numbers, dates
- Simple, rule-based tokenization
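A minimal sketch of the cleaning and tokenizing uses listed above, assuming Python's `re` module. The `clean` and `tokenize` helpers and the sample input are hypothetical, and the tag pattern is deliberately naive: it handles simple tags only and is not a substitute for an HTML parser.

```python
import re

def clean(text):
    """Strip URLs and simple HTML tags, then collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"<[^>]+>", " ", text)       # drop simple tags (naive)
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()

def tokenize(text):
    """Rule-based tokens: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

raw = "<p>Check   this out: https://example.com/x </p> It's great!"
cleaned = clean(raw)
tokens = tokenize(cleaned)

print(cleaned)  # Check this out: It's great!
print(tokens)   # ['Check', 'this', 'out', ':', 'It', "'", 's', 'great', '!']
```

The tokenizer splits "It's" into three tokens, which is the kind of rough edge a rule-based approach leaves behind.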
Don't overreach
- Regex can't parse real grammar or meaning
- Complex patterns get unreadable fast
- Edge cases (every email format!) are endless
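To make the edge-case point concrete, the sketch below runs the URL and email recipes on a few awkward inputs. The inputs are invented, and the `rstrip` cleanup is just one possible workaround, not a complete fix.

```python
import re

URL = re.compile(r"https?://\S+")
EMAIL = re.compile(r"[\w.-]+@[\w.-]+\.\w+")

# \S+ happily swallows the sentence-ending period.
greedy_url = URL.findall("Docs live at https://example.com/docs.")
print(greedy_url)  # ['https://example.com/docs.']

# One workaround: trim trailing punctuation after matching.
trimmed_url = [u.rstrip(".,;:!?)") for u in greedy_url]
print(trimmed_url)  # ['https://example.com/docs']

# A valid address with "+" gets clipped, because "+" is not in the character class.
clipped = EMAIL.findall("Write to first+tag@example.com")
print(clipped)  # ['tag@example.com']

# Meanwhile an obviously malformed string is accepted in full.
junk_accepted = EMAIL.fullmatch("..@..a") is not None
print(junk_accepted)  # True
```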