Tokenization
Cutting text into pieces
Before anything else, NLP chops text into tokens — the atomic units a model reads. Most often a token is a word, but it can be a sub-word or even a single character.
It sounds trivial ("just split on spaces"), but doing it well is surprisingly subtle: contractions, punctuation, hyphens, URLs, emojis, and languages without spaces all complicate the cut.
Tokenization is step one of the NLP pipeline. Every later step — cleaning, vectorizing, modeling — operates on these tokens.
See the cuts
Start with a sentence split on whitespace, then handle punctuation and contractions, then see how sub-word tokenization breaks a rare word into known pieces.
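The first two stages can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the sample sentence and the `TOKEN_PATTERN` regex are assumptions chosen for the example, and the pattern only recognizes the straight apostrophe (').

```python
import re

sentence = "Don't panic, it's only a test!"

# Stage 1: naive whitespace split. Punctuation stays glued to the words.
whitespace_tokens = sentence.split()

# Stage 2: regex split. A token is either a word (optionally with internal
# apostrophes, so contractions stay whole) or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(sentence)

print(whitespace_tokens)
# ["Don't", 'panic,', "it's", 'only', 'a', 'test!']
print(regex_tokens)
# ["Don't", 'panic', ',', "it's", 'only', 'a', 'test', '!']
```

Note how "panic," and "test!" are single tokens after the whitespace split but become separate word and punctuation tokens after the regex pass.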
Levels of tokenization
Word-level: Intuitive, but the vocabulary explodes and unseen words become "unknown".
Sub-word: Break rare words into frequent pieces (un + happi + ness). Small vocabulary and essentially no unknown words. Used by modern LLMs.
Character-level: Tiny vocabulary and never an unknown token, but sequences get very long and each token carries little meaning on its own.
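To make the three levels concrete, here is a small sketch that tokenizes the same word at each level. The sub-word function uses greedy longest-match against a fixed vocabulary, which is only one possible strategy; real tokenizers learn their vocabulary from data (for example with BPE or WordPiece). `TOY_VOCAB` and the function names are hypothetical, made up for this illustration.

```python
def word_tokens(text):
    """Word-level: split on whitespace."""
    return text.split()


def char_tokens(text):
    """Character-level: every character is a token."""
    return list(text)


def subword_tokens(word, vocab):
    """Sub-word level: greedy longest-match against a fixed vocabulary."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical toy vocabulary; a real one would be learned from a corpus.
TOY_VOCAB = {"un", "happi", "ness", "happy"}

print(word_tokens("the unhappiness grew"))       # ['the', 'unhappiness', 'grew']
print(subword_tokens("unhappiness", TOY_VOCAB))  # ['un', 'happi', 'ness']
print(len(char_tokens("unhappiness")))           # 11
```

One word becomes one token at the word level, three at the sub-word level, and eleven at the character level, which is the vocabulary-size versus sequence-length trade-off described above.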
The tricky cases
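Contractions: Should "don't" be one token or two? Conventions differ.
Punctuation: A period can end a sentence or sit inside an abbreviation such as "Dr." or "e.g.".
Languages without spaces: Some languages, such as Chinese and Japanese, don't separate words with spaces at all.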
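The first two cases can be sketched as follows. The contraction splitter follows the Penn Treebank convention ("don't" becomes "do" + "n't"), which is one convention among several. The sentence splitter relies on `ABBREVIATIONS`, a hypothetical and deliberately tiny list assumed for this example; real systems use much larger lists or trained models.

```python
import re


def split_contraction(token):
    """Split a contraction Penn Treebank-style; return other tokens unchanged."""
    # "don't" -> ["do", "n't"]
    match = re.fullmatch(r"(\w+)(n't)", token, flags=re.IGNORECASE)
    if match:
        return [match.group(1), match.group(2)]
    # "it's" -> ["it", "'s"]
    match = re.fullmatch(r"(\w+)('\w+)", token)
    if match:
        return [match.group(1), match.group(2)]
    return [token]


# Hypothetical abbreviation list, far from complete.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}


def split_sentences(text):
    """End a sentence at ., ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")) and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing text that has no closing punctuation.
        sentences.append(" ".join(current))
    return sentences


print(split_contraction("don't"))  # ['do', "n't"]
print(split_contraction("it's"))   # ['it', "'s"]
print(split_sentences("Dr. Smith arrived. He was late."))
# ['Dr. Smith arrived.', 'He was late.']
```

A splitter that treated every period as a sentence boundary would cut the example after "Dr.", which is exactly the ambiguity described above.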
Once you have tokens, the next steps clean and normalize them: Text Cleaning, Stop Words, and Stemming.