Tokenization

Cutting text into pieces

Before anything else, NLP chops text into tokens — the atomic units a model reads. Most often a token is a word, but it can be a sub-word or even a single character.

It sounds trivial ("just split on spaces"), but doing it well is surprisingly subtle: contractions, punctuation, hyphens, URLs, emojis, and languages without spaces all complicate the cut.

First stage of the pipeline

Tokenization is step one of the NLP pipeline. Every later step — cleaning, vectorizing, modeling — operates on these tokens.

See the cuts

A sentence can first be split on whitespace, then refined so that punctuation and contractions are handled sensibly. Sub-word tokenization goes one step further and breaks a rare word into known pieces.
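As a rough illustration of the first two stages, the sketch below compares a plain whitespace split with a small regular expression. The sample sentence and the name TOKEN_PATTERN are made up for this example, and the pattern is only one of many reasonable choices.

```python
import re

sentence = "Don't panic, it's only tokenization!"

# Stage 1: naive whitespace split. Punctuation stays glued to the words.
whitespace_tokens = sentence.split()

# Stage 2: keep runs of word characters together (allowing internal
# apostrophes) and emit every other non-space symbol as its own token.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(sentence)

print(whitespace_tokens)
print(regex_tokens)
```

The whitespace split yields five tokens, including "panic," and "tokenization!" with the punctuation still attached. The regex version yields seven, with the comma and the exclamation mark separated out.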

Levels of tokenization

Word: split on spaces and punctuation

Intuitive, but the vocabulary grows with every new word form, and any word not seen during training is mapped to a generic "unknown" token.
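The unknown-word problem is easy to reproduce. This minimal sketch builds a vocabulary from one toy sentence and encodes another; the texts, the function name encode_words, and the "<unk>" marker are all assumptions chosen for illustration.

```python
train_text = "the cat sat on the mat"
test_text = "the dog sat on the sofa"

# Build a word-level vocabulary from the training text only.
vocab = set(train_text.split())

def encode_words(text, vocab, unk="<unk>"):
    """Replace every word that is missing from the vocabulary with unk."""
    return [word if word in vocab else unk for word in text.split()]

encoded = encode_words(test_text, vocab)
print(encoded)  # ['the', '<unk>', 'sat', 'on', 'the', '<unk>']
```

Both "dog" and "sofa" collapse into the same marker, so the model can no longer tell them apart.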

Sub-word: BPE, WordPiece

Break rare words into frequent pieces (un + happi + ness). The vocabulary stays small, and unknown tokens become rare or, with byte-level variants, disappear entirely. Used by most modern LLMs.
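To make the un + happi + ness split concrete, here is a minimal sketch of greedy longest-match segmentation, the idea behind WordPiece-style inference. It is not how BPE or WordPiece vocabularies are learned: TOY_VOCAB is hand-written for this example, the function name subword_tokenize is hypothetical, and real tokenizers also mark word-internal pieces (for example with a "##" prefix), which is omitted here.

```python
def subword_tokenize(word, vocab, unk="<unk>"):
    """Greedily take the longest vocabulary piece at each position."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is a known piece.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here, so give up on the whole word.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces

# Hand-written toy vocabulary (an assumption, not a learned one).
TOY_VOCAB = {"un", "happi", "ness", "happy", "kind"}

print(subword_tokenize("unhappiness", TOY_VOCAB))  # ['un', 'happi', 'ness']
print(subword_tokenize("unkindness", TOY_VOCAB))   # ['un', 'kind', 'ness']
```

With this toy vocabulary a word such as "xyz" still comes back as the unknown marker. Practical vocabularies usually include every single character or byte as a fallback piece, which is what keeps unknowns rare.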

Character: one character each

Tiny vocabulary and almost nothing is ever unknown, but sequences get very long and a single token carries little word-level meaning on its own.
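The trade-off shows up even on a three-word phrase. In this small sketch (the sample text is made up) the character sequence is much longer than the word sequence, while the set of distinct symbols stays small.

```python
text = "unhappiness is rare"

# Character level: every character, including spaces, is a token.
char_tokens = list(text)
# Word level, for comparison.
word_tokens = text.split()

print(len(word_tokens))       # 3
print(len(char_tokens))       # 19
print(len(set(char_tokens)))  # number of distinct characters used
```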

The tricky cases

Contractions: don't → do + n't

Should "don't" be one token or two? Conventions differ.
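Both conventions can be sketched with the standard re module. The function names tokenize_keep and tokenize_split are hypothetical, and the splitting rule only covers the n't case, loosely following the Penn Treebank style of separating the negation.

```python
import re

# Words (with internal apostrophes) or single punctuation marks.
TOKEN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def tokenize_keep(text):
    """Convention A: a contraction stays one token."""
    return TOKEN.findall(text)

def tokenize_split(text):
    """Convention B: split the n't ending off into its own token."""
    spaced = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    return TOKEN.findall(spaced)

print(tokenize_keep("I don't know."))   # ['I', "don't", 'know', '.']
print(tokenize_split("I don't know."))  # ['I', 'do', "n't", 'know', '.']
```

Whichever convention you pick, it should match the one used by the data or the pretrained model you are working with.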

Punctuation: "U.S.A." vs "end."

A period can end a sentence or sit inside an abbreviation.
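One simple way to tell the two cases apart is to try an abbreviation pattern before the generic rules. The sketch below is an assumption-heavy heuristic: it only recognizes abbreviations made of single letters followed by periods, so forms like "Dr." or "etc." would still need an explicit list.

```python
import re

# Order matters: dotted abbreviations are tried before plain words
# and before standalone punctuation.
PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}|\w+|[^\w\s]")

def tokenize(text):
    return PATTERN.findall(text)

print(tokenize("The U.S.A. is big. The end."))
# ['The', 'U.S.A.', 'is', 'big', '.', 'The', 'end', '.']
```

The periods inside "U.S.A." stay with the abbreviation, while the periods after "big" and "end" become separate sentence-final tokens.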

No spaces: Chinese, Japanese

Some languages don't separate words with spaces at all.
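A classic baseline for such text is forward maximum matching: at each position, take the longest entry from a word list, and fall back to a single character when nothing matches. The sketch below uses a tiny hand-made dictionary as an assumption; production segmenters generally rely on statistical or learned models rather than this rule alone.

```python
def max_match(text, dictionary):
    """Segment text by repeatedly taking the longest dictionary word."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Toy dictionary for illustration only.
TOY_DICT = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}

print(max_match("我喜欢自然语言处理", TOY_DICT))
# ['我', '喜欢', '自然语言', '处理']
```

Because the longest match wins, "自然语言" is kept as one token even though "自然" and "语言" are also in the dictionary.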

Next

Once you have tokens, the next steps clean and normalize them: Text Cleaning, Stop Words, and Stemming.