Tokenization · Suman Bhadra Notes

Cutting text into pieces

Before anything else, NLP chops text into tokens, the atomic units a model reads. In classical pipelines a token is usually a word, but it can also be a sub-word piece or even a single character.

It sounds trivial ("just split on spaces"), but doing it well is surprisingly subtle: contractions, punctuation, hyphens, URLs, emojis, and languages without spaces all complicate the cut.

First stage of the pipeline

Tokenization is step one of the NLP pipeline. Every later step — cleaning, vectorizing, modeling — operates on these tokens.

See the cuts

Watch a sentence split on whitespace, then handle punctuation and contractions, then see how sub-word tokenization breaks a rare word into known pieces.
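As a rough illustration of the first two cuts, the sketch below splits a sentence on whitespace and then with a regular expression that separates punctuation. The example sentence and the pattern are assumptions chosen for this note, not a standard tokenizer.

```python
import re

sentence = "Hello, world! Isn't tokenization fun?"

# Stage 1: naive whitespace split; punctuation stays glued to the words.
whitespace_tokens = sentence.split()

# Stage 2: keep runs of word characters together (allowing one internal
# apostrophe, so "Isn't" survives) and emit each punctuation mark on its own.
WORD_OR_PUNCT = re.compile(r"\w+(?:'\w+)?|[^\w\s]")
regex_tokens = WORD_OR_PUNCT.findall(sentence)

print(whitespace_tokens)
# ['Hello,', 'world!', "Isn't", 'tokenization', 'fun?']
print(regex_tokens)
# ['Hello', ',', 'world', '!', "Isn't", 'tokenization', 'fun', '?']
```

The whitespace version treats "Hello," and "Hello" as different tokens, which is usually not what a downstream model wants.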

Levels of tokenization

Word: split on spaces and punctuation

Intuitive, but the vocabulary grows with every new word form, and any word not seen during training collapses into a single "unknown" token.
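A minimal sketch of the unknown-word problem, assuming a two-sentence toy corpus and a hypothetical `<unk>` placeholder. The helper names `build_vocab` and `encode` are made up for this example.

```python
def build_vocab(corpus):
    """Assign an integer id to every distinct word; id 0 is reserved for <unk>."""
    vocab = {"<unk>": 0}
    for sentence in corpus:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Map each word to its id, falling back to <unk> for unseen words."""
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.lower().split()]

corpus = ["the cat sat", "the dog ran"]
vocab = build_vocab(corpus)
ids = encode("the cat jumped", vocab)

print(vocab)  # {'<unk>': 0, 'the': 1, 'cat': 2, 'sat': 3, 'dog': 4, 'ran': 5}
print(ids)    # [1, 2, 0]  ("jumped" was never seen, so it maps to <unk>)
```

Every new word form adds an entry to `vocab`, and anything outside it loses its identity entirely.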

Sub-word: BPE, WordPiece

Break rare words into frequent pieces (un + happi + ness). The vocabulary stays small and unknown tokens become rare, or disappear entirely with byte-level variants. Used by modern LLMs.
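The splitting step can be sketched with greedy longest-match-first lookup, similar in spirit to how WordPiece applies a vocabulary that has already been learned. The toy vocabulary and the function name `subword_tokenize` are assumptions; real tokenizers learn tens of thousands of pieces from data and mark word-internal pieces specially.

```python
def subword_tokenize(word, vocab, unk="<unk>"):
    """Split a word into the longest known pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known piece (or nothing is left).
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece covers this position; give up on the whole word.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical hand-written vocabulary, for illustration only.
TOY_VOCAB = {"un", "happi", "ness", "happy", "kind"}

print(subword_tokenize("unhappiness", TOY_VOCAB))  # ['un', 'happi', 'ness']
print(subword_tokenize("unkind", TOY_VOCAB))       # ['un', 'kind']
print(subword_tokenize("xyz", TOY_VOCAB))          # ['<unk>']
```

The last call shows why practical vocabularies also include single characters or raw bytes: with those as a fallback, some piece always matches.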

Character: one char each

Tiny vocabulary and never an unknown token, but sequences get very long and a single token carries little meaning on its own.
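To make the trade-off concrete, this small sketch compares sequence length and vocabulary size for one short, made-up phrase.

```python
text = "unhappiness is rare"

char_tokens = list(text)        # one token per character, spaces included
word_tokens = text.split()      # one token per whitespace-separated word
char_vocab = sorted(set(char_tokens))

print(len(word_tokens))  # 3
print(len(char_tokens))  # 19
print(len(char_vocab))   # 10 distinct characters cover the whole phrase
```

Three word tokens become nineteen character tokens, while the set of symbols the model must know shrinks to ten.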

The tricky cases

Contractions: don't → do + n't

Should "don't" be one token or two? Conventions differ.
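Both conventions can be sketched in a few lines. The two-token version follows the Penn Treebank habit of splitting off the "n't" clitic. The function names and regular expressions here are illustrative assumptions and only cover the "n't" case.

```python
import re

def keep_contractions(text):
    """One-token convention: an internal apostrophe stays inside the word."""
    return re.findall(r"\w+(?:'\w+)?", text)

def split_contractions(text):
    """Two-token convention: separate the n't clitic from its host word."""
    spaced = re.sub(r"(\w+)(n't)\b", r"\1 \2", text)
    return spaced.split()

print(keep_contractions("I don't know"))   # ['I', "don't", 'know']
print(split_contractions("I don't know"))  # ['I', 'do', "n't", 'know']
```

Whichever convention is chosen, it has to match the one used when the model or vocabulary was built.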

Punctuation: "U.S.A." vs "end."

A period can end a sentence or sit inside an abbreviation.
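One simple way to tell the two cases apart is a lookup list of known abbreviations. The sketch below assumes a tiny hand-written list (`ABBREVIATIONS` is hypothetical); real tokenizers use much larger lists or learned rules.

```python
# Hypothetical, deliberately tiny abbreviation list for illustration.
ABBREVIATIONS = {"U.S.A.", "Dr.", "e.g."}

def split_periods(text):
    """Detach a trailing period unless the chunk is a known abbreviation."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS or not chunk.endswith("."):
            tokens.append(chunk)
        else:
            stem = chunk[:-1]
            if stem:
                tokens.append(stem)
            tokens.append(".")
    return tokens

tokens = split_periods("She moved to the U.S.A. last year and stayed until the end.")
print(tokens)
# ['She', 'moved', 'to', 'the', 'U.S.A.', 'last', 'year', 'and',
#  'stayed', 'until', 'the', 'end', '.']
```

This lookup still cannot tell when an abbreviation also ends the sentence, which is why the case is listed as tricky.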

No spaces: Chinese, Japanese

Some languages don't separate words with spaces at all.
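A classic baseline for such languages is forward maximum matching: at each position, take the longest dictionary word that fits. The sketch below assumes a tiny hand-made dictionary (`TOY_DICT`) and a maximum word length of four characters. Production segmenters rely on large lexicons or statistical models instead.

```python
def max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Hypothetical mini-dictionary; the sentence means "I like natural language processing".
TOY_DICT = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}

segments = max_match("我喜欢自然语言处理", TOY_DICT)
print(segments)  # ['我', '喜欢', '自然语言', '处理']
```

Because the match is greedy, "自然语言" wins over the shorter "自然" + "语言"; a different dictionary would give a different cut.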