Subword Tokenization (BPE) · Suman Bhadra Notes

The vocabulary problem

Before a model sees text, the text is split into tokens. The old approach, one token per whole word, has two failures: the vocabulary balloons to millions of entries, and any word the tokenizer never saw in training (a typo, a name, "unfriendliest") collapses into a single uninformative <UNK> token. Character-level tokens fix coverage but make sequences painfully long.
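The word-level failure is easy to reproduce. The sketch below is a minimal illustration, not any particular library's tokenizer: the helper names `build_word_vocab` and `word_tokenize` and the toy training sentence are all made up for this example.

```python
UNK = "<UNK>"

def build_word_vocab(corpus):
    """Give every distinct training word its own id; id 0 is reserved for UNK."""
    vocab = {UNK: 0}
    for word in corpus.split():
        vocab.setdefault(word, len(vocab))
    return vocab

def word_tokenize(text, vocab):
    """Whole-word lookup: anything outside the training vocabulary becomes UNK."""
    return [word if word in vocab else UNK for word in text.split()]

# Toy training text (an assumption for illustration).
vocab = build_word_vocab("we play chess and they play cards")

print(word_tokenize("they play chess", vocab))
print(word_tokenize("they playing chess", vocab))
```

Even though "play" is in the vocabulary, "playing" is not, so the second call loses that word entirely. Nothing learned about "play" carries over.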

Word-level

Huge vocabulary
Unseen words → <UNK>
No sharing between "play" and "playing"

Subword (BPE)

Small, fixed vocabulary
Any word built from known pieces — no <UNK>
"play" + "ing" shares the "play" chunk

Byte-Pair Encoding: merge the most frequent pair

BPE starts with every word split into characters, then repeats one move: find the adjacent pair of symbols that occurs most often across the corpus and merge it into a new token. A ▁ marks each word's end and acts as a fixed boundary: pairs are counted only within a word, never across words. The merges accumulate into an ordered list, and that list is then replayed to tokenize any word, including one the algorithm never saw.
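The merge loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the toy corpus and its word counts are invented for the example, `learn_bpe` and `merge_pair` are hypothetical helper names, and ties between equally frequent pairs go to whichever pair was counted first (production tokenizers choose their own tie-breaking rule).

```python
from collections import Counter

END = "▁"  # end-of-word marker, as in the text

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: count} corpus."""
    # Every word starts as a tuple of characters followed by the end marker.
    vocab = {tuple(word) + (END,): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs inside each word only, weighted by word count.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair counted wins
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges

# Toy corpus with made-up counts (an assumption for illustration).
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, 6)
for step, (left, right) in enumerate(merges, start=1):
    print(step, left, "+", right, "->", left + right)
```

On this toy corpus the first three merges build "est▁" step by step, because that ending is shared by the two most frequent long words. Only afterwards do "lo" and "low" earn merges of their own.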

Why every modern LLM uses this

No unknowns: full coverage

In the worst case, a strange word falls back to single characters (or to bytes, in byte-level BPE), but it is always representable. Nothing is ever <UNK>.

Right granularity: common = whole

Frequent words become a single token; rare words split into a few pieces. Sequences stay short where it matters.

Relatives: WordPiece, Unigram

BERT uses WordPiece, the GPT models use byte-level BPE, and SentencePiece offers a Unigram model. All are variations on this merge-or-score idea.

Tokens, not words, are the unit

When you hear an LLM has a "128k context" or is billed "per token", this is the token being counted. A rough rule of thumb in English: ~4 characters or ¾ of a word per token. This is the bridge from classic NLP counts like Bag of Words to how LLMs actually read text.
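The rule of thumb translates directly into arithmetic. The sketch below is only a back-of-the-envelope estimator: `estimate_tokens` is a hypothetical helper, the two ratios are the rough English averages quoted above, and a real count always requires running the model's actual tokenizer.

```python
def estimate_tokens(text, chars_per_token=4.0, words_per_token=0.75):
    """Return two rough token estimates: one from characters, one from words."""
    by_chars = len(text) / chars_per_token
    by_words = len(text.split()) / words_per_token
    return round(by_chars), round(by_words)

sample = "Frequent words become a single token; rare words split into a few pieces."
by_chars, by_words = estimate_tokens(sample)
print("estimated tokens from characters:", by_chars)
print("estimated tokens from words:", by_words)
```

The two estimates rarely agree exactly. Treat the gap between them as a reminder that the ratio depends on the text, the language, and the tokenizer.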

Drive the tokenizer yourself. The slider controls how many merges the vocabulary has learned (the merge list learned from the corpus above is extended from 4 to 6), and the buttons pick a word to tokenize, including words the corpus never contained. Drag from 0 to 6 and watch characters fuse into subwords and the token count fall.
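The slider can be imitated in code by replaying only the first k merges. This is a sketch under stated assumptions: the six-entry `MERGES` list is a hypothetical one of the kind a small toy corpus would produce, not necessarily the list behind the widget, and `encode` is an illustrative helper that assumes every character of the word is already in the base vocabulary.

```python
END = "▁"  # end-of-word marker

# Hypothetical merge list, in the order the merges were learned (an assumption).
MERGES = [("e", "s"), ("es", "t"), ("est", END), ("l", "o"), ("lo", "w"), ("n", "e")]

def encode(word, merges, k):
    """Tokenize `word` by replaying the first `k` merges in learned order."""
    symbols = list(word) + [END]
    for left, right in merges[:k]:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)  # fuse the pair into one token
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# "lowest" never appears in the merge list as a whole word, yet it still tokenizes.
for word in ("lowest", "widest"):
    for k in range(len(MERGES) + 1):
        tokens = encode(word, MERGES, k)
        print(word, "merges:", k, "tokens:", len(tokens), tokens)
```

With all six merges, "lowest" shrinks to two tokens built from known pieces, while a word made of unfamiliar letters stays at one token per character plus the end marker.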


"widest" only ever benefits from the early 'est' merges — its other letters are too rare to earn merges of their own. That's BPE's frequency logic in action: common chunks become single tokens, rare words stay in pieces. Real vocabularies just run this loop ~50,000 times instead of 6.