Subword Tokenization (BPE)
The vocabulary problem
Before a model sees text, the text is split into tokens. The old approach, one token per whole word, fails in two ways: the vocabulary balloons to millions of entries, and any word the model never saw in training (a typo, a name, "unfriendliest") collapses into a useless <UNK> token. Splitting into characters fixes coverage but makes sequences painfully long.
Whole-word tokens:
- Huge vocabulary
- Unseen words → <UNK>
- No sharing between "play" and "playing"
Subword tokens:
- Small, fixed vocabulary
- Any word built from known pieces, no <UNK>
- "play" + "ing" shares the "play" chunk
Byte-Pair Encoding: merge the most frequent pair
BPE starts with every word split into characters, then repeats one move: find the adjacent pair that occurs most often across the corpus and merge it into a new token. Watch the merges accumulate — then tokenize a word the algorithm never saw.
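The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the toy corpus and its word counts are assumptions chosen for the example, the function names (`train_bpe`, `merge_pair`, `tokenize`) are hypothetical, and ties between equally frequent pairs are broken by whichever pair was counted first. Real implementations also add details such as end-of-word markers or byte-level fallback, which are left out here.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} corpus."""
    # Every word starts out split into single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count each adjacent pair, weighted by how often its word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        # Most frequent pair; on a tie, the pair counted first wins.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges

def tokenize(word, merges):
    """Split a word into characters, then replay the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Assumed toy corpus (word -> count), for illustration only.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = train_bpe(corpus, 6)
print(merges)
# [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w'), ('n', 'e'), ('ne', 'w')]
print(tokenize("lowest", merges))  # never in the corpus -> ['low', 'est']
```

Note that "lowest" never appears in the assumed corpus, yet it tokenizes cleanly into two learned pieces, because tokenizing only replays the merge list and never consults the training words.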
Why every modern LLM uses this
Worst case, a strange word falls back to characters, but it is always representable. As long as the base vocabulary covers every character (or, in byte-level BPE, every byte), nothing is ever <UNK>.
Frequent words become a single token; rare words split into a few pieces. Sequences stay short where it matters.
When you hear an LLM has a "128k context" or is billed "per token", this is the token being counted. A rough rule of thumb in English: ~4 characters or ¾ of a word per token. This is the bridge from classic NLP counts like Bag of Words to how LLMs actually read text.
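The rule of thumb above turns into simple arithmetic. The sketch below only encodes that approximation (about 4 characters, or ¾ of a word, per token of English text); the function names are hypothetical, and real counts depend on the specific tokenizer and on the language of the text.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate for English text from its character count."""
    return round(len(text) / chars_per_token)

def tokens_to_words(num_tokens, words_per_token=0.75):
    """Rough number of English words that fit in a token budget."""
    return int(num_tokens * words_per_token)

print(estimate_tokens("x" * 400))   # 400 characters -> about 100 tokens
print(tokens_to_words(128_000))     # a "128k context" -> about 96000 words
```

For billing or hard context limits, count with the model's actual tokenizer rather than this estimate.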
Drive the tokenizer yourself. The slider controls how many merges the vocabulary has learned (up to 6 on the corpus above); the buttons pick a word to tokenize, including words the corpus never contained. Drag 0 → 6 and watch characters fuse into subwords and the token count fall.
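The slider's effect can be imitated offline by applying only the first k merges and counting tokens. The merge list below is a hypothetical one, assumed for illustration (the order a toy corpus of "low", "lower", "newest" and "widest" might produce); it is not taken from the interactive demo, and `apply_merges` is a made-up helper name.

```python
# Hypothetical merge list in learned order (an assumption for this sketch).
MERGES = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w"), ("n", "e"), ("ne", "w")]

def apply_merges(word, merges):
    """Split `word` into characters, then apply each merge rule in order."""
    symbols = list(word)
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == (left, right):
                symbols[i:i + 2] = [left + right]  # fuse the pair in place
            i += 1
    return symbols

# Move the "slider" from 0 to 6 merges and compare two words.
for k in range(len(MERGES) + 1):
    newest = apply_merges("newest", MERGES[:k])
    widest = apply_merges("widest", MERGES[:k])
    print(k, len(newest), newest, len(widest), widest)
```

Under these assumed merges, "newest" shrinks from 6 tokens to 2 (`new` + `est`), while "widest" stops at 4 (`w`, `i`, `d`, `est`) after the second merge.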
"widest" only ever benefits from the early 'est' merges — its other letters are too rare to earn merges of their own. That's BPE's frequency logic in action: common chunks become single tokens, rare words stay in pieces. Real vocabularies just run this loop ~50,000 times instead of 6.