One-Hot Encoding for Text

NLP, vectorization, sparse baseline

One slot per word

One-hot encoding gives every word in the vocabulary its own position in a long vector. A word is represented by a 1 in its slot and 0 everywhere else.

If the vocabulary has 10,000 words, every word is a 10,000-long vector with a single 1. Simple and unambiguous, but, as we'll see, deeply wasteful.
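A minimal sketch of the idea in plain Python. The toy five-word vocabulary and the helper names `build_vocab` and `one_hot` are assumptions made for illustration, not part of any library.

```python
def build_vocab(words):
    """Map each distinct word to a slot index (sorted order, chosen arbitrarily)."""
    return {word: index for index, word in enumerate(sorted(set(words)))}


def one_hot(word, vocab):
    """Return a list of len(vocab) zeros with a single 1 in the word's slot."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector


# Toy vocabulary for illustration; a real one would have thousands of entries.
vocab = build_vocab(["cat", "dog", "democracy", "sat", "the"])
cat_vector = one_hot("cat", vocab)

print(vocab)
print(cat_vector)  # [1, 0, 0, 0, 0]
```

The vector length always equals the vocabulary size, so growing the vocabulary lengthens every vector.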

See it, and see its flaw

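Each word becomes a one-hot vector, and a document stacks those vectors into a matrix that is almost entirely zeros. Then comes the killer limitation: every word is equally far from every other, because any two distinct one-hot vectors have a dot product of 0 and a Euclidean distance of √2.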
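The stacking step can be sketched as follows. The example sentence and the hand-written `vocab` mapping are assumptions for illustration; each row of `matrix` is the one-hot vector of one token.

```python
# Hypothetical toy vocabulary: word -> slot index.
vocab = {"cat": 0, "mat": 1, "on": 2, "sat": 3, "the": 4}


def one_hot(word, vocab):
    """Return a list of len(vocab) zeros with a single 1 in the word's slot."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector


document = "the cat sat on the mat".split()

# One row per token, one column per vocabulary word.
matrix = [one_hot(word, vocab) for word in document]

for word, row in zip(document, matrix):
    print(f"{word:>4} {row}")
```

Every row contains exactly one 1, so a document of n tokens over a vocabulary of V words fills only n of its n × V cells.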

The limitations

Huge & sparse: vocab-sized

Each vector is as long as the vocabulary (often tens of thousands of entries), and all but one entry is zero. That is wasteful in both memory and compute.
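A rough back-of-the-envelope version of that waste, using the 10,000-word vocabulary from earlier. The 4-byte value size and the 4-byte index are assumptions; actual storage depends on the data types and libraries used.

```python
vocab_size = 10_000      # vocabulary size used in the text
bytes_per_value = 4      # assumption: each entry stored as a 32-bit number
index_bytes = 4          # assumption: one 32-bit integer holds the slot index

dense_bytes = vocab_size * bytes_per_value         # full one-hot vector for one word
zero_fraction = (vocab_size - 1) / vocab_size      # share of entries that are 0

print(dense_bytes)                 # 40000
print(zero_fraction)               # 0.9999
print(dense_bytes // index_bytes)  # 10000
```

Under these assumptions, storing just the index of the 1 carries the same information in one ten-thousandth of the space.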

No similarity: all orthogonal

"cat" and "dog" are exactly as different as "cat" and "democracy". The encoding knows nothing about meaning.
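This can be checked directly. The sketch below uses a hypothetical three-word vocabulary and compares the pairs with a dot product and a Euclidean distance; because each one-hot vector has length 1, the dot product here is also the cosine similarity.

```python
import math

# Hypothetical toy vocabulary: word -> slot index.
vocab = {"cat": 0, "democracy": 1, "dog": 2}


def one_hot(word):
    """Return the one-hot vector for a word in the toy vocabulary."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


cat, dog, democracy = one_hot("cat"), one_hot("dog"), one_hot("democracy")

print(dot(cat, dog), dot(cat, democracy))                # 0 0
print(euclidean(cat, dog) == euclidean(cat, democracy))  # True
```

Both distances equal √2, no matter which two distinct words are compared.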

No order: slots are arbitrary

A word's position in the vector is just an arbitrary index. Neighbouring slots imply no relationship between the words assigned to them.

Out-of-vocabulary: unseen = nothing

A word that is not in the vocabulary has no slot at all, so it cannot be represented.
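One common workaround, which the text above does not cover and is shown here only as an assumption, is to reserve a slot for unknown words. The token name `<unk>` and the toy vocabulary are hypothetical.

```python
UNK = "<unk>"  # hypothetical name for the reserved "unknown word" slot

# Hypothetical toy vocabulary with slot 0 reserved for unseen words.
vocab = {UNK: 0, "cat": 1, "dog": 2}


def one_hot(word, vocab):
    """One-hot encode a word, falling back to the UNK slot if it is unseen."""
    index = vocab.get(word, vocab[UNK])
    vector = [0] * len(vocab)
    vector[index] = 1
    return vector


print(one_hot("cat", vocab))      # [0, 1, 0]
print(one_hot("axolotl", vocab))  # [1, 0, 0]
```

The fallback avoids a lookup failure, but every unseen word collapses onto the same vector, so the information about which word it was is still lost.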

So why learn it?

One-hot is the conceptual foundation everything else builds on. Bag of Words is essentially summing one-hots into counts; word embeddings were invented precisely to fix the "no similarity" flaw by giving words dense vectors where related words sit close together.
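The Bag of Words claim can be sketched by literally summing one-hot vectors. The sentence, the toy `vocab`, and the helper names are assumptions for illustration; `collections.Counter` is used only as a cross-check.

```python
from collections import Counter

# Hypothetical toy vocabulary: word -> slot index.
vocab = {"cat": 0, "mat": 1, "on": 2, "sat": 3, "the": 4}


def one_hot(word):
    """Return the one-hot vector for a word in the toy vocabulary."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector


def bag_of_words(tokens):
    """Sum the one-hot vectors of all tokens, slot by slot."""
    counts = [0] * len(vocab)
    for token in tokens:
        for slot, value in enumerate(one_hot(token)):
            counts[slot] += value
    return counts


tokens = "the cat sat on the mat".split()
bow = bag_of_words(tokens)
print(bow)  # [1, 1, 1, 1, 2]

# Cross-check: the sum matches plain word counts in slot order.
word_counts = Counter(tokens)
print(bow == [word_counts[word] for word in sorted(vocab, key=vocab.get)])  # True
```

The summed vector keeps how often each word occurs but discards the order in which the words appeared.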

The fix, previewed

Embeddings replace a 10,000-long sparse one-hot with, say, a 300-long dense vector — small, and arranged so that meaning lives in geometry.
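One way to see the link between the two representations: multiplying a one-hot vector by an embedding matrix simply selects one row of that matrix. In the sketch below the matrix holds random placeholder values and a tiny dimension of 4 instead of 300, both assumptions for illustration. Only a trained model would place related words close together.

```python
import random

# Hypothetical toy vocabulary: word -> slot index.
vocab = {"cat": 0, "democracy": 1, "dog": 2}
embedding_dim = 4  # tiny stand-in for the 300 mentioned in the text

random.seed(0)
# Placeholder values only: in practice these rows are learned during training.
embedding_matrix = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab
]


def one_hot(word):
    """Return the one-hot vector for a word in the toy vocabulary."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector


def vector_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix, returning one value per column."""
    return [
        sum(v * row[col] for v, row in zip(vector, matrix))
        for col in range(len(matrix[0]))
    ]


via_matmul = vector_times_matrix(one_hot("dog"), embedding_matrix)
via_lookup = embedding_matrix[vocab["dog"]]

print(via_matmul == via_lookup)  # True
```

Because the product only ever picks out a single row, the sparse one-hot vector never needs to be built: indexing the matrix by the word's slot gives the same dense vector.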