Why Convert Text to Numbers? · Suman Bhadra Notes

Models do math, not reading

Under the hood, every ML model is arithmetic: it multiplies inputs by weights and adds them up. You can't multiply the word "great" by 0.7, so before any modeling, text must become numbers.

This step is called vectorization (or feature extraction): turning each document into a vector of numbers that captures something about its content. It's the bridge from the language world into the math world.

The core problem

Represent a document as a fixed-length list of numbers, such that similar documents get similar vectors.

See the bridge

A sentence can't enter a model as-is. The animation shows why, then walks through the basic recipe: build a vocabulary, map each word to a position, and emit a number vector.
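The three-step recipe can be sketched in a few lines of plain Python. This is a minimal illustration, not a production tokenizer: it assumes lowercase, whitespace-separated words, and the names `build_vocabulary`, `vectorize`, and the two-sentence corpus are hypothetical choices made for this example.

```python
def build_vocabulary(documents):
    """Step 1 and 2: collect distinct words and give each a fixed position."""
    vocabulary = {}
    for document in documents:
        for word in document.lower().split():
            if word not in vocabulary:
                # Positions are assigned in order of first appearance.
                vocabulary[word] = len(vocabulary)
    return vocabulary


def vectorize(document, vocabulary):
    """Step 3: emit a vector with one slot per vocabulary word (its count)."""
    vector = [0] * len(vocabulary)
    for word in document.lower().split():
        position = vocabulary.get(word)
        if position is not None:  # words outside the vocabulary are ignored
            vector[position] += 1
    return vector


# Hypothetical toy corpus, used only for illustration.
corpus = ["the movie was great", "the plot was thin"]
vocabulary = build_vocabulary(corpus)
vectors = [vectorize(doc, vocabulary) for doc in corpus]

print(vocabulary)
print(vectors)
```

Both sentences come out as vectors of length six, one slot per vocabulary word, even though they share only two words. Ignoring unknown words is one simple design choice; other schemes reserve a dedicated slot for them.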

What makes a good representation?

Fixed length: same size

Every document → a vector of the same dimension, whatever its length.

Captures content: meaning in numbers

The numbers should reflect which words appear (and ideally, what they mean).

Similar → close: geometry

Documents about the same topic should land near each other in vector space.
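One common way to make "near each other" concrete is cosine similarity, which measures the angle between two vectors. The sketch below is illustrative only: the five-word vocabulary and the three count vectors are made-up values, and cosine similarity is one of several possible closeness measures.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical count vectors over the vocabulary
# ["goal", "match", "team", "stock", "market"].
sports_a = [2, 1, 1, 0, 0]
sports_b = [1, 2, 1, 0, 0]
finance = [0, 0, 0, 2, 1]

same_topic = cosine_similarity(sports_a, sports_b)
different_topic = cosine_similarity(sports_a, finance)
print(same_topic, different_topic)
```

All three vectors have the same dimension, so they can be compared directly. The two sports vectors score close to 1, while the sports and finance vectors share no words and score 0.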

The methods in this track

One-Hot: one slot per word

The naive baseline — and its limits. See One-Hot Encoding.

Bag of Words: word counts

Count how often each word appears. See Bag of Words.

TF-IDF: weighted counts

Down-weight common words, up-weight distinctive ones. See TF-IDF.

Embeddings: dense meaning

Learn vectors where meaning lives in geometry. See Word2Vec.

The progression

Each method fixes a weakness of the last — from sparse one-hot, to counts, to weighted counts, to dense learned meaning.
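The first three steps of that progression can be compared side by side on a toy corpus. This is a rough sketch under stated assumptions: the corpus and function names are hypothetical, tokens are split on whitespace, and the TF-IDF weight used here (raw count times log(N / document frequency)) is one common textbook variant; libraries often apply smoothing or normalization on top. Embeddings are left out because they have to be learned from data.

```python
import math
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
corpus = [
    "the movie was great great",
    "the movie was dull",
    "the plot was thin",
]
tokenized = [doc.split() for doc in corpus]
# Sorted vocabulary: ['dull', 'great', 'movie', 'plot', 'the', 'thin', 'was']
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Number of documents in which each word appears at least once.
document_frequency = Counter(word for tokens in tokenized for word in set(tokens))


def one_hot(word):
    """One slot per vocabulary word: 1 in the word's slot, 0 everywhere else."""
    return [1 if word == v else 0 for v in vocabulary]


def bag_of_words(tokens):
    """Count how often each vocabulary word appears in the document."""
    counts = Counter(tokens)
    return [counts[v] for v in vocabulary]


def tf_idf(tokens):
    """Weight each count by log(N / df), so words found everywhere get 0."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    return [counts[v] * math.log(n_docs / document_frequency[v]) for v in vocabulary]


first = tokenized[0]
print(one_hot("great"))
print(bag_of_words(first))
print([round(weight, 3) for weight in tf_idf(first)])
```

In the first document, "the" and "was" appear in every document of the corpus, so their TF-IDF weight drops to zero, while "great" (two occurrences, found in only one document) gets the largest weight.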