Embedding Models
From words to whole meanings
Word2Vec gave every word its own vector. An embedding model goes a step further: it gives a whole sentence, paragraph, or document one vector that captures what it means.
That's why "How do I reset my password?" and "Forgot my login credentials" land right next to each other in embedding space despite sharing almost no words (only "my"). The model has learned that they're asking the same thing, and similarity of meaning becomes closeness in geometry.
Word vectors were great for word analogies, but a sentence was just a bag of word vectors: averaging them loses word order and nuance.
A transformer encoder reads the whole text — order, negation, context and all — and compresses it into a single fixed-length vector.
Paraphrases cluster together and unrelated texts drift apart, so comparing meanings becomes comparing points.
An embedding model is a function: text in → one meaning-vector out — and texts that mean similar things come out close together.
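To make the bag-of-words problem concrete, here is a minimal sketch. The 2-D word vectors are invented for illustration (real word vectors have hundreds of dimensions), and `average_embedding` is a hypothetical helper, not part of any library.

```python
# Toy word vectors, made up for this example (not from a real model).
WORD_VECS = {
    "dog":   [1.0, 0.0],
    "bites": [0.25, 0.75],
    "man":   [0.5, 0.5],
}

def average_embedding(sentence):
    """Bag-of-words sentence vector: the mean of the word vectors."""
    vecs = [WORD_VECS[w] for w in sentence.lower().split()]
    dims = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dims)]

a = average_embedding("dog bites man")
b = average_embedding("man bites dog")

# Opposite meanings, identical vectors: the average cannot see word order.
print(a == b)  # True
```

A transformer encoder avoids this because it reads the tokens in order, so the two sentences would get different vectors.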
How they're trained: contrastive learning
Under the hood sits a transformer encoder (a BERT-style model), but the magic is in the training objective. Instead of predicting the next word, the model is trained on pairs: a question and its answer, two duplicate questions from a forum, a headline and its article. The rule is simple:
Texts that belong together (question ↔ its answer, two paraphrases) should get embeddings that point the same way — the loss pulls them closer.
Texts that don't belong together should end up far apart — the loss pushes their embeddings away from each other.
The clever trick: within one training batch, the partner text of every other pair serves as a negative. One batch of 256 question-answer pairs gives each question 255 answers to push away from.
Repeat over millions of pairs and the space organises itself: meanings that match attract, meanings that don't repel. This recipe was popularised by SBERT / sentence-transformers (2019), and modern embedding models — open-source and API-based alike — are its descendants, trained on far more pairs.
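The pull-together, push-apart rule with in-batch negatives can be sketched as a cross-entropy loss over cosine similarities. This is an illustrative sketch, not the exact objective of any particular model: the function name, the temperature of 0.05, and the toy 2-D vectors are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_contrastive_loss(queries, positives, temperature=0.05):
    """For query i, positives[i] is the match; every other
    positives[j] in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the true partner as the correct "class".
        total += log_sum_exp - logits[i]
    return total / len(queries)

# Toy batch of three pairs (vectors invented for illustration).
queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # every pair mismatched

loss_aligned = in_batch_contrastive_loss(queries, aligned)
loss_shuffled = in_batch_contrastive_loss(queries, shuffled)
print(loss_aligned < loss_shuffled)  # True
```

The loss is low when each query already sits closest to its own partner and high when it sits closer to someone else's, which is the signal training uses to reshape the space.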
Using them: embed once, compare cheaply
The workflow is always the same. Embed your corpus once and store the vectors — usually in a vector database. At query time, embed the incoming text with the same model and rank stored vectors by cosine similarity. The expensive part (running the encoder over every document) happens once, offline; each search is just fast vector math.
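Here is a minimal sketch of that workflow, with a plain dictionary standing in for the vector database. The 3-D vectors are made up and stand in for the output of a real embedding model; the document ids and the `search` helper are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Offline, once: embed the corpus and store the normalised vectors.
# These numbers are invented; a real model would produce them.
corpus_vectors = {
    "reset-password": [0.9, 0.1, 0.0],
    "billing-faq":    [0.1, 0.9, 0.1],
    "shipping-times": [0.0, 0.2, 0.9],
}
index = {doc_id: normalize(vec) for doc_id, vec in corpus_vectors.items()}

def search(query_vec, index, top_k=2):
    """Rank stored vectors by cosine similarity to the query vector."""
    q = normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), doc_id)
        for doc_id, vec in index.items()
    ]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:top_k]]

# Query time: this vector stands in for the embedded incoming text.
query_vec = [0.8, 0.2, 0.1]
results = search(query_vec, index)
print(results[0][0])  # reset-password
```

This brute-force loop compares the query against every stored vector; vector databases typically use approximate nearest-neighbour indexes to avoid that at scale.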
Semantic search: search that understands intent, and the retrieval step of RAG, which fetches the chunks whose embeddings sit closest to the question.
Clustering and deduplication: near-duplicate detection, grouping support tickets by issue, and organising documents by topic are all nearest-neighbour problems in embedding space.
Recommendations: embed items and what a user engaged with; recommending becomes "find vectors near the ones they liked".
Watch it happen
The animation walks through the full story: sentences pass through the encoder and become points in a 2-D space, contrastive training pulls a matching pair together and pushes a random pair apart, a query finds its neighbours by cosine angle, and finally a cross-encoder reranks the bi-encoder's top-4.
Bi-encoder vs. cross-encoder
When people say "embedding model" they mean a bi-encoder: query and document are embedded separately, then compared. There's a slower sibling — the cross-encoder — that reads both texts together in one pass and outputs a relevance score directly.
Bi-encoder:
- Embeds query and document independently
- Document vectors are precomputed once
- Search = cosine over stored vectors: millions of docs in milliseconds
- Slightly less accurate: the two texts never "see" each other
Cross-encoder:
- Reads query + document together, attention across both
- Nothing can be precomputed: one full forward pass per pair
- Far too slow to score a whole corpus
- More accurate: catches subtleties the bi-encoder misses
Production systems get the best of both with retrieve-then-rerank: the bi-encoder fetches the top-k candidates fast (say, 50 out of a million), then the cross-encoder carefully rescores just those 50 and reorders them. Cheap recall first, expensive precision second.
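The two-stage pattern can be sketched as follows. Everything here is a stand-in: the unit-length 2-D vectors are invented, and `toy_cross_score` is a crude word-overlap function playing the role of a real cross-encoder, which would be a transformer reading both texts together.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve_then_rerank(query, query_vec, docs, doc_vecs, cross_score, k=2):
    """Stage 1: cheap vector ranking keeps the top-k candidates.
    Stage 2: the expensive scorer reorders only those k."""
    by_vector = sorted(
        range(len(docs)),
        key=lambda i: dot(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_vector[:k]
    return sorted(
        candidates,
        key=lambda i: cross_score(query, docs[i]),
        reverse=True,
    )

def toy_cross_score(query, doc):
    """Hypothetical stand-in for a cross-encoder: counts shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

docs = [
    "reset your password from the login page",
    "password strength requirements",
    "update billing details",
    "track a shipment",
]
# Invented unit-length vectors, so the dot product equals cosine similarity.
doc_vecs = [[0.6, 0.8], [0.8, 0.6], [0.0, 1.0], [-0.6, 0.8]]

query = "how do I reset my password"
query_vec = [1.0, 0.0]

ranked = retrieve_then_rerank(query, query_vec, docs, doc_vecs, toy_cross_score)
print(ranked)  # [0, 1]
```

In this toy data the vector stage ranks document 1 first, and the second stage promotes document 0 because it looks at the query and the document text together.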
Practical notes
- Dimensions — typical models output vectors of roughly 384 to 3000 dimensions. Bigger isn't automatically better: it costs storage, memory and search speed.
- Same model on both sides — the query and the documents must be embedded by the same model (and version). Mixing models puts vectors in incompatible spaces, and the distances become nonsense.
- Chunking matters — embedding models have their own input limit, and one vector can only hold so much meaning. Long documents are split into chunks before embedding — see RAG and context windows for why size and placement matter.
- Embeddings ≠ keywords — vectors can miss exact terms like product codes, names, or rare jargon. Hybrid search (BM25 keyword scoring + vector similarity) often beats either alone.
- Choosing a model — MTEB (Massive Text Embedding Benchmark) is the standard leaderboard for comparing embedding models across retrieval, clustering and classification tasks.
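For the hybrid search note above, one common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). That technique is not named in this article; it is swapped in here as an example, with the conventional constant k = 60. The document ids and both input rankings are invented, and the BM25 and vector searches that would produce them are assumed to have run already.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document earns 1 / (k + rank)
    from every list it appears in, and the totals decide the order."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for a query that contains a product code.
keyword_ranking = ["sku-4471-manual", "returns-policy", "setup-guide"]
vector_ranking = ["setup-guide", "sku-4471-manual", "warranty-info"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused[0])  # sku-4471-manual
```

RRF uses only rank positions, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales.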