Embedding Models · Suman Bhadra Notes

From words to whole meanings

Word2Vec gave every word its own vector. An embedding model goes a step further: it gives a whole sentence, paragraph, or document one vector that captures what it means.

That's why "How do I reset my password?" and "Forgot my login credentials" land right next to each other in embedding space — despite sharing no meaningful words. The model has learned that they're asking the same thing, and similarity of meaning becomes closeness in geometry.

Word2Vec era: one vector per word

Great for word analogies, but a sentence was just a bag of word vectors — averaging them loses word order and nuance.

Embedding model: one vector per text

A transformer encoder reads the whole text — order, negation, context and all — and compresses it into a single fixed-length vector.

The payoff: meaning = position

Paraphrases cluster together, unrelated texts drift apart. Comparing meanings becomes comparing points.
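To make the "bag of word vectors" problem concrete, here is a minimal sketch in Python. The 2-D word vectors are invented for illustration (real Word2Vec vectors have hundreds of dimensions), and `average_embedding` is a hypothetical helper, not part of any library. It shows that averaging gives "dog bites man" and "man bites dog" the same sentence vector, because a mean ignores order.

```python
import math

# Toy 2-D word vectors. The numbers are made up for illustration only.
WORD_VECTORS = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [-1.0, 0.5],
}

def average_embedding(sentence):
    """Bag-of-words sentence vector: the mean of the word vectors."""
    vectors = [WORD_VECTORS[word] for word in sentence.lower().split()]
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

a = average_embedding("dog bites man")
b = average_embedding("man bites dog")

# The two sentences mean different things, yet their averaged vectors match.
same_vector = all(math.isclose(x, y) for x, y in zip(a, b))
print(same_vector)
```

An encoder that reads the words in sequence is not forced into this collision, which is the gap embedding models are meant to close.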

In one sentence

An embedding model is a function: text in → one meaning-vector out — and texts that mean similar things come out close together.

How they're trained: contrastive learning

Under the hood sits a transformer encoder (a BERT-style model), but the magic is in the training objective. Rather than predicting a missing or next word, as language-model pretraining does, the model is trained on pairs: a question and its answer, two duplicate questions from a forum, a headline and its article. The rule is simple:

Positive pairs pull together

Texts that belong together (question ↔ its answer, two paraphrases) should get embeddings that point the same way — the loss pulls them closer.

Negative pairs push apart

Texts that don't belong together should end up far apart — the loss pushes their embeddings away from each other.

In-batch negatives: free negatives

The clever trick: within one training batch, every other example's pair serves as a negative. One batch of 256 pairs gives each question 255 things to push away from.

Repeat over millions of pairs and the space organises itself: meanings that match attract, meanings that don't repel. This recipe was popularised by SBERT / sentence-transformers (2019), and modern embedding models — open-source and API-based alike — are its descendants, trained on far more pairs.
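The pull-together, push-apart rule with in-batch negatives can be sketched as a softmax cross-entropy over cosine similarities. This is a toy illustration in plain Python, not the training code of any particular model: the function name, the 2-D vectors and the `temperature=0.05` value are all assumptions chosen for the example. Row `i` of `queries` is paired with row `i` of `documents`, and every other row in the batch acts as a negative.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_contrastive_loss(queries, documents, temperature=0.05):
    """Mean cross-entropy where queries[i] should match documents[i].

    Every documents[j] with j != i serves as a negative for queries[i].
    The temperature value is an assumed hyperparameter for this sketch.
    """
    q = [normalize(v) for v in queries]
    d = [normalize(v) for v in documents]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, dj) / temperature for dj in d]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum - logits[i]  # negative log-probability of the true pair
    return total / len(q)

# Invented toy batch: each query points roughly the same way as its own document.
queries = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.2]]
aligned = [[0.9, 0.0], [0.1, 1.0], [-1.0, 0.0]]
# Same documents, but paired with the wrong queries.
shuffled = [aligned[1], aligned[2], aligned[0]]

loss_aligned = in_batch_contrastive_loss(queries, aligned)
loss_shuffled = in_batch_contrastive_loss(queries, shuffled)
print(loss_aligned < loss_shuffled)
```

The loss is low when each query already sits closest to its own document and high when it does not, so minimising it moves matching pairs together and mismatched pairs apart.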

Using them: embed once, compare cheaply

The workflow is always the same. Embed your corpus once and store the vectors — usually in a vector database. At query time, embed the incoming text with the same model and rank stored vectors by cosine similarity. The expensive part (running the encoder over every document) happens once, offline; each search is just fast vector math.
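The embed-once, compare-cheaply workflow can be sketched as follows. No real encoder is called here: the 3-D vectors are hand-written stand-ins for model output, the second and third document texts are invented, and `search` is a hypothetical helper that scans a plain dictionary rather than a vector database.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Offline step: pretend each vector came from running the encoder once per document.
# The numbers are invented for illustration.
index = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "Pricing for the team plan": [0.0, 0.2, 0.9],
    "Export data to CSV": [0.1, 0.9, 0.2],
}

def search(query_vector, index, top_k=2):
    """Rank stored vectors by cosine similarity to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index.items()]
    scored.sort(reverse=True)
    return scored[:top_k]

# Online step: the query must be embedded by the same model.
# This hand-written vector stands in for "Forgot my login credentials".
query_vector = [0.8, 0.2, 0.1]
results = search(query_vector, index)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A linear scan like this is fine for a few thousand vectors. Larger corpora typically rely on an approximate nearest-neighbour index, which is what vector databases provide.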

Semantic search & RAG: find by meaning

Search that understands intent, and the retrieval step of RAG — fetch the chunks whose embeddings sit closest to the question.

Dedup & clustering: group look-alikes

Near-duplicate detection, grouping support tickets by issue, organising documents by topic — all nearest-neighbour problems in embedding space.

Recommendation: "more like this"

Embed items and what a user engaged with; recommending becomes "find vectors near the ones they liked".
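As one example of these nearest-neighbour uses, near-duplicate detection can be sketched as "flag every pair whose cosine similarity exceeds a threshold". The ticket texts, the 2-D vectors and the `0.95` threshold are all assumptions made up for this sketch. A real threshold would need tuning against the model and data in use.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented support tickets with hand-written stand-in embeddings.
ticket_vectors = {
    "Cannot log in after password change": [0.9, 0.1],
    "Login fails since I changed my password": [0.85, 0.2],
    "Invoice shows the wrong amount": [0.1, 0.95],
}

def near_duplicates(vectors, threshold=0.95):
    """Return every pair of texts whose embeddings are at least `threshold` similar."""
    pairs = []
    for (text_a, vec_a), (text_b, vec_b) in combinations(vectors.items(), 2):
        if cosine_similarity(vec_a, vec_b) >= threshold:
            pairs.append((text_a, text_b))
    return pairs

duplicates = near_duplicates(ticket_vectors)
print(duplicates)
```

Comparing all pairs is quadratic in the number of texts, so this brute-force form only suits small collections.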

Watch it happen

The animation walks the full story: sentences pass through the encoder and become points in a 2-D space, contrastive training pulls a matching pair together and pushes a random pair apart, a query finds its neighbours by cosine angle, and finally a cross-encoder reranks the bi-encoder's top-4.

Bi-encoder vs. cross-encoder

When people say "embedding model" they mean a bi-encoder: query and document are embedded separately, then compared. There's a slower sibling — the cross-encoder — that reads both texts together in one pass and outputs a relevance score directly.

Bi-encoder (embedding model)

Embeds query and document independently
Document vectors are precomputed once
Search = cosine over stored vectors: with an approximate nearest-neighbour index, millions of docs in milliseconds
Slightly less accurate: the two texts never "see" each other

Cross-encoder (reranker)

Reads query + document together, attention across both
Nothing can be precomputed — one full forward pass per pair
Far too slow to score a whole corpus
More accurate — catches subtleties the bi-encoder misses

Production systems get the best of both with retrieve-then-rerank: the bi-encoder fetches the top-k candidates fast (say, 50 out of a million), then the cross-encoder carefully rescores just those 50 and reorders them. Cheap recall first, expensive precision second.
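The two-stage flow can be sketched like this. Neither stage calls a real model: the document vectors are hand-written stand-ins for bi-encoder output, and `FAKE_CROSS_SCORES` is an invented lookup table standing in for a cross-encoder's relevance scores. `retrieve_then_rerank` and its parameters are hypothetical names for this illustration.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve_then_rerank(query_text, query_vector, index, pair_score, top_k=3, top_n=2):
    """index: list of (doc_text, doc_vector). pair_score: callable(query, doc) -> float."""
    # Stage 1: cheap recall using precomputed bi-encoder vectors.
    candidates = sorted(
        index,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )[:top_k]
    # Stage 2: expensive precision, run only on the top_k survivors.
    reranked = sorted(
        candidates,
        key=lambda item: pair_score(query_text, item[0]),
        reverse=True,
    )
    return [text for text, _ in reranked[:top_n]]

# Invented corpus with made-up 2-D vectors.
index = [
    ("Reset your password from the login page", [0.9, 0.1]),
    ("Password rules: length and symbols", [0.95, 0.05]),
    ("Change your billing address", [0.1, 0.9]),
    ("Recover a locked account", [0.8, 0.3]),
]

# Placeholder for a cross-encoder: fixed, made-up relevance scores per document.
FAKE_CROSS_SCORES = {
    "Reset your password from the login page": 0.92,
    "Password rules: length and symbols": 0.31,
    "Change your billing address": 0.02,
    "Recover a locked account": 0.77,
}

def fake_pair_score(query_text, doc_text):
    return FAKE_CROSS_SCORES[doc_text]

results = retrieve_then_rerank(
    "Forgot my login credentials", [1.0, 0.1], index, fake_pair_score
)
print(results)
```

In this toy run the password-rules document survives the first stage on vector similarity alone, and the second stage demotes it below the account-recovery document.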

Practical notes

Things that bite in practice

Dimensions — typical models output vectors of roughly 384 to 3000 dimensions. Bigger isn't automatically better: it costs storage, memory and search speed.
Same model on both sides — the query and the documents must be embedded by the same model (and version). Mixing models puts vectors in incompatible spaces, and the distances become nonsense.
Chunking matters — embedding models have their own input limit, and one vector can only hold so much meaning. Long documents are split into chunks before embedding — see RAG and context windows for why size and placement matter.
Embeddings ≠ keywords — vectors can miss exact terms like product codes, names, or rare jargon. Hybrid search (BM25 keyword scoring + vector similarity) often beats either alone.
Choosing a model — MTEB (Massive Text Embedding Benchmark) is the standard leaderboard for comparing embedding models across retrieval, clustering and classification tasks.
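For the hybrid search note above, one common way to combine a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The notes do not prescribe a fusion method, so treat this as one option among several. The document IDs and both rankings are invented, and `k=60` is the constant commonly used with RRF rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids (best first) into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Invented results for a query containing an exact product code.
bm25_ranking = ["sku-4417-manual", "returns-policy", "setup-guide"]    # keyword hits
vector_ranking = ["setup-guide", "sku-4417-manual", "warranty-faq"]    # semantic hits

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Because RRF uses only rank positions, it sidesteps the problem that BM25 scores and cosine similarities live on different scales.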