Mini Project: Finding Similar Words & Documents
The payoff of embeddings
Now combine the pieces: represent everything as vectors, then rank by cosine similarity to find the nearest neighbours of any query. That's semantic search — and it works on words and whole documents.
It's the same engine behind "related articles", "customers also bought", and the retrieval step in RAG.
Watch a query find its neighbours
Pick the word "happy", score it against every other word's vector, and surface the closest matches.
The recipe
Map each word/document to a vector — Word2Vec, GloVe, or a sentence-transformer.
Encode the query the exact same way so it lives in the same space.
Compute cosine similarity between the query and every candidate.
Sort by score, highest first, and return the top few matches.
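The recipe above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the three-dimensional vectors are made up by hand rather than produced by Word2Vec or GloVe, and `cosine_similarity` and `nearest_neighbours` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def nearest_neighbours(query_vec, candidates, k=3):
    """Rank candidates (name -> vector) by similarity to query_vec, best first."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up toy vectors for illustration only; real ones come from a trained model.
toy_vectors = {
    "happy":  [0.90, 0.80, 0.10],
    "joyful": [0.85, 0.75, 0.15],
    "glad":   [0.80, 0.90, 0.20],
    "sad":    [-0.80, -0.70, 0.10],
    "table":  [0.10, -0.20, 0.90],
}

# The query is looked up in the same table, so it lives in the same space.
query = toy_vectors["happy"]
others = {word: vec for word, vec in toy_vectors.items() if word != "happy"}

top = nearest_neighbours(query, others, k=2)
for word, score in top:
    print(f"{word}: {score:.3f}")
```

With these toy values the two closest words to "happy" are "joyful" and "glad", while "sad" gets a negative score because its vector points the opposite way.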
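Scoring a query against millions of vectors one by one is slow, because the cost grows linearly with the size of the collection. Approximate nearest-neighbour (ANN) indexes, such as HNSW graphs or the indexes in the FAISS library, trade a little accuracy for speed. They are the heart of a vector database and typically bring lookups down to milliseconds.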
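To make the idea of "approximate" concrete, here is a sketch of random-hyperplane hashing, a simpler ANN technique than HNSW and not how FAISS works internally. Vectors that fall on the same side of every random plane share a bucket, and a query is scored only against its own bucket. The data is random stand-in vectors, and all function names are hypothetical.

```python
import math
import random

def make_hyperplanes(n_planes, dim, seed=0):
    """Draw random hyperplanes through the origin (one normal vector each)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

def signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        sum(h * x for h, x in zip(plane, vec)) >= 0.0 for plane in hyperplanes
    )

def build_index(vectors, hyperplanes):
    """Group vector ids into buckets keyed by their bit signature."""
    buckets = {}
    for key, vec in vectors.items():
        buckets.setdefault(signature(vec, hyperplanes), []).append(key)
    return buckets

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, vectors, buckets, hyperplanes, k=3):
    """Score only the vectors that share the query's bucket."""
    shortlist = buckets.get(signature(query_vec, hyperplanes), [])
    ranked = sorted(shortlist,
                    key=lambda key: cosine(query_vec, vectors[key]),
                    reverse=True)
    return ranked[:k], len(shortlist)

# Random stand-in data: 1,000 vectors of dimension 16 (not real embeddings).
rng = random.Random(1)
vectors = {f"v{i}": [rng.gauss(0.0, 1.0) for _ in range(16)] for i in range(1000)}
hyperplanes = make_hyperplanes(n_planes=6, dim=16)
buckets = build_index(vectors, hyperplanes)

top, n_scored = search(vectors["v42"], vectors, buckets, hyperplanes)
print(top[0], "found after scoring", n_scored, "of", len(vectors), "vectors")
```

The speed-up comes from skipping most of the collection, and that is also where the approximation comes from: a true neighbour that lands in a different bucket is missed. Real indexes reduce that risk with multiple hash tables or graph structures.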
Words vs documents
Words:
- Use word embeddings directly
- "happy" → joyful, glad, cheerful
- Great for synonym & analogy tasks
Documents:
- Average word vectors, or use a sentence embedding
- Powers search, dedup, recommendation
- TF-IDF + cosine also works as a baseline
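The TF-IDF baseline needs no trained model, so it is easy to try by hand. The sketch below makes several simplifying assumptions: whitespace tokenisation, a smoothed IDF formula (one common variant among several), and hypothetical helper names `tfidf_vectors` and `cosine_sparse`. The three documents are invented examples.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn each document into a sparse {term: weight} TF-IDF vector."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency (an assumed variant).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0
           for term, count in df.items()}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append({term: (freq / len(tokens)) * idf[term]
                        for term, freq in tf.items()})
    return vectors

def cosine_sparse(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = [
    "the cat sat on the mat",
    "a cat lay on the rug",
    "stock markets fell sharply today",
]
vectors = tfidf_vectors(docs)

# Use the first document as the query and rank the rest against it.
query_vec = vectors[0]
ranked = sorted(range(1, len(docs)),
                key=lambda i: cosine_sparse(query_vec, vectors[i]),
                reverse=True)
for i in ranked:
    print(f"{cosine_sparse(query_vec, vectors[i]):.3f}  {docs[i]}")
```

The second document ranks first because it shares terms with the query, while the third shares none and scores zero. That is also the baseline's limit: it only sees exact word overlap, so "rug" and "mat" count as unrelated, which is the gap sentence embeddings are meant to close.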