Mini Project: Finding Similar Words & Documents

Tags: NLP, project, embeddings, semantic search

The payoff of embeddings

Now combine the pieces: represent everything as vectors, then rank by cosine similarity to find the nearest neighbours of any query. That's semantic search — and it works on words and whole documents.

It's the same engine behind "related articles", "customers also bought", and the retrieval step in RAG.
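
For reference, the cosine similarity used for ranking can be written out as below. This is the standard definition; the symbols a and b (two vectors of equal length) and the small worked example are illustrative and not taken from the project itself.

```latex
\[
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \, \sqrt{\sum_i b_i^2}}
\]

% Worked example with a = (1, 2) and b = (2, 1):
\[
\cos(\mathbf{a}, \mathbf{b})
  = \frac{1 \cdot 2 + 2 \cdot 1}{\sqrt{1^2 + 2^2} \, \sqrt{2^2 + 1^2}}
  = \frac{4}{\sqrt{5} \, \sqrt{5}}
  = 0.8
\]
```

The score depends only on the angle between the vectors, not on their lengths: 1 means same direction, 0 means orthogonal, and -1 means opposite.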

Watch a query find its neighbours

Pick the word "happy", score it against every other word's vector, and surface the closest matches.

The recipe

1. Embed everything (vectors)

Map each word or document to a vector, using a model such as Word2Vec or GloVe for words, or a sentence-transformer for sentences and documents.

2. Embed the query (same space)

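Encode the query with the same model and preprocessing as the candidates so it lives in the same vector space. Scores between vectors produced by different models are not meaningful.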

3. Score (cosine to all)

Compute cosine similarity between the query and every candidate.

4. Rank & return (top-k nearest)

Sort by score, highest first, and return the top k matches.
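
The four steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the 3-dimensional vectors in TOY_VECTORS are made up by hand so the example runs without downloading a model, and the names cosine and nearest are hypothetical helpers. With real embeddings, only the dictionary of vectors would change.

```python
import math

# Made-up 3-d vectors for illustration only; real embeddings would come
# from a trained model such as Word2Vec, GloVe, or a sentence-transformer.
TOY_VECTORS = {
    "happy":    [0.90, 0.80, 0.10],
    "joyful":   [0.85, 0.75, 0.15],
    "glad":     [0.80, 0.85, 0.05],
    "cheerful": [0.88, 0.70, 0.20],
    "sad":      [-0.80, -0.70, 0.10],
    "table":    [0.05, 0.10, 0.95],
}


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=3):
    """Return the k entries most similar to `query`, excluding the query itself."""
    q = vectors[query]  # step 2: the query is looked up in the same space
    scored = [
        (word, cosine(q, vec))  # step 3: score against every candidate
        for word, vec in vectors.items()
        if word != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # step 4: rank
    return scored[:k]


for word, score in nearest("happy", TOY_VECTORS):
    print(f"{word}: {score:.3f}")
```

With these toy vectors the three nearest neighbours of "happy" are joyful, glad, and cheerful, while "sad" gets a negative score because its vector points the opposite way.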

At scale

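Scoring a query against millions of vectors one by one is slow. Approximate nearest-neighbour (ANN) indexes, such as the HNSW graph index or the indexes in the FAISS library, are the heart of a vector database: they trade a little recall for speed and typically answer in milliseconds.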
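
To make the idea of an approximate index concrete, here is a toy sketch in plain Python. It uses random-hyperplane locality-sensitive hashing (LSH), a deliberately simpler technique than HNSW, and it does not use the FAISS API. The class name HyperplaneLSH, the single hash table, and the random test data are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class HyperplaneLSH:
    """Toy approximate index: vectors with the same sign pattern share a bucket."""

    def __init__(self, dim, n_planes=8, seed=0):
        rng = random.Random(seed)
        # Each random hyperplane contributes one bit to the bucket key.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]
        self.buckets = defaultdict(list)
        self.vectors = {}

    def _key(self, vec):
        # One bit per plane: which side of the hyperplane the vector falls on.
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0.0 for plane in self.planes)

    def add(self, item_id, vec):
        self.vectors[item_id] = vec
        self.buckets[self._key(vec)].append(item_id)

    def query(self, vec, k=5):
        # Only the query's own bucket is scored, not the whole collection.
        # True neighbours that landed in another bucket are missed: that is
        # the "approximate" part of the trade-off.
        candidates = self.buckets.get(self._key(vec), [])
        scored = [(i, cosine(vec, self.vectors[i])) for i in candidates]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


rng = random.Random(1)
index = HyperplaneLSH(dim=16)
for i in range(2000):
    index.add(i, [rng.gauss(0.0, 1.0) for _ in range(16)])

probe = index.vectors[42]
hits = index.query(probe, k=3)
bucket_size = len(index.buckets[index._key(probe)])
print(f"best match: {hits[0][0]}, scored {bucket_size} of {len(index.vectors)} vectors")
```

Production indexes improve on this sketch in several ways, for example by using multiple hash tables or a graph structure such as HNSW to recover neighbours that a single bucket would miss.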

Words vs documents

Similar words
  • Use word embeddings directly
  • "happy" → joyful, glad, cheerful
  • Great for synonym & analogy tasks
Similar documents
  • Average word vectors, or use a sentence embedding
  • Powers search, dedup, recommendation
  • TF-IDF + cosine also works as a baseline
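
The TF-IDF baseline for documents can be sketched without any libraries. The three toy documents, the whitespace tokenizer, and the helper names tfidf_vectors and sparse_cosine are assumptions for illustration; the IDF formula is one common smoothed variant, and a real project would more likely use an existing implementation.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
DOCS = {
    "d1": "the cat sat on the mat",
    "d2": "a cat lay on a rug",
    "d3": "stock markets fell sharply today",
}


def tfidf_vectors(docs):
    """Return {doc_id: {term: weight}} using raw term frequency times smoothed IDF."""
    tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokens)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for words in tokens.values() for term in set(words))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    return {
        doc_id: {term: count * idf[term] for term, count in Counter(words).items()}
        for doc_id, words in tokens.items()
    }


def sparse_cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = tfidf_vectors(DOCS)
ranked = sorted(
    ((other, sparse_cosine(vectors["d1"], vec)) for other, vec in vectors.items() if other != "d1"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Here d2 ranks above d3 for the query document d1 because they share the tokens "cat" and "on". The sketch also shows the baseline's limit: "mat" and "rug" earn no credit for being related, which is the gap that averaged word vectors or a sentence embedding are meant to close.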