May 17, 2026 · 4 min read · RAG, engineering

Embeddings vs keyword search — when to use which

Vector embeddings are the new search hotness. Old-school keyword search isn't dead. Here's when each one wins.

The default modern advice is "use embeddings for search." It's right, mostly. But classic keyword search (BM25 / TF-IDF) still beats embeddings in a handful of cases, and the best systems combine both. Here's the honest breakdown.

What embeddings do

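Turn a text chunk into a vector of numbers that represents its meaning (roughly 1,500 dimensions for common models; the exact size depends on the model). Search by finding the chunks whose vectors are closest to the query's vector, usually measured by cosine similarity.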
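To make "closest vector" concrete, here is a minimal sketch of nearest-neighbour search by cosine similarity. The three-dimensional vectors are hand-made stand-ins for real model output, and the names `cosine_similarity`, `vector_search`, and the chunk ids are hypothetical, chosen only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def vector_search(query_vec, chunk_vecs, k=3):
    """Return the ids of the k chunks whose vectors are closest to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]

# Toy vectors (assumption): a real embedding model would produce these.
chunk_vecs = {
    "severance_clause": [0.9, 0.1, 0.0],
    "parking_policy": [0.0, 0.2, 0.9],
    "termination_compensation": [0.8, 0.3, 0.1],
}
query_vec = [0.85, 0.2, 0.05]  # pretend embedding of "severance pay"

top = vector_search(query_vec, chunk_vecs, k=2)
```

With these toy numbers, the two compensation-related chunks score far above the parking chunk even though the query shares no words with one of them. A production system would replace the linear scan with an approximate index, but the scoring idea is the same.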

Strengths:

  • Semantic matches. "Severance pay" matches "termination compensation" without sharing words.
  • Synonym tolerance. No need for the user's wording to match the document's wording.
  • Concept-level retrieval. "Risk clauses" finds clauses without the word "risk" if they're about risk.

Weaknesses:

  • Literal phrase blindness. "What did Acme say about Section 5.4.2?" — embeddings often miss the literal "5.4.2" reference.
  • Rare term struggles. Proper nouns, codes, IDs, and dates often don't embed well: the model saw them rarely or never in training, so their vectors carry little distinguishing signal.
  • Expensive at indexing. Embeddings cost compute to generate.

What keyword search does

Index every word. Search by finding chunks containing the query terms (with some scoring like BM25 for relevance).
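As a rough illustration of that description, the sketch below builds a tiny inverted index and ranks chunks with the standard BM25 formula. The tokenizer, the sample chunks, and the parameter defaults (`k1=1.5`, `b=0.75`) are assumptions for the example, not settings taken from any particular engine.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase, keep letters/digits/dots together, drop trailing dots."""
    raw = re.findall(r"[a-z0-9.]+", text.lower())
    return [t.strip(".") for t in raw if t.strip(".")]

def build_index(chunks):
    """Map each term to {chunk_id: term frequency}; also record chunk lengths."""
    postings = defaultdict(dict)
    lengths = {}
    for chunk_id, text in chunks.items():
        tokens = tokenize(text)
        lengths[chunk_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            postings[term][chunk_id] = tf
    return postings, lengths

def bm25_search(query, postings, lengths, k=3, k1=1.5, b=0.75):
    """Rank chunks that contain at least one query term, best first."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in tokenize(query):
        docs = postings.get(term, {})
        if not docs:
            continue  # term appears nowhere, so it contributes nothing
        idf = math.log(1 + (n_docs - len(docs) + 0.5) / (len(docs) + 0.5))
        for chunk_id, tf in docs.items():
            norm = tf + k1 * (1 - b + b * lengths[chunk_id] / avg_len)
            scores[chunk_id] += idf * tf * (k1 + 1) / norm
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

# Sample chunks (assumption, for illustration only).
chunks = {
    "c1": "Section 5.4.2 covers termination compensation for employees.",
    "c2": "Severance is paid within thirty days.",
    "c3": "Parking is available in Section 2.",
}
postings, lengths = build_index(chunks)

hits_literal = bm25_search("Section 5.4.2", postings, lengths)
hits_synonym = bm25_search("severance pay", postings, lengths)
```

The literal query puts `c1` first because it matches the rare token `5.4.2`. The query "severance pay" only finds `c2`: the chunk about "termination compensation" shares no terms with it, so it never enters the ranking.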

Strengths:

  • Literal precision. "Section 5.4.2," "G.R. No. 12345," "patient ID 4471" — these win on keyword.
  • Rare term recall. Unusual proper nouns are found if they're in the document.
  • Cheap and fast. Postgres tsvector or any inverted index.

Weaknesses:

  • No synonym tolerance. "Severance" misses "termination compensation."
  • Vocabulary mismatch failure mode. If the user's words and the document's words differ, recall plummets.
  • Bag-of-words. Ignores meaning and word order; ranks by term statistics (how often a term appears in the chunk and how rare it is across the collection).

When embeddings win

  • Conceptual questions: "What's the gist of Chapter 3?"
  • Synonym-heavy domains: "What's the policy on remote work?" (might be labelled "work from anywhere" or "telecommute").
  • Cross-language: query in English, document in another language (with multilingual embeddings).

When keyword search wins

  • Exact-phrase lookup: "Where does it say 'gross negligence'?"
  • ID / code lookup: "Find G.R. No. 12345" / "Find patient ID 4471."
  • Recent / dated references: "Find the 2024 update."
  • Acronyms and unusual terminology.

Why hybrid is the answer

In production, you almost always run both, then combine.

  • Run vector search → top 20 candidates.
  • Run keyword search → top 20 candidates.
  • Merge, deduplicate, re-rank.
  • Send top 5–10 to the LLM.

This catches both the conceptual matches embeddings find AND the exact-phrase matches embeddings miss. It's the floor for production RAG.
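The merge step can be done in several ways, and the steps above don't prescribe one. A common choice is reciprocal rank fusion (RRF), sketched here as one option: each chunk earns `1 / (k + rank)` from every list it appears in, so chunks found by both retrievers rise to the top and duplicates collapse automatically. The function name, the chunk ids, and `k=60` (a conventional default) are assumptions for illustration. A dedicated re-ranker model could still run on the fused list afterwards.

```python
def reciprocal_rank_fusion(ranked_lists, k=60, top_n=5):
    """Fuse several best-first id lists into one, deduplicated by id."""
    scores = {}
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            # Each appearance adds a contribution that shrinks with rank.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=lambda cid: scores[cid], reverse=True)
    return fused[:top_n]

# Hypothetical outputs of the two retrievers, best match first.
vector_hits = ["c7", "c2", "c9", "c4"]
keyword_hits = ["c2", "c5", "c7"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits], top_n=3)
```

Here `c2` and `c7` appear in both lists and end up ahead of every chunk that only one retriever found. RRF needs only ranks, not raw scores, which helps because cosine similarities and BM25 scores are not on a comparable scale.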

What SeekFiles AI does

Hybrid retrieval over pgvector (HNSW index for vectors) + Postgres full-text search (tsvector + BM25-equivalent scoring) + a re-ranker pass. Recall metrics are published in the docs.

When to use only one

  • Embeddings only: for chat-style Q&A on conceptually clean documents (books, articles, reviewers).
  • Keyword only: for legal docket search, scientific paper search by author + year, codebases.

For everything else, hybrid. The cost of running both is marginal; the recall gain is significant.
