Embeddings vs keyword search — when to use which
Vector embeddings are the new search hotness. Old-school keyword search isn't dead. Here's when each one wins.
The default modern advice is "use embeddings for search." It's right, mostly. But classic keyword search (BM25 / TF-IDF) still beats embeddings in a handful of cases, and the best systems combine both. Here's the honest breakdown.
What embeddings do
Turn a text chunk into a vector that represents its meaning, typically a few hundred to a few thousand numbers (1,536 is a common size). Search by finding the chunks whose vectors are closest to the query's vector, usually measured by cosine similarity.
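To make "closest vector" concrete, here is a minimal sketch of nearest-neighbour lookup by cosine similarity. The three-dimensional vectors are hand-made stand-ins, not output from a real embedding model, and the function names are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_chunks(query_vec, chunk_vecs, k=2):
    """Return the k chunk ids whose vectors are most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]

# Toy vectors (assumed for illustration; real embeddings have far more dimensions).
chunk_vecs = {
    "termination compensation": [0.9, 0.1, 0.0],
    "remote work policy": [0.1, 0.9, 0.1],
    "parking rules": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.0]  # pretend this is the embedding of "severance pay"

top = nearest_chunks(query_vec, chunk_vecs, k=2)
print(top)
```

The query shares no words with "termination compensation," yet it ranks first because the vectors point in nearly the same direction. A real system would swap the brute-force loop for an approximate index such as HNSW.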
Strengths:
- Semantic matches. "Severance pay" matches "termination compensation" without sharing words.
- Synonym tolerance. No need for the user's wording to match the document's wording.
- Concept-level retrieval. "Risk clauses" finds clauses without the word "risk" if they're about risk.
Weaknesses:
- Literal phrase blindness. "What did Acme say about Section 5.4.2?" — embeddings often miss the literal "5.4.2" reference.
- Rare term struggles. Proper nouns, codes, IDs, dates often don't embed well — they're too specific.
- Expensive at indexing. Every chunk has to be run through an embedding model before it can be searched, and that costs compute.
What keyword search does
Index every word. Search by finding chunks containing the query terms (with some scoring like BM25 for relevance).
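As a rough illustration of BM25-style scoring, the sketch below ranks three made-up chunks with a textbook BM25 formula. The whitespace tokenizer, the sample chunks, and the parameter values k1=1.5 and b=0.75 are assumptions chosen for the example; production engines tokenize and tune differently.

```python
import math
from collections import Counter

def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every doc against the query with a textbook BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many docs does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # a missing term contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "employees receive termination compensation after two years",
    "severance is paid within thirty days",
    "section 5.4.2 covers gross negligence",
]

scores_exact = bm25_scores("section 5.4.2", docs)
scores_mismatch = bm25_scores("severance", docs)
print(scores_exact)
print(scores_mismatch)
```

The "section 5.4.2" query scores only the third chunk, which is the literal precision keyword search is good at. The "severance" query scores zero against the "termination compensation" chunk, which is the vocabulary mismatch problem in miniature.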
Strengths:
- Literal precision. "Section 5.4.2," "G.R. No. 12345," "patient ID 4471" — these win on keyword.
- Rare term recall. Unusual proper nouns are found if they're in the document.
- Cheap and fast. Postgres tsvector or any inverted index.
Weaknesses:
- No synonym tolerance. "Severance" misses "termination compensation."
- Vocabulary mismatch failure mode. If the user's words and the document's words differ, recall plummets.
- Bag-of-words. Ignores meaning and word order; ranks by term statistics (how often a term appears and how rare it is).
When embeddings win
- Conceptual questions: "What's the gist of Chapter 3?"
- Synonym-heavy domains: "What's the policy on remote work?" (might be labelled "work from anywhere" or "telecommute").
- Cross-language: query in English, document in another language (with multilingual embeddings).
When keyword search wins
- Exact-phrase lookup: "Where does it say 'gross negligence'?"
- ID / code lookup: "Find G.R. No. 12345" / "Find patient ID 4471."
- Recent / dated references: "Find the 2024 update."
- Acronyms and unusual terminology.
Why hybrid is the answer
In production, you almost always run both, then combine.
- Run vector search → top 20 candidates.
- Run keyword search → top 20 candidates.
- Merge, deduplicate, re-rank.
- Send top 5–10 to the LLM.
This catches both the conceptual matches embeddings find and the exact-phrase matches they miss. Treat it as the baseline for production RAG (retrieval-augmented generation).
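One common way to do the merge-and-deduplicate step is reciprocal rank fusion (RRF), sketched below. This is an illustrative choice, not necessarily how any particular product merges results. The chunk ids are made up, k=60 is just the conventional default, and a real pipeline would typically follow this with a separate re-ranker model.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Merge several ranked lists of chunk ids into one.

    Each chunk earns 1 / (k + rank) from every list it appears in,
    so chunks found by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    ordered = sorted(scores, key=lambda c: (-scores[c], c))
    return ordered[:top_n]

# Hypothetical candidate lists, best match first.
vector_hits = ["c7", "c2", "c9", "c4"]
keyword_hits = ["c2", "c5", "c7", "c1"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits], top_n=3)
print(fused)
```

RRF uses only rank positions, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales. Deduplication falls out for free because each chunk id is a single dictionary key.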
What SeekFiles AI does
SeekFiles AI runs hybrid retrieval: pgvector (HNSW index) for vectors, Postgres full-text search (tsvector with BM25-equivalent scoring) for keywords, then a re-ranker pass. Recall metrics are published in the docs.
When to use only one
- Embeddings only: for chat-style Q&A on conceptually clean documents (books, articles, reviewers).
- Keyword only: for legal docket search, scientific paper search by author + year, codebases.
For everything else, hybrid. The cost of running both is marginal; the recall gain is significant.