May 18, 2026 · 5 min read · RAG, engineering

RAG explained simply — 2026 edition

Retrieval-augmented generation, without the jargon. Why it matters, how it actually works, and what makes one RAG system better than another.

RAG — retrieval-augmented generation — is the technique behind every AI tool that "knows" your documents. It's also the technique most marketing materials describe wrong. Here's the honest explanation.

The problem RAG solves

Large language models (LLMs) such as GPT-4, Claude, and Gemini are trained on a snapshot of the internet plus public datasets. They don't know:

Your specific files.
Anything that happened after their training cutoff.
Your company's internal documents.
The contract you uploaded an hour ago.

If you ask an LLM about your contract, it can't answer — it never saw the contract. RAG fixes this by retrieving relevant pieces of your documents and adding them to the prompt the LLM sees, just-in-time.

The minimum viable RAG pipeline

Ingest: Take a document, split it into chunks (say, 500 words each).
Embed: Turn each chunk into a vector (a long list of numbers that captures meaning).
Store: Save the vectors in a database optimised for vector search.
Retrieve: When a user asks a question, embed the question into a vector, and find the most-similar chunks in the database.
Augment: Stuff those chunks into the prompt sent to the LLM, along with the user's question.
Generate: The LLM answers based on the retrieved chunks.

That's the whole thing. The model "knows" your documents because the relevant pieces are now part of its prompt.
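
To make the six steps concrete, here is a minimal sketch in Python. It is a toy under stated assumptions, not a production design: `embed` is a stand-in that counts words instead of calling a real embedding model, the "database" is a plain list, and the generate step is left out because it depends on whichever LLM API you use. Every function name here is made up for illustration.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, size: int = 500) -> list[str]:
    """Ingest: split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Embed: toy stand-in that counts lowercase word tokens.

    Assumption: a real system would call an embedding model here.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Store: keep (chunk, vector) pairs in a plain list."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Retrieve: return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Augment: put the retrieved chunks and the question into one prompt."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Tiny demo: three ten-word sentences, chunked ten words at a time.
doc = (
    "The lease term is 24 months starting in January 2026. "
    "Rent is due on the first day of each month. "
    "The tenant may terminate with 60 days of written notice."
)
index = build_index(chunk_words(doc, size=10))
question = "When is rent due?"
prompt = build_prompt(question, retrieve(index, question, k=1))
print(prompt)  # this string is what you would send to the LLM (Generate)
```

Swapping the toy `embed` for a real embedding model and the list for a vector database changes the quality, not the shape, of the pipeline.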

What makes one RAG system better than another

The pipeline above is the floor. Real-world systems add:

Hybrid retrieval: combine vector search (semantic) with keyword search (literal). Keyword search catches exact terms such as names, IDs, and clause numbers that embeddings can blur.
Re-ranking: score retrieved chunks against the question and drop the weak ones.
Citation grounding: return the source chunks alongside the answer so the user can verify it.
Refusal behaviour: instruct the model to decline when the retrieved chunks don't actually answer the question.
Chunking strategy: how you split documents matters a lot. 500-word chunks with a 50-word overlap are a common starting point.
Embedding model: options include OpenAI's text-embedding-3-large and models from Voyage and Cohere; the choice has a real effect on recall.
Multi-modal retrieval: handling images embedded in PDFs (charts, scans).
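
Hybrid retrieval needs some way to merge two ranked lists into one. One widely used option is reciprocal rank fusion (RRF), sketched below. This is an illustration, not a claim about how any particular product does it: the chunk IDs are invented, and `k=60` is just the conventional default for the smoothing constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one.

    Each list contributes 1 / (k + rank) to a chunk's score, so chunks
    that rank well in more than one list rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from the two retrievers for the same question.
vector_hits = ["c2", "c7", "c4"]
keyword_hits = ["c7", "c9", "c2"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

RRF only looks at ranks, not raw scores, which is convenient because vector similarities and keyword scores are not on the same scale.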

A "RAG product" without any of these extras is the toy version. A serious one has all of them.
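
Of the extras above, chunking is the easiest to show in code. The sketch below splits on whitespace with a sliding window, using the 500-word size and 50-word overlap mentioned in the list as defaults. Splitting on whitespace is an assumption made for brevity; real systems often split on sentences, headings, or tokens instead.

```python
def chunk_with_overlap(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into word chunks where each chunk repeats the last
    `overlap` words of the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks


# 1,200 numbered "words" make the window boundaries easy to inspect.
sample = " ".join(str(i) for i in range(1200))
chunks = chunk_with_overlap(sample)
print(len(chunks), [c.split()[0] for c in chunks])
```

The overlap exists so that a sentence cut by a chunk boundary still appears whole in at least one chunk.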

Why "stuffing the whole document into context" isn't RAG

Some products skip retrieval and just dump the whole document into the prompt. This is called "long-context stuffing." It works for short documents but:

Costs scale linearly with document length (you pay for tokens you don't need).
Quality drops past ~50k tokens (models lose precision in long context).
Doesn't scale to libraries: a collection of documents quickly outgrows any context window.
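
The cost point is simple arithmetic. The sketch below compares input-token cost per question for the two approaches. Every number in it is an assumption chosen for illustration (the price, the document size, the chunk size in tokens), and it ignores output tokens and the one-off cost of embedding the document.

```python
def prompt_cost_usd(prompt_tokens: int, usd_per_million_tokens: float) -> float:
    """Input cost of one prompt at a flat per-token price."""
    return prompt_tokens / 1_000_000 * usd_per_million_tokens


# All values below are hypothetical, not real pricing or measurements.
USD_PER_MILLION = 3.00   # assumed input price
DOC_TOKENS = 200_000     # assumed size of the whole document
CHUNK_TOKENS = 650       # assumed size of one ~500-word chunk
TOP_K = 5                # chunks retrieved per question
QUESTION_TOKENS = 50

stuffing_cost = prompt_cost_usd(DOC_TOKENS + QUESTION_TOKENS, USD_PER_MILLION)
rag_cost = prompt_cost_usd(TOP_K * CHUNK_TOKENS + QUESTION_TOKENS, USD_PER_MILLION)

print(f"stuffing: ${stuffing_cost:.4f} per question")
print(f"RAG:      ${rag_cost:.4f} per question")
```

Under these made-up numbers the stuffed prompt costs about 60 cents per question and the retrieval prompt about one cent. The ratio matters more than the absolute figures: the stuffed prompt grows with the document, while the retrieval prompt stays roughly constant.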

RAG is the answer to "how do we make AI work on libraries, not just files."

When RAG is overkill

Very short documents (one page, a screenshot).
Brainstorming where the document isn't the source of truth.
One-off chats where you don't care about citations.

For those, just upload to ChatGPT and move on. RAG is for libraries that you'll query repeatedly.

What's next in RAG

Agentic RAG — the model decides which retrieval to do next based on partial answers.
Multi-modal RAG — retrieval across text + images + tables in a unified index.
Long-context-augmented RAG — combining retrieval with 1M+ context for hybrid workflows.
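
As a rough illustration of the agentic idea, the loop below keeps retrieving until a decision function says there is enough evidence. It is a sketch under heavy assumptions: `search` and `next_query` are hypothetical callables you would supply, and in a real system `next_query` would be an LLM call rather than the hard-coded stub used in the demo.

```python
from typing import Callable, Optional


def agentic_retrieve(
    question: str,
    search: Callable[[str], list[str]],
    next_query: Callable[[str, list[str]], Optional[str]],
    max_steps: int = 3,
) -> list[str]:
    """Collect evidence over several retrieval rounds.

    `search` runs one retrieval. `next_query` stands in for the model
    deciding what to look up next; returning None means "stop".
    """
    evidence: list[str] = []
    query: Optional[str] = question
    for _ in range(max_steps):
        if query is None:
            break
        for chunk in search(query):
            if chunk not in evidence:
                evidence.append(chunk)
        query = next_query(question, evidence)
    return evidence


# Demo with stubs: a two-entry "index" and a hard-coded follow-up rule.
corpus = {
    "who signed the lease": ["Lease signed by Acme Ltd."],
    "Acme Ltd address": ["Acme Ltd is at 1 Main St."],
}


def stub_search(query: str) -> list[str]:
    return corpus.get(query, [])


def stub_next_query(question: str, evidence: list[str]) -> Optional[str]:
    # Pretend the model asks one follow-up after the first result, then stops.
    return "Acme Ltd address" if len(evidence) == 1 else None


found = agentic_retrieve("who signed the lease", stub_search, stub_next_query)
print(found)
```

The `max_steps` cap matters in practice: without it, a model that never decides it has enough evidence would loop forever.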

The category is moving fast. The fundamentals above stay the same.
