May 16, 2026 · 5 min read · RAG · engineering

Multi-modal RAG — chatting with the images inside your PDFs

Most RAG systems silently ignore images embedded in PDFs. Charts, scanned forms, diagrams — all invisible. Here's why that matters and how to fix it.

Open any real-world PDF and you'll see images: chart screenshots, scanned forms, photo evidence, engineering diagrams, hand-drawn sketches in a notebook. A text-only RAG system silently ignores all of it. The user uploads a 50-page report with 30 charts; the assistant only knows what was in the text body.

Multi-modal RAG fixes this by treating images as first-class retrieval targets.

What "multi-modal" actually means

Three flavours, in increasing sophistication:

1. OCR on images. Run OCR over each embedded image; the extracted text gets indexed alongside the body text. Works for scanned forms, screenshots of text. Misses chart visuals, diagrams.
2. Image captioning. Use a vision model to generate a text description of each image ("Bar chart showing Q1 revenue across regions, with APAC leading at $4.2M"). Index the caption. Works for charts and diagrams; quality depends on the vision model.
3. True multi-modal retrieval. Embed images and text in a shared vector space (CLIP-style). At query time, retrieve across both modalities by similarity. The most capable; the most expensive.

SeekFiles AI does (1) and (2) by default, with (3) for assistants where it's enabled.
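To make the third flavour concrete, here is a minimal sketch of retrieval over a shared vector space. It assumes the vectors already exist: a real system would get them from a CLIP-style model, while the three-dimensional vectors, the `Item` type, and the `search` helper below are made up for illustration and are not SeekFiles code.

```python
import math
from dataclasses import dataclass


@dataclass
class Item:
    kind: str            # "text" or "image"
    ref: str             # hypothetical locator, e.g. "p14:img1"
    vector: list[float]  # embedding in the shared space (toy values here)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vec: list[float], items: list[Item], k: int = 2) -> list[Item]:
    """Rank text and image items together by similarity to the query."""
    ranked = sorted(items, key=lambda it: cosine(query_vec, it.vector), reverse=True)
    return ranked[:k]


# Toy index mixing both modalities. In practice the vectors come from a model.
INDEX = [
    Item("text", "p3:para2", [0.9, 0.1, 0.0]),
    Item("image", "p14:img1", [0.1, 0.9, 0.2]),
    Item("text", "p20:para5", [0.0, 0.2, 0.9]),
]

top = search([0.2, 0.8, 0.1], INDEX, k=2)
print([(item.kind, item.ref) for item in top])
```

The point of the sketch is that nothing in `search` cares whether an item is text or an image: once both live in one space, a single ranking covers both.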

When multi-modal retrieval matters

Engineering and architecture. Spec sheets, diagrams, exploded views. Text-only misses half the document.
Financial reports. Charts are often the most important content; a system that skips them misses the core of the report.
Medical records. Scanned forms, imaging captions, lab strips.
Education. Diagrams in biology, chemistry, physics textbooks.
Legal evidence. Photographs as evidence in case files.

For pure-prose documents (contracts, essays, articles), multi-modal adds nothing.

How it works end-to-end at SeekFiles

On upload, we extract embedded images from each PDF page.
For each image: run OCR (catches text in screenshots) + run a vision model to generate a description (catches visual content).
The OCR + description text becomes a "chunk" associated with that image, with the page number and image-on-page coordinates.
At retrieval time, vector + keyword search ranges over both prose chunks and image-derived chunks.
When an image-chunk is cited, the UI shows both the description and the original image so the user can verify.
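The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the SeekFiles implementation: the OCR engine and the vision model are passed in as plain callables (`ocr` and `caption`, both hypothetical), image extraction is assumed to have happened already, and retrieval is reduced to a toy keyword-overlap score with the vector side left out.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

BBox = tuple[float, float, float, float]  # x0, y0, x1, y1 on the page


@dataclass(frozen=True)
class ExtractedImage:
    page: int
    bbox: BBox
    data: bytes


@dataclass(frozen=True)
class Chunk:
    text: str
    page: int
    source: str                 # "prose" or "image"
    bbox: Optional[BBox] = None  # only set for image-derived chunks


def image_to_chunk(
    img: ExtractedImage,
    ocr: Callable[[bytes], str],
    caption: Callable[[bytes], str],
) -> Chunk:
    """Combine the caption and OCR text into one chunk tied to the image location."""
    description = caption(img.data).strip()
    ocr_text = ocr(img.data).strip()
    parts = []
    if description:
        parts.append(f"Description: {description}")
    if ocr_text:
        parts.append(f"Text in image: {ocr_text}")
    return Chunk(text="\n".join(parts), page=img.page, source="image", bbox=img.bbox)


def tokens(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def keyword_search(query: str, chunks: list[Chunk], k: int = 3) -> list[Chunk]:
    """Toy keyword retrieval: rank prose and image chunks by shared terms."""
    terms = tokens(query)
    scored = [(len(terms & tokens(c.text)), c) for c in chunks]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


# Stand-ins for the real OCR engine and vision model.
prose = Chunk("Revenue grew in the first quarter.", page=13, source="prose")
image = ExtractedImage(page=14, bbox=(72.0, 200.0, 540.0, 480.0), data=b"fake-bytes")
image_chunk = image_to_chunk(
    image,
    ocr=lambda _: "Q1 revenue by region",
    caption=lambda _: "Bar chart with APAC leading",
)

hits = keyword_search("which region is leading in the bar chart", [prose, image_chunk])
print(hits[0].source, hits[0].page, hits[0].bbox)
```

Because the image chunk keeps its page and bounding box, the UI has what it needs to show the original image next to the description when that chunk is cited.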

Gotchas

Vision model cost. Captioning every image gets expensive on large libraries. We batch + cache aggressively.
Caption quality varies. Charts caption well; abstract diagrams caption poorly. We surface the image directly so the user can read it themselves.
OCR quality on low-resolution PDFs. Garbage in, garbage out. High-DPI PDFs are dramatically better.
Chart math. Models often misread chart numbers. Don't trust a captioned number; verify by looking at the chart.
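On the cost point, one common way to cache captions is to key them by a hash of the image bytes, so a logo or chart repeated across documents is only captioned once. The sketch below assumes an in-memory dictionary and a hypothetical `captioner` callable; it shows the general idea, not how SeekFiles stores its cache.

```python
import hashlib
from typing import Callable


class CaptionCache:
    """Caches captions by SHA-256 of the image bytes (in memory, for illustration)."""

    def __init__(self, captioner: Callable[[bytes], str]) -> None:
        self._captioner = captioner
        self._store: dict[str, str] = {}
        self.misses = 0  # number of times the vision model was actually called

    def caption(self, image_bytes: bytes) -> str:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._captioner(image_bytes)
        return self._store[key]


# Stand-in for an expensive vision model call.
def fake_captioner(image_bytes: bytes) -> str:
    return f"image of {len(image_bytes)} bytes"


cache = CaptionCache(fake_captioner)
first = cache.caption(b"same-logo")
second = cache.caption(b"same-logo")      # served from the cache
third = cache.caption(b"different-chart")
print(cache.misses)
```

A production version would persist the store and would probably include the model name in the key, so captions are regenerated when the vision model changes.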

What you should expect from a multi-modal RAG system

Citations that include the page and the image region.
Reasonable answers to "what does the chart on page 14 show?"
Honest refusal when an image is too low-resolution to read.
The original image surfaced in the citation, not just a paraphrase.
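As a rough picture of what such a citation could carry, here is a small sketch. The field names, the `MIN_SIDE_PX` threshold, and the pixel-size check are all assumptions for illustration; a real system might judge readability by OCR confidence or some other signal.

```python
from dataclasses import dataclass
from typing import Optional

BBox = tuple[float, float, float, float]  # x0, y0, x1, y1 on the page

MIN_SIDE_PX = 200  # hypothetical cut-off below which an image counts as unreadable


@dataclass(frozen=True)
class ImageCitation:
    page: int
    bbox: BBox        # the image region on the page
    description: str  # caption shown next to the image
    image_uri: str    # link to the original image so the user can verify


def cite_or_refuse(
    page: int,
    bbox: BBox,
    width_px: int,
    height_px: int,
    description: str,
    image_uri: str,
) -> Optional[ImageCitation]:
    """Return a citation, or None when the image is too small to read reliably."""
    if min(width_px, height_px) < MIN_SIDE_PX:
        return None
    return ImageCitation(page=page, bbox=bbox, description=description, image_uri=image_uri)


ok = cite_or_refuse(14, (72.0, 200.0, 540.0, 480.0), 1200, 800,
                    "Bar chart of Q1 revenue by region", "images/p14-1.png")
too_small = cite_or_refuse(9, (10.0, 10.0, 60.0, 60.0), 120, 90,
                           "Unreadable thumbnail", "images/p9-3.png")
print(ok is not None, too_small is None)
```

Returning `None` instead of a low-quality caption gives the calling code a clear signal to tell the user the image could not be read.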

What's coming

True multi-modal embedding models (text + image in one vector space) are improving fast. We expect to roll out per-assistant true-multi-modal retrieval in 2026 — making "what page has the bar chart with APAC at the top?" a first-class question. For now, OCR + captioning covers ~90% of real-world cases.

If you have visual-heavy documents and are using a text-only AI tool, switch. The blind spot is bigger than you think.
