May 17, 2026 · 6 min read · RAG, engineering

RAG pipeline architecture — how SeekFiles AI is built

A transparent walk-through of how a production RAG system actually works. For engineers evaluating tools — or building their own.

Most "AI for documents" products are black boxes. We're not — here's the architecture of SeekFiles AI, layer by layer, so engineers evaluating us (or building competitors) know what's actually under the hood.

Stack at a glance

Backend: Laravel 13 / PHP 8.4
Vector store: pgvector on Postgres with HNSW index
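Embedding model: OpenAI text-embedding-3-large at 1536 dims (reduced from the model's native 3072, which also keeps vectors under pgvector's 2,000-dimension limit for HNSW indexes)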
LLM: OpenAI GPT-4o-class for chat, swappable via interface
Queue: Redis + Horizon
Mobile: Flutter (iOS + Android)
Web: Next.js for marketing, Flutter web for the app

Why these picks: pgvector keeps everything in one Postgres instance (operationally simpler than a separate vector DB at our scale), OpenAI embeddings gave us the best price-performance ratio, Laravel is fast to build and deploy, and Flutter ships native apps to both phones from one codebase.

Ingestion pipeline

Upload → Validate → Extract → Chunk → Embed → Index

Upload to S3-compatible storage (we use DigitalOcean Spaces).
Extract text. PDFs go through pdftotext + OCR (tesseract for scans). DOCX uses a native parser. Images get OCR'd directly.
Chunk. ~512-token chunks with 64-token overlap. Hierarchical splitting on natural boundaries (paragraphs, headings) before falling back to fixed-size.
Embed. Batch up to 100 chunks per OpenAI API call. Cache embeddings by chunk hash so re-uploads of the same file are nearly free.
Index. Insert into pgvector with HNSW index on the embedding column.
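
The chunking step above can be sketched roughly as follows. This is a minimal illustration in Python (the production code is PHP), and several things are assumptions: whitespace-separated words stand in for real model tokens, paragraphs are the only "natural boundary" handled, and overlap is applied only in the fixed-size fallback. The function names are hypothetical.

```python
import re


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def split_fixed(words: list[str], size: int, overlap: int) -> list[str]:
    """Fixed-size fallback: sliding windows of `size` words with `overlap` shared words."""
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def chunk_text(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Pack whole paragraphs into chunks of at most `size` tokens.

    Paragraphs that are too large on their own fall back to fixed windows.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = count_tokens(para)
        if n > size:
            # Flush what we have, then split the oversized paragraph.
            if current:
                chunks.append("\n\n".join(current))
                current, current_len = [], 0
            chunks.extend(split_fixed(para.split(), size, overlap))
            continue
        if current_len + n > size:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A real implementation would count tokens with the embedding model's tokenizer and would also treat headings as split points.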

All of this runs as queued jobs; uploads return immediately while indexing happens in the background.
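
As a rough sketch of the embed step (batches of up to 100, cache keyed by chunk hash), assuming SHA-256 as the hash and an in-memory dict as the cache. The `embed_batch` callable is a hypothetical stand-in for the real API call, and the code is Python for illustration only:

```python
import hashlib
from typing import Callable, Sequence

Vector = list[float]
EmbedFn = Callable[[Sequence[str]], list[Vector]]


def chunk_hash(text: str) -> str:
    # Assumption: SHA-256 of the UTF-8 chunk text is the cache key.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_with_cache(
    chunks: Sequence[str],
    embed_batch: EmbedFn,
    cache: dict[str, Vector],
    batch_size: int = 100,
) -> list[Vector]:
    """Return one vector per chunk, calling `embed_batch` only for uncached text."""
    # Collect unique, uncached chunks in first-seen order.
    missing: dict[str, str] = {}
    for text in chunks:
        key = chunk_hash(text)
        if key not in cache and key not in missing:
            missing[key] = text

    keys = list(missing)
    for i in range(0, len(keys), batch_size):
        batch_keys = keys[i:i + batch_size]
        vectors = embed_batch([missing[k] for k in batch_keys])
        for key, vector in zip(batch_keys, vectors):
            cache[key] = vector

    return [cache[chunk_hash(text)] for text in chunks]
```

With this shape, re-uploading an unchanged file produces zero embedding calls, because every chunk hash is already in the cache.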

Retrieval pipeline

Question → Embed → Vector search → Keyword search → Merge → Re-rank → LLM

Embed the question with the same model.
Vector search for top-K (default 30) chunks by cosine similarity.
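Keyword search (Postgres tsvector) for top-K chunks by full-text rank. This plays the role BM25 does in other stacks, though built-in Postgres ranking is not true BM25 (it uses no corpus-wide term statistics).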
Merge + dedupe the two result sets.
Re-rank. A cross-encoder scores each candidate against the question; the bottom of the list gets culled.
Send to LLM — final top 5–10 chunks plus the question, in a structured prompt that requires citations.
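
The merge and re-rank steps could look something like the sketch below. The post does not say how the two result sets are fused, so reciprocal rank fusion (RRF) is used here as one common choice, not as a description of the actual method. The `score` callable is a hypothetical stand-in for the cross-encoder, and the code is Python for illustration only.

```python
from typing import Callable


def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs; duplicates collapse into one summed score."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def rerank(
    question: str,
    candidates: list[str],
    texts: dict[str, str],
    score: Callable[[str, str], float],
    keep: int = 8,
) -> list[str]:
    """Score each candidate against the question and keep the top `keep`."""
    ranked = sorted(candidates, key=lambda cid: score(question, texts[cid]), reverse=True)
    return ranked[:keep]
```

A chunk that appears in both the vector and keyword results gets contributions from both lists, which is why it tends to rise to the top before re-ranking.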

Citation grounding

The LLM is instructed to:

Cite every claim with the chunk ID it came from.
Refuse if no chunk meaningfully answers the question.
Quote the literal text it used, not paraphrase.

The UI then maps each cited chunk ID to its source chunk, displaying the literal text and page number. The user can verify any claim by tapping the citation.
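
One way to enforce these rules after the model responds is to check each citation against the chunks that were actually sent. The post does not describe such a server-side check, so treat this as an assumption. The `Chunk` and `Citation` types are hypothetical, and the code is Python for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    page: int


@dataclass(frozen=True)
class Citation:
    chunk_id: str
    quote: str


def normalise_ws(text: str) -> str:
    # Collapse whitespace so line breaks in the source don't break matching.
    return " ".join(text.split())


def verify_citations(
    citations: list[Citation], retrieved: dict[str, Chunk]
) -> tuple[list[tuple[Citation, int]], list[Citation]]:
    """Split citations into (verified with page number, rejected)."""
    verified: list[tuple[Citation, int]] = []
    rejected: list[Citation] = []
    for citation in citations:
        chunk = retrieved.get(citation.chunk_id)
        quote = normalise_ws(citation.quote)
        if chunk is not None and quote and quote in normalise_ws(chunk.text):
            verified.append((citation, chunk.page))
        else:
            rejected.append(citation)
    return verified, rejected
```

A citation is rejected if it names a chunk that was never retrieved or if its quote is not literally present in that chunk.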

Scope control

Every Assistant has a scope (folder, file set, or none). At retrieval time, the vector + keyword searches filter by the scoped IDs. This is how "ask only my Civil Reviewer" works — the retrieval can't accidentally pull from "Tax Returns."
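
The semantics of scoping can be illustrated with a small in-memory filter. In a real system the restriction would sit inside the search queries themselves, as described above. The `Hit` type and `apply_scope` name are hypothetical, and the code is Python for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hit:
    chunk_id: str
    file_id: str
    score: float


def apply_scope(hits: list[Hit], allowed_file_ids: Optional[frozenset[str]]) -> list[Hit]:
    """Keep only hits from in-scope files; a scope of None means the whole library."""
    if allowed_file_ids is None:
        return list(hits)
    return [hit for hit in hits if hit.file_id in allowed_file_ids]
```

Filtering before ranking, not after, matters: a post-hoc filter on a top-30 list could leave too few in-scope results.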

Multi-modal retrieval

For PDFs with images (charts, scanned forms, diagrams):

Images get OCR'd; the text becomes part of the chunk.
For tables, we preserve structure (Markdown tables in the chunk text).
Embedded charts get described via vision (when enabled per assistant).
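
For the table case, a minimal sketch of rendering extracted rows as a Markdown table inside the chunk text. It assumes the extractor already yields a header and rows of cell strings; the function name is hypothetical, and the code is Python for illustration only.

```python
def rows_to_markdown(header: list[str], rows: list[list[str]]) -> str:
    """Render a header and rows as a Markdown table, escaping pipe characters."""

    def fmt(cells: list[str]) -> str:
        cleaned = [cell.replace("|", "\\|").strip() for cell in cells]
        return "| " + " | ".join(cleaned) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)
```

Keeping column headers next to their values in the chunk gives the embedding and the LLM a fair chance of answering "what was the 2023 figure?" style questions.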

Caching + performance

Embedding cache by chunk content hash.
Retrieval results cached per (assistant, normalised question) for hot questions.
Vector index uses HNSW (ANN) — search is sub-100ms even on millions of chunks.
Long-running operations (re-embedding a whole library) run as background jobs with progress tracking.
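
A cache key for the (assistant, normalised question) pair might be built like this. The normalisation rules shown (case folding, stripping punctuation, collapsing whitespace) and the key format are assumptions, and the code is Python for illustration only.

```python
import hashlib
import re
import unicodedata


def normalise_question(question: str) -> str:
    """Case-fold, drop punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", question).casefold()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def retrieval_cache_key(assistant_id: str, question: str) -> str:
    """Build a cache key from the assistant ID and the normalised question."""
    digest = hashlib.sha256(normalise_question(question).encode("utf-8")).hexdigest()
    return f"retrieval:{assistant_id}:{digest}"
```

Two phrasings that differ only in case, spacing, or punctuation share a key, while the same question asked of a different assistant does not. Entries would also need invalidating when an assistant's files change.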

What we don't do (yet)

Agentic multi-hop retrieval. Retrieval is currently single-shot; multi-hop is on the roadmap for Phase 31.
Continuous learning from user feedback. Feedback is logged but not yet wired into re-ranking.
Cross-tenant model fine-tuning. Privacy-by-default; no shared embeddings.

Why we built it this way

Speed of iteration, operational simplicity, and the ability to swap providers (LLM, embedding model, vector store) without rewriting the app. Each layer is behind an interface; we've already swapped embedding providers once without users noticing.
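
The "each layer is behind an interface" idea can be sketched as below. The real code is PHP, so this Python version only shows the shape. `EmbeddingProvider`, `FakeProvider`, and `Indexer` are hypothetical names, and the fake provider returns deterministic dummy vectors.

```python
from typing import Protocol, Sequence


class EmbeddingProvider(Protocol):
    """Anything that can turn texts into fixed-size vectors."""

    dimensions: int

    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class FakeProvider:
    """Deterministic stand-in for a real provider, useful in tests."""

    def __init__(self, dimensions: int = 4) -> None:
        self.dimensions = dimensions

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        return [[float(len(t) % (i + 2)) for i in range(self.dimensions)] for t in texts]


class Indexer:
    """Depends only on the interface, so providers can be swapped freely."""

    def __init__(self, provider: EmbeddingProvider) -> None:
        self.provider = provider
        self.rows: list[tuple[str, list[float]]] = []

    def index(self, chunks: Sequence[str]) -> int:
        vectors = self.provider.embed(chunks)
        for text, vector in zip(chunks, vectors):
            if len(vector) != self.provider.dimensions:
                raise ValueError("provider returned a vector of unexpected size")
            self.rows.append((text, vector))
        return len(vectors)
```

Swapping providers then means writing one new class. Vectors from different models are not comparable, though, so a swap also implies re-embedding the library, which is where the background jobs come in.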

If you're building your own RAG system, the boring stack (Postgres + Redis + a known LLM) beats the exotic one at this scale. We'll graduate to a dedicated vector DB if and when pgvector becomes the bottleneck — not before.
