How to extract structured data from scanned PDFs with AI
Receipts, forms, tables, invoices — pulling clean rows out of messy PDFs. A practical guide that survives real-world quality.
The most common "I wish AI could do this" task in small business is data extraction from PDFs. Receipts, invoices, tax forms, 1099s, expense reports. The data is in there — you just need it in a spreadsheet.
Here's how to do it honestly.
What "extract data from a PDF" actually means
There are three flavours, each with different difficulty:
- Born-digital structured PDFs — invoices generated by software with real text layers. Easiest.
- Scanned PDFs with clear text — phone-photo of a receipt or a flatbed scan. Needs OCR but readable.
- Faded, skewed, or handwritten scans — older records, carbon copies, written notes. Hardest; AI helps but doesn't fully solve these.
SeekFiles AI handles all three because it runs OCR on upload and indexes the extracted text alongside any born-digital content.
The workflow
Step 1 — Group similar documents
If you're extracting receipts, put all receipts in a folder. Don't mix receipts with invoices in the same extraction batch — the schema is different.
Step 2 — Define the schema you want
Don't ask the AI to "extract everything." Ask for specific fields:
- For receipts: merchant, date, total, tax, payment_method
- For invoices: invoice_number, issue_date, due_date, line_items, subtotal, tax, total
- For 1099s: payer_name, payer_tin, payee_name, payee_tin, box_1_compensation, box_4_withheld
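It can help to write these schemas down once so every batch uses the same field names. Here is a minimal sketch in Python; the `SCHEMAS` dictionary and the `missing_fields` helper are hypothetical names for illustration, not part of any product API.

```python
# Field lists copied from the schemas above; the dict layout is an assumption.
SCHEMAS = {
    "receipt": ["merchant", "date", "total", "tax", "payment_method"],
    "invoice": ["invoice_number", "issue_date", "due_date", "line_items",
                "subtotal", "tax", "total"],
    "1099": ["payer_name", "payer_tin", "payee_name", "payee_tin",
             "box_1_compensation", "box_4_withheld"],
}


def missing_fields(doc_type, row):
    """Return the schema fields that are absent or blank in an extracted row."""
    missing = []
    for field in SCHEMAS[doc_type]:
        value = row.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing
```

Calling `missing_fields("receipt", row)` on each extracted row gives a quick list of cells that still need a manual look.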
Step 3 — Ask in a structured way
Open a chat in the assistant scoped to that folder. Ask:
"For each receipt in this folder, extract: merchant name, date, total, tax amount, payment method. Format as a markdown table with one row per file."
The retrieval pulls each file's contents; the LLM extracts the requested fields and assembles the table.
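If you run the same extraction every month, you may prefer to generate the prompt from a field list rather than retype it. The sketch below is one way to do that; `build_extraction_prompt` is a hypothetical helper, and the final "leave blank" instruction is an added assumption, not part of the prompt quoted above.

```python
def build_extraction_prompt(doc_label, fields):
    """Build a prompt in the same shape as the receipt example above."""
    field_list = ", ".join(fields)
    return (
        f"For each {doc_label} in this folder, extract: {field_list}. "
        "Format as a markdown table with one row per file. "
        # Assumption: asking for blanks discourages guessed values.
        "Leave a cell blank if the value is not readable."
    )


prompt = build_extraction_prompt(
    "receipt",
    ["merchant name", "date", "total", "tax amount", "payment method"],
)
print(prompt)
```

Paste the printed prompt into the chat scoped to the folder.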
Step 4 — Verify the high-stakes rows
Skim the table against the cited chunks. Mistakes happen most often when:
- A receipt is partially cut off (the total isn't fully readable).
- Two amounts on the same line look like the total but one is a subtotal.
- Currency symbols are missing (the model assumes USD when it's another currency).
Always sanity-check the totals against your bank or card statement.
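Part of this check can be automated once the table is in a script. The following is a rough sketch, assuming each row is a dict with `file`, `total`, and `tax` keys and that you have a list of statement amounts; the function names and the row layout are hypothetical. It flags rows for a manual look and does not replace reading the cited receipt.

```python
from collections import Counter
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Parse text like '$1,234.50' into a Decimal, or return None.

    Assumes '.' is the decimal separator; '1.234,50' style amounts
    would be misread and need separate handling.
    """
    cleaned = "".join(ch for ch in text if ch.isdigit() or ch in ".-")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def flag_rows(rows, statement_amounts):
    """Return (file, reason) pairs for rows that deserve a manual check."""
    unmatched = Counter(statement_amounts)  # each statement line matches once
    flags = []
    for row in rows:
        total = parse_amount(row.get("total", ""))
        tax = parse_amount(row.get("tax", ""))
        if total is None:
            flags.append((row["file"], "total missing or unreadable"))
            continue
        if tax is not None and tax >= total:
            # Often means a subtotal or tax line was picked up as the total.
            flags.append((row["file"], "tax is not smaller than total"))
        if unmatched[total] > 0:
            unmatched[total] -= 1
        else:
            flags.append((row["file"], "no matching statement amount"))
    return flags
```

A missing currency symbol cannot be detected this way; a total that fails to match the statement is the usual hint that the currency was assumed.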
Step 5 — Export
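Copy the table from the chat and paste it into Google Sheets or Excel. A rendered table usually pastes into separate cells; if you get raw pipe-delimited markdown in a single column instead, convert it to CSV before importing.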
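If a paste ends up as raw pipe-delimited text, converting the markdown table to CSV is a reliable fallback, since both Google Sheets and Excel import CSV. A minimal sketch using only the Python standard library; the function names are hypothetical, and escaped pipes inside cells are not handled.

```python
import csv
import io


def markdown_table_to_rows(markdown):
    """Split a pipe-delimited markdown table into a list of cell lists."""
    rows = []
    for line in markdown.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore any prose around the table
        inner = line[1:-1] if line.endswith("|") else line[1:]
        cells = [cell.strip() for cell in inner.split("|")]
        # Skip the header separator row, e.g. |---|:--:|---|
        if all("-" in c and set(c) <= set("-:") for c in cells):
            continue
        rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text, quoting cells that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


table = """
| file | merchant | total |
|---|---|---|
| a.pdf | Cafe, Inc. | 12.50 |
"""
csv_text = rows_to_csv(markdown_table_to_rows(table))
```

Save `csv_text` to a `.csv` file and open it with File > Import in Google Sheets or directly in Excel.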
When this fails
- Handwritten content. OCR isn't reliable on handwriting. Type it manually or use a specialist OCR provider.
- Densely-packed tables with merged cells. AI struggles with layout-heavy tables. For these, use a structured extraction tool like Camelot or invoice2data.
- Inconsistent invoice formats across vendors. Run each vendor in its own batch and merge later.
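For the per-vendor batches in the last point, merging later is mostly a matter of lining up columns. A small sketch, assuming each batch has already been parsed into a list of dicts; `merge_batches` and the vendor names are hypothetical.

```python
def merge_batches(batches, columns):
    """Combine per-vendor row lists into one list with a shared column set.

    batches: dict mapping a vendor label to a list of row dicts.
    columns: the fields to keep; fields a vendor lacks become empty strings.
    """
    merged = []
    for vendor, rows in batches.items():
        for row in rows:
            record = {"vendor": vendor}
            for column in columns:
                record[column] = row.get(column, "")
            merged.append(record)
    return merged


merged = merge_batches(
    {
        "acme": [{"invoice_number": "A-1", "total": "100.00", "due_date": "2024-02-01"}],
        "globex": [{"invoice_number": "G-7", "total": "55.10"}],
    },
    ["invoice_number", "due_date", "total"],
)
```

Blank cells in the merged result show at a glance which vendors' invoices did not carry a given field.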
Cost reality
Extracting 50 receipts costs roughly the same as a 5-minute chat session. Way cheaper than 50 minutes of manual data entry, and the result is auditable because every value is cited back to the receipt page.
For monthly reconciliation, this workflow saves hours every cycle. For one-off questions, it's overkill — just open the PDF.