May 17, 2026 · 5 min read · how-to · use case

How to extract structured data from scanned PDFs with AI

Receipts, forms, tables, invoices — pulling clean rows out of messy PDFs. A practical guide that survives real-world quality.

One of the most common "I wish AI could do this" tasks in small business is data extraction from PDFs: receipts, invoices, tax forms, 1099s, expense reports. The data is in there; you just need it in a spreadsheet.

Here's how to do it honestly.

What "extract data from a PDF" actually means

There are three flavours, each with different difficulty:

  1. Born-digital structured PDFs: invoices generated by software, with a real text layer. Easiest.
  2. Scanned PDFs with clear text: a phone photo of a receipt or a flatbed scan. Needs OCR, but readable.
  3. Faded, skewed, or handwritten scans: older records, carbon copies, written notes. Hardest; AI helps but doesn't solve these.

SeekFiles AI accepts all three because it runs OCR on upload and indexes the extracted text alongside any born-digital content. Accuracy on the third flavour still depends on how readable the scan is.

The workflow

Step 1 — Group similar documents

If you're extracting receipts, put all receipts in a folder. Don't mix receipts with invoices in the same extraction batch — the schema is different.

Step 2 — Define the schema you want

Don't ask the AI to "extract everything." Ask for specific fields:

  • For receipts: merchant, date, total, tax, payment_method
  • For invoices: invoice_number, issue_date, due_date, line_items, subtotal, tax, total
  • For 1099s: payer_name, payer_tin, payee_name, payee_tin, box_1_compensation, box_4_withheld
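If you keep the field lists in one place, you can reuse them for prompts and for checking results later. Here's a minimal sketch of that idea. The `SCHEMAS` dictionary and the `missing_fields` helper are illustrative names invented for this example, not part of any SeekFiles AI API.

```python
# Hypothetical field lists that mirror the bullets above.
SCHEMAS = {
    "receipt": ["merchant", "date", "total", "tax", "payment_method"],
    "invoice": ["invoice_number", "issue_date", "due_date", "line_items",
                "subtotal", "tax", "total"],
    "1099": ["payer_name", "payer_tin", "payee_name", "payee_tin",
             "box_1_compensation", "box_4_withheld"],
}


def missing_fields(doc_type: str, row: dict) -> list[str]:
    """Return the schema fields that are absent or blank in an extracted row."""
    missing = []
    for field in SCHEMAS[doc_type]:
        value = row.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing
```

A row that comes back with blank fields is a good candidate for the manual check in Step 4.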

Step 3 — Ask in a structured way

Open a chat in the assistant scoped to that folder. Ask:

"For each receipt in this folder, extract: merchant name, date, total, tax amount, payment method. Format as a markdown table with one row per file."

The retrieval pulls each file's contents; the LLM extracts the requested fields and assembles the table.
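If you run the same extraction every month, it can help to generate the prompt from a field list so the wording never drifts between batches. The sketch below is one way to do that. `build_extraction_prompt` is a hypothetical helper, and the final "leave a cell blank" sentence is an assumption added here, not part of the prompt quoted above.

```python
def build_extraction_prompt(doc_label: str, fields: list[str]) -> str:
    """Build a chat prompt that asks for specific fields as a markdown table."""
    if not fields:
        raise ValueError("ask for specific fields, not 'everything'")
    field_list = ", ".join(fields)
    return (
        f"For each {doc_label} in this folder, extract: {field_list}. "
        "Format as a markdown table with one row per file. "
        # Assumption: asking for blanks discourages guessed values.
        "Leave a cell blank if the value is not readable."
    )


prompt = build_extraction_prompt(
    "receipt",
    ["merchant name", "date", "total", "tax amount", "payment method"],
)
```

Paste the resulting `prompt` string into the chat scoped to the folder.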

Step 4 — Verify the high-stakes rows

Skim the table against the cited chunks. Mistakes happen most often when:

  • A receipt is partially cut off (the total isn't fully readable).
  • Two amounts on the same line look like the total but one is a subtotal.
  • Currency symbols are missing (the model assumes USD when it's another currency).

Always sanity-check the totals against your bank or card statement.
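Some of these mistakes can be flagged automatically before you start skimming. The sketch below assumes each row is a dictionary of strings with `total`, and optionally `subtotal` and `tax`, keys. The function names, the list of currency markers, and the US-style number format are all assumptions made for illustration.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional

# Assumed set of currency markers; extend it for the currencies you actually see.
CURRENCY_MARKS = ("$", "€", "£", "USD", "EUR", "GBP")


def parse_amount(text: Optional[str]) -> Optional[Decimal]:
    """Turn a cell such as '$1,234.50' into a Decimal, or None if unreadable."""
    cleaned = text or ""
    for mark in CURRENCY_MARKS:
        cleaned = cleaned.replace(mark, "")
    # Assumes ',' is a thousands separator and '.' is the decimal point.
    cleaned = cleaned.replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def flag_row(row: dict) -> list[str]:
    """Return the reasons a row deserves a manual look (empty list if none)."""
    raw_total = row.get("total") or ""
    total = parse_amount(raw_total)
    if total is None:
        return ["total missing or unreadable"]
    reasons = []
    if not any(mark in raw_total for mark in CURRENCY_MARKS):
        reasons.append("no currency marker on total")
    subtotal = parse_amount(row.get("subtotal"))
    tax = parse_amount(row.get("tax"))
    # Only check the arithmetic when both parts were extracted.
    if subtotal is not None and tax is not None and subtotal + tax != total:
        reasons.append("subtotal + tax does not equal total")
    return reasons
```

A clean result here is not proof the row is right. It only narrows down which rows to compare against the statement first.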

Step 5 — Export

Copy the rendered table and paste it into Google Sheets or Excel. If the paste lands in a single column as raw pipe-delimited text, convert the markdown to CSV first and import that instead.
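When you only have the raw markdown text, a few lines of Python will turn it into CSV that any spreadsheet can import. This is a minimal sketch using only the standard library. It assumes a simple pipe table with no escaped `\|` characters inside cells, and the function names are made up for this example.

```python
import csv
import io


def markdown_table_to_rows(markdown: str) -> list[list[str]]:
    """Parse a simple pipe-delimited markdown table into rows of cell strings."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip prose around the table
        inner = line[1:-1] if line.endswith("|") else line[1:]
        cells = [cell.strip() for cell in inner.split("|")]
        # Skip the header separator row, e.g. | --- | :---: |
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


def rows_to_csv(rows: list[list[str]]) -> str:
    """Serialize rows as CSV text, quoting cells that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Save the output of `rows_to_csv(markdown_table_to_rows(text))` to a `.csv` file and open it with the spreadsheet's import option.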

When this fails

  • Handwritten content. OCR isn't reliable on handwriting. Type it manually or use a specialist OCR provider.
  • Densely packed tables with merged cells. AI struggles with layout-heavy tables. For these, use a structured extraction tool like Camelot or invoice2data. Note that Camelot needs a real text layer, so it suits born-digital PDFs rather than raw scans.
  • Inconsistent invoice formats across vendors. Run each vendor in its own batch and merge later.
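For the per-vendor approach, the merge step mostly means renaming each vendor's columns to a common set and tagging every row with its source. Here's a rough sketch. `COMMON_FIELDS`, the `aliases` mapping, and the vendor names in the usage example are hypothetical.

```python
# Hypothetical common schema for the merged table.
COMMON_FIELDS = ["vendor", "invoice_number", "issue_date", "total"]


def merge_vendor_batches(batches: dict, aliases: dict) -> list[dict]:
    """Combine per-vendor rows into one list with a shared set of columns.

    batches: {vendor: [row dicts]} as produced by each extraction batch.
    aliases: {vendor: {vendor_column: common_column}} for columns to rename.
    """
    merged = []
    for vendor, rows in batches.items():
        rename = aliases.get(vendor, {})
        for row in rows:
            normalized = {rename.get(key, key): value for key, value in row.items()}
            normalized["vendor"] = vendor
            # Keep only the common columns; fill gaps with an empty string.
            merged.append({field: normalized.get(field, "") for field in COMMON_FIELDS})
    return merged


merged = merge_vendor_batches(
    {
        "acme": [{"inv_no": "A-1", "issue_date": "2026-05-01", "total": "$10.00"}],
        "globex": [{"invoice_number": "G-7", "total": "$5.00", "po": "77"}],
    },
    {"acme": {"inv_no": "invoice_number"}},
)
```

Blank cells in the merged table show where a vendor's format lacks a field, which is worth knowing before reconciliation.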

Cost reality

Extracting 50 receipts costs roughly the same as a 5-minute chat session. Way cheaper than 50 minutes of manual data entry, and the result is auditable because every value is cited back to the receipt page.

For monthly reconciliation, this workflow saves hours every cycle. For one-off questions, it's overkill — just open the PDF.
