I Built a Free Invoice Extraction Tool — Here's What I Learned About PDF Parsing

April 2026 · 8 min read

PDF parsing sounds easy until you try it. "Just read the text" — sure, if the text is actually text and not a collection of absolutely positioned glyphs rendered by a layout engine designed in 1993.

We built a free browser-based invoice extractor and learned more about the dark corners of the PDF spec than we ever wanted to. Here's what we found.

The Problem

Businesses process millions of invoices. Most arrive as PDFs. Getting the data out — vendor name, invoice number, line items, amounts, tax, total — is still a manual process for a shocking number of companies.

The enterprise solutions (Rossum, Nanonets, Mindee) work well but cost $200+/month. The free tools (Tabula, pdf.js) work on simple cases but break on real-world invoices. We wanted to build something in between: free, accurate enough for 80% of invoices, and running entirely in the browser for privacy.

Lesson 1: PDFs Don't Have "Text"

The first surprise: a PDF doesn't contain text the way you think it does. It contains a series of drawing instructions. "Move to position (72, 340). Set font to Helvetica 10pt. Draw glyphs 'I', 'n', 'v', 'o', 'i', 'c', 'e'."

Unless the file carries optional structure tags, there's no concept of a paragraph, a table cell, or even a word. You have to reconstruct all of that from the raw glyph positions. This means:

Word boundaries — you guess based on gaps between glyphs
Line boundaries — you guess based on vertical position changes
Table cells — you guess based on column alignment

And "guess" is doing a lot of work in those sentences.
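
Here's a minimal sketch of that guessing, not our production code. It assumes each glyph arrives as a hypothetical { char, x, y, width } object in PDF points (with y growing upward), and both thresholds are illustrative values you would have to tune per document.

```javascript
// Hypothetical glyph shape: { char, x, y, width }, all in PDF points.
// lineTolerance and wordGapRatio are illustrative guesses, not tuned values.
function groupGlyphs(glyphs, { lineTolerance = 2, wordGapRatio = 0.3 } = {}) {
  // PDF y grows upward, so sort descending to read top-to-bottom.
  const sorted = [...glyphs].sort((a, b) => b.y - a.y || a.x - b.x);

  // Line boundaries: start a new line when y moves more than the tolerance.
  const lines = [];
  for (const g of sorted) {
    const line = lines[lines.length - 1];
    if (line && Math.abs(line.y - g.y) <= lineTolerance) {
      line.glyphs.push(g);
    } else {
      lines.push({ y: g.y, glyphs: [g] });
    }
  }

  // Word boundaries: split when the gap after the previous glyph is wide
  // relative to that glyph's width.
  return lines.map(({ glyphs: row }) => {
    row.sort((a, b) => a.x - b.x);
    const words = [];
    let current = "";
    let prev = null;
    for (const g of row) {
      const gap = prev ? g.x - (prev.x + prev.width) : 0;
      if (prev && gap > prev.width * wordGapRatio) {
        words.push(current);
        current = "";
      }
      current += g.char;
      prev = g;
    }
    if (current) words.push(current);
    return words;
  });
}
```

The result is an array of lines, each an array of word strings. Tight kerning, justified text, and superscripts all break these two thresholds in different ways, which is why "guess" is the right word.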

Lesson 2: Every Invoice Is a Snowflake

We tested with invoices from 200+ companies. Some patterns:

40% put the total in the bottom-right. 30% put it in the bottom-left. 20% put it somewhere random. 10% have multiple "total" fields that mean different things.
"Invoice Number" appears as: Invoice #, Inv No., Invoice Number, Reference, Document ID, Bill Number, and about 15 other variations.
Date formats: 04/12/2026, 12/04/2026, 2026-04-12, April 12, 2026, 12-Apr-26, 12.04.2026. And you can't reliably tell if 04/12 is April 12th or December 4th without country context.
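
To make the date ambiguity concrete, here is a small sketch (an illustration, not our extractor's code) that returns every valid reading of a numeric date instead of silently picking one. The function name readNumericDate and the { order, iso } result shape are made up for this example, and it only handles day/month/four-digit-year strings separated by "/", "." or "-".

```javascript
// Returns every calendar-valid reading of a numeric date such as "04/12/2026".
// Names and result shape are hypothetical, chosen for this example only.
function readNumericDate(text) {
  const m = /^(\d{1,2})[\/.\-](\d{1,2})[\/.\-](\d{4})$/.exec(text.trim());
  if (!m) return [];
  const [a, b, year] = [Number(m[1]), Number(m[2]), Number(m[3])];

  // A (month, day) pair is valid if Date does not roll it over to another date.
  const isValid = (month, day) => {
    const d = new Date(Date.UTC(year, month - 1, day));
    return (
      d.getUTCFullYear() === year &&
      d.getUTCMonth() === month - 1 &&
      d.getUTCDate() === day
    );
  };
  const iso = (month, day) =>
    `${year}-${String(month).padStart(2, "0")}-${String(day).padStart(2, "0")}`;

  const readings = [];
  if (isValid(a, b)) readings.push({ order: "MDY", iso: iso(a, b) });
  // When both numbers are equal the two readings are the same date, so skip it.
  if (a !== b && isValid(b, a)) readings.push({ order: "DMY", iso: iso(b, a) });
  return readings;
}

readNumericDate("04/12/2026"); // two readings: 2026-04-12 (MDY) and 2026-12-04 (DMY)
readNumericDate("13/04/2026"); // one reading: 2026-04-13 (DMY)
```

When two readings come back, the caller needs outside context (vendor country, other dates on the same invoice) or a human to choose between them.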

Lesson 3: Tables Are the Hard Part

Most invoice data lives in tables. Line items, quantities, unit prices, amounts — all in a tabular format. Except:

Some tables have visible gridlines. Some don't.
Some tables have consistent column spacing. Some use variable spacing that depends on content width.
Some cells span multiple lines (a product description that wraps).
Some invoices have nested tables (grouped by category or delivery).

The approach that actually works: combine column detection (finding x-positions where text consistently lines up down the page) with row detection (finding consistent vertical gaps between lines of text). Then cross-reference with the column headers to figure out which column each value belongs to.
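
A stripped-down sketch of that idea follows. It is not our actual table detector: it assumes hypothetical text items shaped like { str, x, y } where x is the left edge, clusters only left edges (so right-aligned number columns would need the same treatment on right edges), and treats the topmost row as the header. All function names and tolerances are invented for the example.

```javascript
// Hypothetical item shape: { str, x, y }, x = left edge, y grows upward.

// Column detection: cluster left edges that line up within `tolerance` points.
function detectColumns(items, { tolerance = 3, minRows = 2 } = {}) {
  const xs = items.map((it) => it.x).sort((a, b) => a - b);
  const clusters = [];
  for (const x of xs) {
    const last = clusters[clusters.length - 1];
    if (last && x - last.max <= tolerance) {
      last.max = x;
      last.sum += x;
      last.count += 1;
    } else {
      clusters.push({ max: x, sum: x, count: 1 });
    }
  }
  // Keep alignments that repeat on enough rows; return their mean x.
  return clusters.filter((c) => c.count >= minRows).map((c) => c.sum / c.count);
}

// Row detection: group items whose y values sit within `rowTolerance`,
// then drop each item into its nearest column.
function buildRows(items, columns, { rowTolerance = 2 } = {}) {
  const sorted = [...items].sort((a, b) => b.y - a.y);
  const rows = [];
  for (const it of sorted) {
    let row = rows[rows.length - 1];
    if (!row || Math.abs(row.y - it.y) > rowTolerance) {
      row = { y: it.y, cells: columns.map(() => "") };
      rows.push(row);
    }
    let best = 0;
    columns.forEach((cx, i) => {
      if (Math.abs(cx - it.x) < Math.abs(columns[best] - it.x)) best = i;
    });
    row.cells[best] = row.cells[best] ? `${row.cells[best]} ${it.str}` : it.str;
  }
  return rows.map((r) => r.cells);
}

// Cross-reference with the header row to name each value.
function tableToRecords(items) {
  const columns = detectColumns(items);
  const [header, ...body] = buildRows(items, columns);
  return body.map((cells) =>
    Object.fromEntries(header.map((name, i) => [name, cells[i]]))
  );
}
```

Calling tableToRecords on the items of a simple three-column table yields one object per line item, keyed by header text. Wrapped descriptions and nested tables defeat this version, because it treats every distinct y as a new row.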

Lesson 4: Scanned PDFs Are a Different Beast

About 15% of the invoices we tested were scanned images embedded in PDFs. For these, you need OCR (Optical Character Recognition) before you can do anything else.

Browser-based OCR with Tesseract.js works, but it's slow (~5-10 seconds per page) and accuracy varies wildly with scan quality. The best approach for production is to detect whether a PDF is text-based or image-based and route accordingly: text-based files go straight to text extraction, and only image-based ones pay the OCR cost.
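
One simple way to make that routing decision is to count how much text the page actually yields. The sketch below is an assumption about how you might do it, not a description of our detector: it takes objects shaped like what pdf.js returns from page.getTextContent() ({ items: [{ str }, ...] }), and the function names and the 20-character threshold are arbitrary choices for illustration.

```javascript
// `textContent` is assumed to look like the result of pdf.js
// page.getTextContent(): { items: [{ str: "..." }, ...] }.
// minChars is an arbitrary illustrative threshold.
function needsOcr(textContent, minChars = 20) {
  const chars = textContent.items.reduce(
    // Items without a `str` (for example marked-content markers) count as empty.
    (n, item) => n + (item.str || "").trim().length,
    0
  );
  return chars < minChars;
}

// Decide a route for every page: "text" for direct extraction, "ocr" otherwise.
function routePages(pageTextContents, minChars = 20) {
  return pageTextContents.map((textContent, i) => ({
    page: i + 1,
    route: needsOcr(textContent, minChars) ? "ocr" : "text",
  }));
}
```

Routing per page rather than per file handles mixed documents, such as a digital invoice with a scanned delivery note attached. A scan that already carries a hidden OCR text layer will be routed as "text" by this check, which may or may not be what you want.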

Lesson 5: "Good Enough" Beats "Perfect"

After months of work, here's our honest accuracy assessment:

Vendor name/address: 95% accuracy
Invoice number: 90% accuracy
Dates: 85% accuracy (date format ambiguity is the killer)
Line items: 80% accuracy on simple tables, 60% on complex ones
Totals: 92% accuracy

That sounds low for automation. But here's the thing: 80% extracted + human review is still 5x faster than 100% manual entry. The tool doesn't need to be perfect — it needs to be a better starting point than a blank spreadsheet.

The goal isn't to replace the human. It's to do the boring part so the human can focus on the exceptions.

What We Built

Our invoice extractor runs entirely in the browser. Your PDF never leaves your machine — which matters when you're handling sensitive financial documents.

It uses pdf.js for text extraction, a custom table detection algorithm for layout analysis, and pattern matching for field identification. No AI, no cloud APIs, no data collection. Just JavaScript and your browser.
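
To give a flavor of what "pattern matching for field identification" means, here is a simplified sketch built from a handful of the label variations listed in Lesson 2. It is an illustration only: the pattern, the findInvoiceNumber name, and the rule that a value must contain at least one digit are assumptions made for this example, and a real pattern list would be much longer.

```javascript
// Label variants taken from the list in Lesson 2; deliberately incomplete.
// The captured value must contain a digit, so "Reference: see attached"
// is not mistaken for an invoice number.
const INVOICE_NUMBER_LABEL =
  /\b(?:invoice\s*(?:#|no\.?|number)|inv\.?\s*(?:#|no\.?)|reference|document\s*id|bill\s*(?:no\.?|number))\s*[:\-]?\s*((?=[A-Z0-9\-\/]*\d)[A-Z0-9][A-Z0-9\-\/]*)/i;

// `lines` is an array of already-reconstructed text lines.
// Returns the first value that follows a known label, or null.
function findInvoiceNumber(lines) {
  for (const line of lines) {
    const match = INVOICE_NUMBER_LABEL.exec(line);
    if (match) return match[1];
  }
  return null;
}

findInvoiceNumber(["Acme Corp", "Invoice Number: INV-2026-0042"]); // "INV-2026-0042"
findInvoiceNumber(["Reference: see attached"]); // null
```

This only finds values that sit on the same line as their label. Invoices that put the label above the value, or in a neighboring table cell, need the position data from the layout step.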

Try it free → Extract your first invoice in 60 seconds

No signup. No upload. Your files stay on your machine.

What's Next

We're working on:

Better table detection for complex layouts
An API for batch processing (email us if you're interested)
Better handling of multi-page invoices
Integration guides for QuickBooks, Xero, and Sage

If you process invoices and want to save time, give it a try. If it doesn't work on your invoices, tell us — every failure report makes the tool better.
