I Built a Free Invoice Extraction Tool — Here's What I Learned About PDF Parsing

April 2026 · 8 min read

PDF parsing sounds easy until you try it. "Just read the text" — sure, if the text is actually text and not a collection of absolutely positioned glyphs rendered by a layout engine designed in 1993.

We built a free browser-based invoice extractor and learned more about the dark corners of the PDF spec than we ever wanted to. Here's what we found.

The Problem

Businesses process millions of invoices. Most arrive as PDFs. Getting the data out — vendor name, invoice number, line items, amounts, tax, total — is still a manual process for a shocking number of companies.

The enterprise solutions (Rossum, Nanonets, Mindee) work well but cost $200+/month. The free tools (Tabula, pdf.js) handle simple cases but break on real-world invoices. We wanted to build something in between: free, accurate enough for roughly 80% of invoices, and able to run entirely in the browser for privacy.

Lesson 1: PDFs Don't Have "Text"

The first surprise: a PDF doesn't contain text the way you think it does. It contains a series of drawing instructions. "Move to position (72, 340). Set font to Helvetica 10pt. Draw glyphs 'I', 'n', 'v', 'o', 'i', 'c', 'e'."

There's no concept of a paragraph, a table cell, or even a word. You have to reconstruct all of that from the raw glyph positions. This means:

And "guess" is doing a lot of work in those sentences.
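To make the reconstruction problem concrete, here is a minimal sketch of turning positioned text fragments back into lines and words. It is an illustration, not our production code. The input mimics the shape of pdf.js `getTextContent()` items (`str`, `width`, and a `transform` array whose last two entries are x and y); the function names and the `yTolerance` and `gapFactor` thresholds are assumptions made up for this example.

```javascript
// Group positioned fragments into lines, then guess where the word breaks are.
// Item shape (modeled on pdf.js text items): { str, width, transform: [a, b, c, d, x, y] }
// yTolerance and gapFactor are hypothetical tuning values, not measured ones.

function groupIntoLines(items, yTolerance = 2) {
  // PDF y coordinates grow upward, so sort descending to read top to bottom.
  const sorted = [...items].sort((p, q) => q.transform[5] - p.transform[5]);
  const lines = [];
  for (const item of sorted) {
    const y = item.transform[5];
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.y - y) <= yTolerance) {
      last.items.push(item); // close enough vertically: same line
    } else {
      lines.push({ y, items: [item] });
    }
  }
  return lines;
}

function lineToText(line, gapFactor = 0.3) {
  // Order fragments left to right before joining them.
  const items = [...line.items].sort((p, q) => p.transform[4] - q.transform[4]);
  let text = "";
  let prevEnd = null;
  for (const item of items) {
    const x = item.transform[4];
    const avgCharWidth = item.str.length > 0 ? item.width / item.str.length : 0;
    // Guess a word break when the gap is wider than a fraction of a glyph.
    if (prevEnd !== null && x - prevEnd > avgCharWidth * gapFactor) {
      text += " ";
    }
    text += item.str;
    prevEnd = x + item.width;
  }
  return text;
}

const sampleItems = [
  { str: "#123", width: 20, transform: [1, 0, 0, 1, 112, 340] },
  { str: "Total", width: 25, transform: [1, 0, 0, 1, 72, 320] },
  { str: "Inv", width: 15, transform: [1, 0, 0, 1, 72, 340.5] },
  { str: "oice", width: 20, transform: [1, 0, 0, 1, 87, 340] },
];
const sampleLines = groupIntoLines(sampleItems).map((l) => lineToText(l));
console.log(sampleLines);
```

Both thresholds are guesses. A tolerance that works for 10pt body text can merge two rows of 6pt fine print, which is why real invoices need per-document tuning.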

Lesson 2: Every Invoice Is a Snowflake

We tested with invoices from 200+ companies. Some patterns:

Lesson 3: Tables Are the Hard Part

Most invoice data lives in tables. Line items, quantities, unit prices, amounts — all in a tabular format. Except:

The approach that actually works: combine column detection (finding the x-positions where text consistently lines up) with row detection (finding the horizontal bands of whitespace that separate one row from the next). Then cross-reference the detected columns with the column headers to figure out which column each value belongs to.
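A rough sketch of that idea, under simplifying assumptions: each cell is a hypothetical `{ text, x, y }` object where `x` is the left edge, columns are found by clustering left edges, and each value goes to the nearest header. The function names and tolerances are invented for illustration. A real detector also has to handle right-aligned numbers, which this sketch ignores.

```javascript
// Column detection: cluster left-edge x positions that recur across rows.
// xTolerance and minSupport are hypothetical thresholds.
function detectColumns(cells, xTolerance = 5, minSupport = 2) {
  const xs = cells.map((c) => c.x).sort((a, b) => a - b);
  const clusters = [];
  for (const x of xs) {
    const last = clusters[clusters.length - 1];
    if (last && x - last.max <= xTolerance) {
      last.sum += x;
      last.count += 1;
      last.max = x;
    } else {
      clusters.push({ sum: x, count: 1, max: x });
    }
  }
  // Keep only alignments seen in several cells; return their mean x position.
  return clusters.filter((c) => c.count >= minSupport).map((c) => c.sum / c.count);
}

// Row detection plus header cross-reference: group cells by y, then assign
// each cell to the header whose x position is nearest.
function buildRows(headers, cells, yTolerance = 2) {
  const sorted = [...cells].sort((p, q) => q.y - p.y); // top to bottom
  const rows = [];
  for (const cell of sorted) {
    let row = rows[rows.length - 1];
    if (!row || Math.abs(row.y - cell.y) > yTolerance) {
      row = { y: cell.y, values: {} };
      rows.push(row);
    }
    let best = headers[0];
    for (const h of headers) {
      if (Math.abs(h.x - cell.x) < Math.abs(best.x - cell.x)) best = h;
    }
    row.values[best.name] = cell.text;
  }
  return rows.map((r) => r.values);
}

const sampleCells = [
  { text: "Widget", x: 72, y: 300 },
  { text: "2", x: 300, y: 300 },
  { text: "10.00", x: 400, y: 300 },
  { text: "Gadget", x: 73, y: 280 },
  { text: "1", x: 301, y: 280 },
  { text: "25.00", x: 399, y: 280 },
];
const sampleHeaders = [
  { name: "Description", x: 72 },
  { name: "Qty", x: 300 },
  { name: "Amount", x: 400 },
];
const columns = detectColumns(sampleCells);
const tableRows = buildRows(sampleHeaders, sampleCells);
console.log(columns, tableRows);
```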

Lesson 4: Scanned PDFs Are a Different Beast

About 15% of the invoices we tested were scanned images embedded in PDFs. For these, you need OCR (Optical Character Recognition) before you can do anything else.

Browser-based OCR with Tesseract.js works, but it's slow (roughly 5 to 10 seconds per page) and accuracy varies wildly with scan quality. The best approach for production is to detect whether each PDF is text-based or image-based and route it accordingly: straight to text extraction for the former, through OCR first for the latter.
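One simple way to make that routing decision is to count how much extractable text each page has. The sketch below assumes you already have one string per page (for example, the joined `str` values from pdf.js `getTextContent()`); the function name `classifyPages` and the `minCharsPerPage` threshold are hypothetical. It is a heuristic, not a guarantee: some scanned PDFs carry a hidden OCR text layer and will look text-based.

```javascript
// Decide per page whether to use direct text extraction or OCR.
// pageTexts: array of strings, one per page, already extracted elsewhere.
// minCharsPerPage is a hypothetical cutoff; tune it on your own documents.
function classifyPages(pageTexts, minCharsPerPage = 20) {
  return pageTexts.map((text, index) => {
    // Ignore whitespace so a page of blank lines does not count as text.
    const visibleChars = text.replace(/\s+/g, "").length;
    return {
      page: index + 1,
      route: visibleChars >= minCharsPerPage ? "text" : "ocr",
    };
  });
}

const routes = classifyPages([
  "Invoice #123 Total due 450.00 EUR", // a normal text page
  "  \n ", // a scanned page: nothing but whitespace comes back
]);
console.log(routes);
```

Routing per page rather than per file matters because some invoices mix a text-based cover page with scanned attachments.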

Lesson 5: "Good Enough" Beats "Perfect"

After months of work, here's our honest accuracy assessment:

That sounds low for automation. But here's the thing: 80% extracted + human review is still 5x faster than 100% manual entry. The tool doesn't need to be perfect — it needs to be a better starting point than a blank spreadsheet.

The goal isn't to replace the human. It's to do the boring part so the human can focus on the exceptions.

What We Built

Our invoice extractor runs entirely in the browser. Your PDF never leaves your machine — which matters when you're handling sensitive financial documents.

It uses pdf.js for text extraction, a custom table detection algorithm for layout analysis, and pattern matching for field identification. No AI, no cloud APIs, no data collection. Just JavaScript and your browser.
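For a flavor of what pattern matching for field identification can look like, here is a small sketch that runs over reconstructed text lines. The regular expressions and the names `FIELD_PATTERNS` and `extractFields` are illustrative assumptions, not the patterns the tool ships with; real invoices need many more variants per field.

```javascript
// Hypothetical patterns for two fields. Each captures the value in group 1.
const FIELD_PATTERNS = {
  // Matches "Invoice No: INV-2026-0042", "Invoice # 981", "Invoice Number 77/A"
  invoiceNumber: /invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-\/]*)/i,
  // \b keeps "Subtotal" from matching; the currency symbol is optional.
  total: /\btotal(?:\s+due)?\s*[:\-]?\s*[$€£]?\s*([\d.,]*\d)/i,
};

// Return the first match for each field, scanning lines top to bottom.
function extractFields(lines) {
  const fields = {};
  for (const [name, pattern] of Object.entries(FIELD_PATTERNS)) {
    for (const line of lines) {
      const match = pattern.exec(line);
      if (match) {
        fields[name] = match[1];
        break;
      }
    }
  }
  return fields;
}

const extracted = extractFields([
  "ACME Corp",
  "Invoice No: INV-2026-0042",
  "Subtotal 400.00",
  "Tax 50.00",
  "Total: $450.00",
]);
console.log(extracted);
```

Taking the first match is a deliberate simplification. On invoices that repeat a label, such as a "Total" in both the table and the summary, a position-aware rule would be needed.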

Try it free → Extract your first invoice in 60 seconds

No signup. No upload. Your files stay on your machine.

What's Next

We're working on:

If you process invoices and want to save time, give it a try. If it doesn't work on your invoices, tell us — every failure report makes the tool better.

