PDF parsing sounds easy until you try it. "Just read the text" — sure, if the text is actually text and not a collection of absolutely positioned glyphs rendered by a layout engine designed in 1993.
We built a free browser-based invoice extractor and learned more about the dark corners of the PDF spec than we ever wanted to. Here's what we found.
Businesses process millions of invoices. Most arrive as PDFs. Getting the data out — vendor name, invoice number, line items, amounts, tax, total — is still a manual process for a shocking number of companies.
The enterprise solutions (Rossum, Nanonets, Mindee) work well but cost $200+/month. The free tools (Tabula, pdf.js) work on simple cases but break on real-world invoices. We wanted to build something in between: free, accurate enough for 80% of invoices, and running entirely in the browser for privacy.
The first surprise: a PDF doesn't contain text the way you think it does. It contains a series of drawing instructions. "Move to position (72, 340). Set font to Helvetica 10pt. Draw glyphs 'I', 'n', 'v', 'o', 'i', 'c', 'e'."
There's no concept of a paragraph, a table cell, or even a word. You have to reconstruct all of that from the raw glyph positions. This means:
And "guess" is doing a lot of work in those sentences.
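To make that reconstruction step concrete, here is a minimal sketch of one way to turn positioned fragments back into lines and words. It is not the extractor's actual code: the item shape (`{ str, x, y, width }`), the function names, and both tolerance values are assumptions chosen for illustration. pdf.js reports each item's position inside its `transform` array, so a real pipeline would flatten that first.

```javascript
// Items are assumed to look like { str, x, y, width }, with y as the baseline.
// (Hypothetical shape: pdf.js exposes x and y as transform[4] and transform[5].)

// Group items into lines: items whose baselines are within yTolerance of the
// first item on a line are treated as part of that line.
function groupIntoLines(items, yTolerance = 2) {
  const sorted = [...items].sort((a, b) => b.y - a.y || a.x - b.x);
  const lines = [];
  for (const item of sorted) {
    const line = lines[lines.length - 1];
    if (line && Math.abs(line.y - item.y) <= yTolerance) {
      line.items.push(item);
    } else {
      lines.push({ y: item.y, items: [item] });
    }
  }
  // Within a line, reading order is left to right.
  for (const line of lines) line.items.sort((a, b) => a.x - b.x);
  return lines;
}

// Join a line's items into text, inserting a space only when the horizontal
// gap between consecutive items exceeds gapThreshold. This is the "guess".
function lineToText(line, gapThreshold = 1.5) {
  let text = "";
  let prevEnd = null;
  for (const item of line.items) {
    if (prevEnd !== null && item.x - prevEnd > gapThreshold) text += " ";
    text += item.str;
    prevEnd = item.x + item.width;
  }
  return text;
}

function reconstructText(items) {
  return groupIntoLines(items).map((line) => lineToText(line));
}
```

Both thresholds are in PDF units and would need tuning per font size. A fixed gap that works for 10pt body text will merge words in a 6pt footer and split letters in a 24pt heading.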
We tested with invoices from 200+ companies. Some patterns:
Invoice number labels: Invoice #, Inv No., Invoice Number, Reference, Document ID, Bill Number, and about 15 other variations.
Date formats: 04/12/2026, 12/04/2026, 2026-04-12, April 12, 2026, 12-Apr-26, 12.04.2026. And you can't reliably tell if 04/12 is April 12th or December 4th without country context.
Most invoice data lives in tables. Line items, quantities, unit prices, amounts: all in a tabular format. Except:
The approach that actually works: use a combination of column detection (finding consistent vertical alignments) and row detection (finding consistent horizontal gaps). Then cross-reference with the column headers to figure out which column each value belongs to.
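A rough sketch of that idea, under some loud assumptions: cells arrive as `{ str, x, y }` objects (a hypothetical shape, not the pdf.js format), columns are left-aligned, the top row holds the headers, and rows are grouped by shared baseline rather than by measuring gaps. All function names and tolerances are made up for the example.

```javascript
// Column detection: cluster left edges that line up vertically (within
// xTolerance) and keep clusters that recur in at least minCount cells.
function detectColumns(cells, xTolerance = 3, minCount = 2) {
  const xs = cells.map((c) => c.x).sort((a, b) => a - b);
  const clusters = [];
  for (const x of xs) {
    const last = clusters[clusters.length - 1];
    if (last && x - last.max <= xTolerance) {
      last.max = x;
      last.count += 1;
    } else {
      clusters.push({ min: x, max: x, count: 1 });
    }
  }
  return clusters
    .filter((c) => c.count >= minCount)
    .map((c) => (c.min + c.max) / 2);
}

// Row detection (simplified): cells whose baselines fall within yTolerance
// of the row's first cell share a row.
function detectRows(cells, yTolerance = 2) {
  const sorted = [...cells].sort((a, b) => b.y - a.y);
  const rows = [];
  for (const cell of sorted) {
    const row = rows[rows.length - 1];
    if (row && Math.abs(row.y - cell.y) <= yTolerance) row.cells.push(cell);
    else rows.push({ y: cell.y, cells: [cell] });
  }
  return rows;
}

// Index of the value closest to target.
function nearestIndex(values, target) {
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (Math.abs(values[i] - target) < Math.abs(values[best] - target)) best = i;
  }
  return best;
}

// Cross-reference: the top row is assumed to hold the headers; every other
// cell is filed under the header of its nearest column.
function extractTable(cells) {
  const columns = detectColumns(cells);
  if (columns.length === 0) return [];
  const [headerRow, ...bodyRows] = detectRows(cells);
  const headers = columns.map(() => null);
  for (const cell of headerRow.cells) {
    headers[nearestIndex(columns, cell.x)] = cell.str;
  }
  return bodyRows.map((row) => {
    const record = {};
    for (const cell of row.cells) {
      const name = headers[nearestIndex(columns, cell.x)];
      if (name !== null) record[name] = cell.str;
    }
    return record;
  });
}
```

Right-aligned amount columns would need the same clustering applied to right edges (`x + width`), which this sketch leaves out.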
About 15% of the invoices we tested were scanned images embedded in PDFs. For these, you need OCR (Optical Character Recognition) before you can do anything else.
Browser-based OCR with Tesseract.js works, but it's slow (~5-10 seconds per page) and accuracy varies wildly with scan quality. The best approach for production is to detect whether a PDF is text-based or image-based and route accordingly: read the text layer directly when there is one, and pay the OCR cost only when there isn't.
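One simple way to make that routing decision is to count how many real characters a page's text layer contains. The sketch below assumes each page has already been reduced to an array of items with a `str` property, which is what pdf.js returns from `getTextContent()`. The `minChars` threshold and the function names are assumptions, not values from our tool.

```javascript
// Decide whether a page has a usable text layer by counting its
// non-whitespace characters. minChars is an arbitrary cutoff.
function hasTextLayer(textItems, minChars = 20) {
  const chars = textItems.reduce(
    (sum, item) => sum + (item.str || "").replace(/\s+/g, "").length,
    0
  );
  return chars >= minChars;
}

// Route each page: pages with a text layer go to direct extraction,
// everything else goes to OCR.
function routePages(pages) {
  return pages.map((textItems, index) => ({
    page: index + 1,
    route: hasTextLayer(textItems) ? "text" : "ocr",
  }));
}
```

Routing per page rather than per file handles the mixed case, where a text-based invoice has a scanned page appended. A character count alone will still be fooled by scans that carry a low-quality hidden text layer from an earlier OCR pass.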
After months of work, here's our honest accuracy assessment:
That sounds low for automation. But here's the thing: 80% extracted + human review is still 5x faster than 100% manual entry. The tool doesn't need to be perfect — it needs to be a better starting point than a blank spreadsheet.
The goal isn't to replace the human. It's to do the boring part so the human can focus on the exceptions.
Our invoice extractor runs entirely in the browser. Your PDF never leaves your machine — which matters when you're handling sensitive financial documents.
It uses pdf.js for text extraction, a custom table detection algorithm for layout analysis, and pattern matching for field identification. No AI, no cloud APIs, no data collection. Just JavaScript and your browser.
No signup. No upload. Your files stay on your machine.
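For a flavor of what pattern matching for field identification can look like, here is a small sketch covering two of the problems described earlier: label variations and ambiguous dates. The regex covers only the label variants named in this post, and the function names and return shapes are hypothetical, not the tool's real code.

```javascript
// Invoice-number labels named in this post. \b after "no" keeps words like
// "Notes" from matching. Real invoices need many more variants.
const INVOICE_NUMBER_LABEL =
  /\b(?:invoice\s*(?:#|no\b\.?|number)|inv\.?\s*no\b\.?|reference|document\s*id|bill\s*number)\s*[:#]?\s*([A-Z0-9][A-Z0-9\-\/]*)/i;

function findInvoiceNumber(text) {
  const match = INVOICE_NUMBER_LABEL.exec(text);
  return match ? match[1] : null;
}

// Parse a numeric date such as 04/12/2026 or 12.04.2026. When both leading
// numbers could be a month, the order cannot be known from the string alone,
// so both readings are returned and the result is flagged as ambiguous.
// Days are only range-checked (1-31), not validated against the month.
function parseNumericDate(text) {
  const match = /^(\d{1,2})[\/.\-](\d{1,2})[\/.\-](\d{4})$/.exec(text.trim());
  if (!match) return null;
  const [a, b, year] = match.slice(1).map(Number);
  if (a < 1 || b < 1) return null;
  const iso = (month, day) =>
    `${year}-${String(month).padStart(2, "0")}-${String(day).padStart(2, "0")}`;
  const candidates = [];
  if (a <= 12 && b <= 31) candidates.push(iso(a, b)); // month first
  if (b <= 12 && a <= 31 && a !== b) candidates.push(iso(b, a)); // day first
  if (candidates.length === 0) return null;
  return { candidates, ambiguous: candidates.length > 1 };
}
```

Returning both candidates instead of picking one keeps the decision with whoever has the country context, whether that is a locale setting or the human doing the review.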
We're working on:
If you process invoices and want to save time, give it a try. If it doesn't work on your invoices, tell us — every failure report makes the tool better.