Digitizer
If you can read it, it will be digitized. Even when you can't.
Drop in a PDF or image (printed, scanned, faxed, rotated, multilingual, even handwritten) and get back clean, structured JSON. A hybrid pipeline combines classical OCR (Tesseract OSD, bbox geometry) with a vision-language model (Qwen3-VL-235B), so each step uses the cheapest tool that solves it correctly. Every field comes back with a bounding box, so the UI can highlight exactly what the model saw. Authenticated by API key and agent-ready out of the box.
How it works
Handles anything you point at it
Handwritten cursive, sideways scans, smudged receipts, faxes, multilingual invoices, free-form letters. No template, no schema, no labels required.
Two-tier rotation, never sideways
Tesseract OSD handles the 80% case at negligible cost. A VLM fallback judges orientation only when OSD's confidence drops, so pages always come back upright.
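The two-tier decision can be sketched roughly as follows. This is an illustration, not the project's actual code: it parses the plain-text report that Tesseract's OSD mode prints, and the threshold MIN_OSD_CONFIDENCE, the function names, and the vlm_fallback callable are all assumptions made for the example.

```python
import re
from typing import Callable

# Hypothetical cutoff; the real threshold is not stated in this README.
MIN_OSD_CONFIDENCE = 2.0


def parse_osd(osd_text: str) -> tuple[int, float]:
    """Extract the 'Rotate' angle and 'Orientation confidence' from an OSD report."""
    rotate = re.search(r"^Rotate:\s*(\d+)", osd_text, re.MULTILINE)
    conf = re.search(r"^Orientation confidence:\s*([\d.]+)", osd_text, re.MULTILINE)
    if rotate is None or conf is None:
        raise ValueError("not an OSD report")
    return int(rotate.group(1)), float(conf.group(1))


def choose_rotation(
    osd_text: str,
    vlm_fallback: Callable[[], int],
    min_conf: float = MIN_OSD_CONFIDENCE,
) -> tuple[int, str]:
    """Return (degrees to rotate, tier that decided).

    Tier 1 trusts OSD when its confidence clears the threshold.
    Tier 2 calls the (more expensive) VLM only when OSD is unsure or unparseable.
    """
    try:
        rotate, conf = parse_osd(osd_text)
    except ValueError:
        return vlm_fallback() % 360, "vlm"
    if conf >= min_conf:
        return rotate, "osd"
    return vlm_fallback() % 360, "vlm"
```

Passing the fallback as a callable means the VLM request is made only when the cheap tier declines to answer.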
Tight bounding boxes via a second-pass locator
The VLM's bboxes drift 5-10%. A focused locator call snaps each box to the cell after extraction, so highlights line up with the source.
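One way to picture the second pass, as a sketch under stated assumptions: pad the rough box enough to absorb the drift, hand that crop to the locator, then map the locator's crop-relative answer back to page coordinates. The Box layout, the 10% margin, and the locate callable are hypothetical stand-ins, not the project's actual interface.

```python
from typing import Callable

# (x0, y0, x1, y1), normalized to the 0..1 range. Assumed layout.
Box = tuple[float, float, float, float]


def pad_box(box: Box, margin: float = 0.10) -> Box:
    """Grow a box by `margin` of the page on every side, clamped to the page."""
    x0, y0, x1, y1 = box
    return (
        max(0.0, x0 - margin),
        max(0.0, y0 - margin),
        min(1.0, x1 + margin),
        min(1.0, y1 + margin),
    )


def crop_to_page(crop: Box, inner: Box) -> Box:
    """Map a box expressed relative to `crop` back into page coordinates."""
    cx0, cy0, cx1, cy1 = crop
    width, height = cx1 - cx0, cy1 - cy0
    ix0, iy0, ix1, iy1 = inner
    return (
        cx0 + ix0 * width,
        cy0 + iy0 * height,
        cx0 + ix1 * width,
        cy0 + iy1 * height,
    )


def refine(rough: Box, locate: Callable[[Box], Box]) -> Box:
    """Second pass: `locate` sees only the padded crop and returns a tight box inside it."""
    crop = pad_box(rough)
    return crop_to_page(crop, locate(crop))
```

Because the locator looks at a small crop instead of the full page, its answer has far less room to drift.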
Structured JSON, not OCR text
Every field is named, typed, and tied to coordinates. Items become tables, lists become numbered lists, prose stays prose. Render it, ingest it, agent it.
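To make "named, typed, and tied to coordinates" concrete, here is a hypothetical shape for a single field. The key names (name, type, value, page, bbox) are assumptions chosen for illustration; the authoritative schema is whatever /openapi.json describes.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Field:
    """Illustrative field record; the real response keys may differ."""
    name: str
    type: str    # e.g. "string", "number", "date", "table"
    value: object
    page: int
    bbox: tuple[float, float, float, float]  # assumed (x0, y0, x1, y1), normalized


def to_json(fields: list[Field]) -> str:
    """Serialize fields to a JSON document with a top-level 'fields' array."""
    return json.dumps({"fields": [asdict(f) for f in fields]}, indent=2)


if __name__ == "__main__":
    print(to_json([Field("total", "number", 42.5, 1, (0.1, 0.2, 0.3, 0.4))]))
```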
Agent-ready
API-key login means an LLM agent can authenticate, upload, and consume the JSON without a browser. Swagger UI at /docs, raw spec at /openapi.json.
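A minimal sketch of the first step an agent might take: fetching the raw spec to discover the available endpoints. Only the /openapi.json path comes from this README. The header name X-API-Key and the example host are assumptions, so check /docs for the scheme the service actually uses.

```python
import urllib.request

# Assumed header name; the real authentication scheme is documented at /docs.
API_KEY_HEADER = "X-API-Key"


def spec_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request for the service's OpenAPI spec, carrying the API key."""
    url = base_url.rstrip("/") + "/openapi.json"
    headers = {API_KEY_HEADER: api_key, "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    # Hypothetical host, shown only to illustrate the call shape.
    req = spec_request("https://digitizer.example", "my-secret-key")
    print(req.get_method(), req.full_url)
```

Passing the returned request to urllib.request.urlopen and decoding the body as JSON gives the agent the list of paths and parameters, with no browser session involved.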
Tested at every layer
Rotation detection, bbox geometry, vision locator, network retry, frontend rendering — each step has unit and integration tests. Not a notebook.