Data Handling Between Paperless-ngx and Paperless AI #374
Replies: 3 comments 2 replies
-
Hey Stefan, it uses the existing OCR data provided by Paperless-ngx. As of now, vision models would only work if they are also capable of following LLM instructions and understanding them.
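To make this concrete: Paperless-ngx stores the OCR'd text of every document in the `content` field of its REST API, so a tool like Paperless AI can read that text instead of touching the original file. A minimal sketch of fetching it (the base URL, document ID, and token below are placeholders, not values from this thread):

```python
# Sketch: reading the OCR text Paperless-ngx already stores for a document,
# via its REST API (GET /api/documents/<id>/ returns JSON with a 'content' field).
import json
import urllib.request


def build_document_request(base_url: str, doc_id: int, token: str) -> urllib.request.Request:
    """Build an authenticated request for a single document's metadata."""
    url = f"{base_url.rstrip('/')}/api/documents/{doc_id}/"
    return urllib.request.Request(url, headers={"Authorization": f"Token {token}"})


def extract_content(document_json: str) -> str:
    """Pull the OCR'd text out of the API response body."""
    return json.loads(document_json).get("content", "")


# Usage (requires a running Paperless-ngx instance and a real API token):
# req = build_document_request("http://localhost:8000", 42, "my-api-token")
# with urllib.request.urlopen(req) as resp:
#     ocr_text = extract_content(resp.read().decode())
```

The text that comes back is exactly what Tesseract produced at intake, which is why the quality concerns discussed below carry straight through to any downstream AI analysis.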
-
Thank you! I was just wondering, because this was not clear. I think it would be a great feature to enable this as an option. For physical receipts submitted as photographs (gas station receipts, restaurant receipts), I get much better results from a vision model (e.g. Gemini) than from Tesseract. A typical example is attached.
-
Jumping on this thread as well, since Stefan describes the same issues I have. Tesseract isn't that great, and third-party OCR plugins for Paperless-ngx still aren't materializing quickly. I've solved this with a custom intake script that pre-processes my documents with Azure's Document Intelligence API before they end up in Paperless.
But what about the documents that already exist in Paperless and have less-than-optimal OCR quality? Maybe one could take the OCR'd text within Paperless and send it to an LLM for a quality analysis (that may be something the LLM can judge from the text content alone, or so I hope). If the LLM decides the OCR quality isn't good, a full re-processing of the document could be triggered, with the goal of extracting a better full-text version. I'd suggest using dedicated OCR APIs for that, such as Azure AI Document Intelligence, Amazon Textract, or, as you mentioned, Gemini.
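Before spending an LLM call (or a paid re-OCR) on every existing document, a cheap local heuristic could pre-filter the obviously bad ones. This is a hedged sketch of that idea, not anything from Paperless AI itself; the threshold and the "normal prose" character set are illustrative assumptions:

```python
# Sketch: flag stored OCR text that looks poor, so only suspicious documents
# get escalated to an LLM quality check or a dedicated OCR API.
# Thresholds are illustrative assumptions, not tuned values.

def ocr_quality_score(text: str) -> float:
    """Fraction of characters that look like normal prose
    (letters, digits, whitespace, common punctuation)."""
    if not text:
        return 0.0
    ok = sum(ch.isalnum() or ch.isspace() or ch in ".,;:!?()-'\"" for ch in text)
    return ok / len(text)


def needs_reprocessing(text: str, threshold: float = 0.9) -> bool:
    """Flag a document for re-OCR when the score drops below the threshold
    or the text is suspiciously short."""
    return len(text.strip()) < 20 or ocr_quality_score(text) < threshold
```

Documents that pass this filter could still be sampled for the LLM-based check; the point is only to avoid sending every clean document through an expensive pipeline.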
-
Hi there,
sorry for asking something that might be obvious, but I couldn't find it.
Does Paperless AI process the original PDFs or images, performing its own OCR via the LLM APIs, or does it use the text already extracted by Paperless-ngx's OCR engine for its AI analysis?
In other words: are files or images transferred to the AI (which would require vision models), or does it only perform additional analysis on the already extracted text?
Thank you!