Bug: Error handling needs changes #109

Open
joerg-hermanns opened this issue Jan 11, 2025 · 4 comments

Comments

@joerg-hermanns

I discovered this morning that paperless-gpt is more or less stalled on OCR.
It seems this is because it tried to process a document that is too big:

time="2025-01-11T10:54:35Z" level=debug msg="Image dimensions: 12600x16800"
time="2025-01-11T10:54:35Z" level=debug msg="Image size: 15274 KB"
time="2025-01-11T10:54:43Z" level=error msg="Error in processAutoTagDocuments: error in processAutoOcrTagDocuments: error processing document OCR: error performing OCR: error getting response from LLM: API returned unexpected status code: 400: You uploaded an unsupported image. Please make sure your image is valid."

Now it tries to reprocess this document on every run, but obviously the error will not change.
I think we need at least two things here:

  1. A configurable maximum size for an image to be sent to OCR (maybe based on file size instead of pixel dimensions?)
  2. Error handling that retries a limited number of times (for example 5) and then puts that specific document into an error queue, or simply tags it in paperless with a configurable tag (e.g. ai-ocr-failed)

For the moment, my question is: how can I identify exactly which document this is?
I have 953 documents in the processing queue ...

@icereed - Which API query to paperless do you use to get the next document to be processed?

@joerg-hermanns
Author

Addition: I assume it might be one of the documents that already is a JPEG, if that helps for debugging.

@joerg-hermanns
Author

Addition: I got this one on a normal document. It seems the database is overloaded at the moment.
Will paperless-gpt retry this one?

time="2025-01-11T12:41:38Z" level=error msg="Error updating document 2905: 500, \n<!doctype html>\n<html lang="en">\n\n <title>Server Error (500)</title>\n\n\n Server Error (500) \n\n\n"
time="2025-01-11T12:41:38Z" level=error msg="Error in processAutoTagDocuments: error in processAutoOcrTagDocuments: error updating documents: error updating document 2905: 500, \n<!doctype html>\n<html lang="en">\n\n <title>Server Error (500)</title>\n\n\n Server Error (500) \n\n\n"

@joerg-hermanns
Author

Got another one. Any ideas here?

time="2025-01-12T15:20:00Z" level=error msg="Error in processAutoTagDocuments: error in processAutoOcrTagDocuments: error processing document OCR: error downloading document images: fitz: cannot open document"
time="2025-01-12T15:22:51Z" level=debug msg="Found at least 25 remaining documents with tag ai-ocr"
format error: cannot recognize version marker
warning: trying to repair broken xref
warning: repairing PDF document
warning: name is too long
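
The "cannot recognize version marker" message suggests that the downloaded file may not start with a valid PDF header. One possible way to find such files before they reach the PDF renderer is to look at the leading bytes. The sketch below uses `http.DetectContentType` from the Go standard library; `Classify` is a hypothetical helper, and it is an assumption that a check like this would catch the document in question.

```go
package main

import (
	"fmt"
	"net/http"
)

// Classify inspects the leading bytes of a downloaded file and returns
// "pdf", "jpeg" or "other". It only looks at magic bytes and does not
// validate the rest of the file.
func Classify(data []byte) string {
	switch http.DetectContentType(data) {
	case "application/pdf":
		return "pdf"
	case "image/jpeg":
		return "jpeg"
	default:
		return "other"
	}
}

func main() {
	samples := map[string][]byte{
		"pdf header":  []byte("%PDF-1.7\n"),
		"jpeg header": {0xFF, 0xD8, 0xFF, 0xE0},
		"html page":   []byte("<!doctype html><html></html>"),
	}
	for name, data := range samples {
		fmt.Printf("%s: %s\n", name, Classify(data))
	}
}
```

Logging the document ID together with this classification would also help to identify which document in the queue causes the error.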

@icereed
Owner

icereed commented Jan 13, 2025

The first step of enhanced logging and error reporting is implemented in #114.

2 participants