OCR contents filled with &nbsp #148

Closed
brandondb1 opened this issue Jan 23, 2025 · 4 comments
Comments

@brandondb1

brandondb1 commented Jan 23, 2025

For some of my documents (restaurant receipts and the like, not multi-page documents), the auto-OCR returns part of the text and then fills the rest with repeated '&nbsp'. For some documents, deleting the content and redoing the OCR fixes the problem; for others, it does not. I have changed OCR models (4o to 4o-mini) and, as I said, deleted the content, and the behaviour does not change.

Example, from a till tape from a restaurant called Bone Daddies (the original image is a jpg photo from a phone):

Image

The native OCR from Paperless didn't have problems scanning the image (although it wasn't very accurate).

The Paperless-GPT OCR otherwise works very well (everything I have tried has been a jpg photo), even in rotated images.

Thanks for your great work on this - it is an awesome addition to Paperless!

@icereed
Owner

icereed commented Jan 23, 2025

Hi there, thanks for reporting.

If you want to try custom prompt templates, you could add an instruction to the prompt telling the model not to use '&nbsp' but to print spaces instead.
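A minimal sketch of that idea, in case it helps. The instruction wording and the names `NO_NBSP_INSTRUCTION` and `clean_ocr_text` are hypothetical and not part of paperless-gpt; the cleanup function is only an illustrative fallback for when the model still returns '&nbsp' despite the prompt.

```python
import re

# Hypothetical extra line to append to a custom OCR prompt template.
# The exact wording is an assumption, not the project's default prompt.
NO_NBSP_INSTRUCTION = (
    "Do not output HTML entities such as &nbsp; use plain space characters instead."
)

# Matches "&nbsp;" or "&nbsp" (semicolon optional) and the U+00A0 character itself.
_NBSP = re.compile(r"&nbsp;?|\u00a0")


def clean_ocr_text(text: str) -> str:
    """Replace non-breaking-space markers with spaces and drop the padding they leave.

    Note: runs of spaces are collapsed to one, so column alignment is not preserved.
    """
    replaced = _NBSP.sub(" ", text)
    lines = [re.sub(r"[ \t]+", " ", line).rstrip() for line in replaced.splitlines()]
    return "\n".join(lines).strip()
```

Something like `clean_ocr_text(ocr_result)` could be run on the returned text before saving it, but fixing the prompt is the simpler first step.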

@brandondb1
Author

Ah, that could be it. I had assumed it was a parsing error somewhere in the connection to the API, but I suppose OpenAI could literally be returning '&nbsp'. I'll try that and report back. Thanks!

@brandondb1
Author

I deleted and re-added the image I had the most trouble with, and updated the template as you suggested. When I ran it again, it worked perfectly, so either the template change or the re-upload fixed the problem. I'd imagine the prompt did the trick, as it was literally the same image, downloaded from Paperless and then re-uploaded. I'll try more and see, but it seems to be fixed for now.

Thanks for the quick response.

@icereed
Owner

icereed commented Jan 23, 2025

Awesome! Could you share the prompt you are using now? It might make sense to include it in the default prompt.

@icereed icereed closed this as completed Jan 25, 2025