Ocrmypdf fails due to Tesseract failed to report available languages
#2504
Text extraction and auto tagging seems to work fine, it's just the pdf with selectable text is missing, which I like to use. |
Thanks for reporting! Is this the docker setup? or how do you run docspell? |
Hi @pschichtel - it looks like according to your output that the PDF conversion failed, and you said that you've scanned many documents successfully before. It sounds like this is an issue specifically with converting this document. I encountered a similar issue. For scanning this PDF, let's try editing your configuration a bit. In the
After editing so it appears similar to the excerpt above, restart docspell-joex.
Try rescanning in the document and see if this line from your failed job disappears: Let us know if that worked. It would be good to know if using |
I'm running it in kubernetes with my own helm chart. @tenpai-git I can check that later today. Just one thing I want to add: I don't think it's an issue with this specific document for 2 reasons:
Can I safely downgrade 0.41.0 to 0.40.0 ? |
Thanks for getting back to me @pschichtel - it may certainly be the case that it's another element of docspell now being presented with these new reasons. I'm not sure about the downgrade, but why don't you install
Depending on the file you might need to add Try on both a known working previous document and the new document and see if there's any difference. |
@pschichtel Rather than downgrade by using a previous database backup and the previous version, maybe try upgrading to nightly 0.4.2 version? I am using PostgreSQL and was actually noticing a similar problem with a couple pdfs as I was testing, but then upgrading resolved it for me. Please let me know if the other test had any different results. |
@tenpai-git I tried upgrading to nightly, that didn't change anything sadly. I tried the command from the job log on my system and that worked without issues. The version of ocrmypdf from the arch AUR is |
Sadly the joex container doesn't build for commits older than bb181f1 and that commit is already broken for me. |
Ok I worked around the build issue and bisected the problem to 90972a0, which is the alpine image update. When I build the image from master with the base image changed to the previous So I assume some dependencies are somehow incompatible in alpine:edge. It doesn't seem like any of the directly installed alpine packages have any major releases/changes between 3.19.1 and edge. |
Hi @pschichtel and @tenpai-git thanks for taking a deep look here. I can tell that maintaining the docker images is a real pain for me. One mistake was to have alpine edge as the base image. Don't remember why that is, actually. I had many problems with ocrmypdf on alpine in general. I want now to pull that docker image stuff outside the repo, because I just don't have the time to hunt down these things so often. Another option I was thinking about is to provide docker images based on the nix setup. Anyways, perhaps as mentioned in #2502 (comment), it might be good to move kubernetes + docker to a separate repo, where people with better knowledge in that space can operate. |
Ok I can ACK that the tesseract issue is happening within the current docker build because of error messages written to stdout starting with Inside the joex container you get something like this calling the tesseract binary:
Unfortunately there are lines starting with "Error" so that |
If you run the Somehow the recent Alpine version of However on a fresh Docspell run that profile is missing... To solve this issue:
Workaround
Patch
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import re
import sys
from ocrmypdf.__main__ import run
if __name__ == "__main__":
    from os import chdir
    chdir("/tmp")
    sys.argv[0] = re.sub(r"(-script\.pyw|\.exe)?$", "", sys.argv[0])
    sys.exit(run())
Replace Any further Docspell PDF/A calls will now find that *.dat file and processing will work as expected. |
a script executed from the entrypoint might be a reasonable place to put the initial |
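The entrypoint idea can be sketched roughly as follows. This is a minimal sketch, not part of Docspell: `warm_up_profile` and `PROFILE_NAME` are hypothetical names, and it assumes Tesseract writes the profile file into its current working directory, as the `chdir("/tmp")` workaround in this thread suggests. `tesseract --list-langs` is the call OCRmyPDF relies on to discover languages.

```python
import subprocess
from pathlib import Path

# File name taken from the Tesseract log output in this thread.
PROFILE_NAME = "tesseract_opencl_profile_devices.dat"


def warm_up_profile(workdir, run=subprocess.run):
    """Run Tesseract once in `workdir` unless the profile file already exists.

    Assumption: the profile is written to the current working directory.
    Returns True if Tesseract was invoked, False if the file was present.
    """
    workdir = Path(workdir)
    if (workdir / PROFILE_NAME).exists():
        return False  # profile already present, nothing to do
    # check=False: the first run may print errors while it profiles.
    run(
        ["tesseract", "--list-langs"],
        cwd=workdir,
        check=False,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return True
```

Calling `warm_up_profile("/tmp")` once at container start would only help if later OCR jobs also run with `/tmp` as their working directory, which is what the patched wrapper above arranges.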
Yea maybe that would speed up the OCR stuff.. however as I understand |
setting ENV |
opencl packages seem to be a mess on alpine, there is only really rusticl which doesn't seem to implement what tesseract requires. @eikek If we don't know a reason for going with alpine:edge (given how old that change is, I assume whatever dependency update was desired is probably already released in alpine:3), can we just revert this commit until the whole community-managed docker idea is implemented and "better" images are provided? |
I tried that and still the profile *.dat is required and aborts processing on first run... 114caa7b8fe6:/tmp# ocrmypdf -l deu --skip-text --deskew -j 1 test.pdf hello.pdf
Tesseract failed to report available languages. __main__.py:69
Output from Tesseract:
-----------
[DS] Profile file not available (tesseract_opencl_profile_devices.dat); performing profiling.
[DS] Device: "(null)" (Native) evaluation...
Error in pixCloseBrick: pixs not 1 bpp
Error in pixOpenBrick: pixs not defined
Error in pixSubtract: pixs1 not defined
Error in pixOpenBrick: pixs not defined
Error in pixOpenBrick: pixs not defined
[DS] Device: "(null)" (Native) evaluated
[DS] composeRGBPixel: 0.017175 (w=1.2)
[DS] HistogramRect: 0.088656 (w=2.4)
[DS] ThresholdRectToPix: 0.033539 (w=4.5)
[DS] getLineMasksMorph: 0.000053 (w=5.0)
[DS] Score: 0.384571
[DS] Scores written to file (tesseract_opencl_profile_devices.dat).
[DS] Device[1] 0:(null) score is 0.384571
[DS] Selected Device[1]: "(null)" (Native)
[DS] Overriding Device Selection (TESSERACT_OPENCL_DEVICE=1, 1)
[DS] Overridden Device[1]: "(null)" (Native)
List of available languages in "/usr/share/tessdata/" (23):
ces |
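The log above shows the language list arriving only after the profiling messages and the Leptonica "Error" lines. As a rough illustration (this is not OCRmyPDF's actual parser; `parse_languages` is a hypothetical helper, and the sample is abridged with `deu` and `eng` added as assumed entries), a tolerant reader could skip everything before the list header:

```python
def parse_languages(output: str) -> list[str]:
    """Return language codes from `tesseract --list-langs`-style output.

    Hypothetical helper: every line before the
    "List of available languages" header is treated as noise.
    """
    langs: list[str] = []
    seen_header = False
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("List of available languages"):
            seen_header = True
            continue
        if seen_header and line:
            langs.append(line)
    return langs


# Abridged sample modelled on the log in this thread.
sample = """\
[DS] Profile file not available (tesseract_opencl_profile_devices.dat); performing profiling.
Error in pixCloseBrick: pixs not 1 bpp
[DS] Selected Device[1]: "(null)" (Native)
List of available languages in "/usr/share/tessdata/" (23):
ces
deu
eng
"""

print(parse_languages(sample))
```

A parser that instead treats any line beginning with "Error" as a failure would reject this output even though the language list is intact.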
Sure! Whatever makes this part easier is a plus for me. I can't remember why it is set to alpine:edge. I can't think of a reason why I would do it. Perhaps it was some missing dependency/newer version. If I understand your analysis, tesseract (now?) needs a separate file that it will create on a first run? If that is so, I think docspell needs to provide some kind of non-volatile cache place for such things. Of course, this is a bit unfortunate from docspell's point of view, because it is now more tricky to maintain. Perhaps tesseract could be configured to use a specific directory. |
The commit that switched to edge is from may last year. Alpine 3.19.1 is from end of last month, so I don't see any risk here. A patch release with the revert would be very appreciated.
It's weird. The alpine edge version of tesseract is 5.3.4 compared to 5.3.3 in stable, so just a patch version difference. Nothing in the diff seems related. Also the package build between stable and edge is basically identical (compare stable and edge). I took the 3.19.1 based container I built locally and just upgrade tesseract-ocr to edge and the problem started. I assume something in the build environment of tesseract changed with edge. |
I think the culprit is here: #2066 so it's not your fault @eikek :) The Renovate docs say that major release changes for Dockerfiles must be explicitly activated, but the PR summary created by Renovate shows a different picture. As @pschichtel suggested I think the best solution without any hacks would be to tag the docker base image using I compared some build logs for alpine 3 branch vs. edge branch related to tesseract package and there it seems that the stable branch didn't use the At work my team also uses Renovate, but we additionally pin the docker image by hash in case someone pushes a new image under the same version tag. Renovate detects this and creates a PR with the updated docker image digest. The idea behind pinning your image with a digest is to have immutable builds that use exactly the same base image as previous builds, even if the registry image was updated behind the scenes. |
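To make the difference between a floating tag, a version tag and a digest pin concrete, here is a small sketch. `classify_base_image` is a hypothetical helper, and treating only `latest` and `edge` as floating is an illustrative assumption; a tag such as `alpine:3` also moves between minor releases.

```python
import re

# A digest pin ends in "@sha256:" followed by 64 hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def classify_base_image(from_line: str) -> str:
    """Classify a Dockerfile FROM line as 'digest', 'tag' or 'floating'."""
    parts = from_line.split()
    if len(parts) < 2 or parts[0].upper() != "FROM":
        raise ValueError(f"not a FROM line: {from_line!r}")
    # Skip options such as --platform=linux/amd64.
    args = [p for p in parts[1:] if not p.startswith("--")]
    if not args:
        raise ValueError(f"no image in FROM line: {from_line!r}")
    image = args[0]
    if DIGEST_RE.search(image):
        return "digest"
    # Look for the tag after the last path segment, so a registry
    # port (localhost:5000/alpine) is not mistaken for a tag.
    name = image.rsplit("/", 1)[-1]
    tag = name.split(":", 1)[1] if ":" in name else "latest"
    return "floating" if tag in {"latest", "edge"} else "tag"


for line in ("FROM alpine:edge", "FROM alpine:3.19.1"):
    print(line, "->", classify_base_image(line))
```

A check like this could run in CI to flag a base image that drifts back to `edge`, while Renovate keeps the tag or digest itself up to date.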
Ah amazing thank you for digging this out :) Still should know how these tools work ;-) I totally agree to use a stable tag for the image! I didn't know that renovate would still update minor and patch, I think this is perfect then. I would be also fine with pinning it using a hash. Thanks a lot to both of you for analysing this so well! 💯 |
I'm thinking about creating a new docker image manually (0.41.0-1 or similar) with alpine:3 (or hash, whatever you prefer) and the other fix regarding tesseract language (#2479). |
See #2504, alpine edge introduced a version of tesseract that is problematic to use from within docspell
I now changed the base image as suggested to 3.19.1 here - do you think this is enough to fix this immediate problem? Also, I assume when alpine edge becomes stable, we need to deal with this dat file somehow right? |
I pushed new images, unfortunately my brain didn't work so well and I pushed under the same tags... I wanted to create different version, of course. Curious to see if that helps now. |
yep, it works again. |
Should we close this or keep it open to discuss solutions for the issue once it comes back in the future? |
The images makes choices for the user on which languages are installed. Do we want to add more languages or is the default ok? We can continue the discussion here. |
If no one has an issue with the resulting image size, we might as well install all languages automatically. I'd have to see how large the image would be. Alternatively, some option could be added to install additional languages during startup, but those would be installed on every startup. |
I don't have any issues with the image being a little bit bigger. If you update your docker image then docker compose will first pull the new image and will then replace the existing one with minimal to no downtime. |
I'm also fine with a bigger image size. the main intention for the docker images was to have a convenient start with docspell. |
Interesting finding: the ocrmypdf image is also broken once updated to alpine 3.20, which is not surprising given the issue is tesseract and not ocrmypdf. I'll create an issue over there to discuss this, because I assume this will eventually affect them too. |
a first version of the new image is available at https://github.com/docspell/docker/pkgs/container/joex it's based on ubuntu:24.10 and docspell 0.42.0 and installs all available tesseract packages ubuntu provides. |
@pschichtel your link does not work (yet). On what branch did you make this change? For future reference, this is what OCRmyPDF says in its alpine image.
|
@tiborrr unfortunately the package has been set to private automatically and I can't change that. I also received a response on my issue at ocrmypdf: ocrmypdf/OCRmyPDF#1395 (comment) So our options for a quickfix here would be: switch to 3.19.* or switch to edge. My container uses ubuntu. |
For quickly fixing the docker images, I could do again a rebuild using alpine 3.19.1 (as this has been working) wdyt? maybe there is a new docker build for the next release then |
That would be the quickest and easiest fix I guess. You have my blessing to do so. Then we can figure out a new strategy in the meantime |
@tiborrr the ubuntu-based image I referenced is accessible now |
@pschichtel I will test Monday, I only have access to test servers during office hours 😅 |
@pschichtel I tested your Ubuntu-based image today and can confirm that Tesseract works as expected (converting scans to text now works like a blitz). |
I've now also successfully re-processed a bunch of documents with my ubuntu-based container, which worked without issues. I think the container is a drop-in replacement. |
Oh I'm sorry, totally forgot about this issue 😞 Could still do the image update or just switch to the ubuntu based image? |
@eikek what do you mean? I switched to the Ubuntu-based image from https://github.com/docspell/docker and that works fine. The question is: Could this image become the "official" one? |
@pschichtel yes that is what I meant/asked. I'd like to remove the docker stuff from this repository in favor of docspell/docker. Do you think it is ready? Then we could update the docs etc. |
I guess I should first also migrate the restserver image, so that both images could be migrated. I'm also not sure what your stance is on hosting the images only on ghcr.io or if you also want them to exist on dockerhub, but then you'd have to provide credentials for that. |
I don't care at all. For me ghcr is totally fine. I would also give the dockerhub credentials, it doesn't matter to me. I think if you are not far from migrating the restserver image, then I'd wait and rather update the docs. But I'm also happy to generate another image for joex and upload it for the meantime. |
I can probably do the restserver image later this evening, for the beginning it would basically be more of the same. I have ideas for some changes, but those can wait (and could use some discussion). |
ok cool, but no rush, just take your time. it doesn't need to be this evening at all. |
I published the restserver image and my deployment is running on it |
closing this issue as it works with the new image found in the docker repo at https://github.com/docspell/docker
|
true |
@pschichtel, @eikek
I am also getting this error. No idea what I need to do. Many thanks
|
Tesseract failed to report available languages
@Fredo70 the docker-compose is still using the dockerhub images, did you switch to the ghcr.io/docspell/* images? If not, please do so. |
I'm on version 0.41.0 and I just noticed that I can't select text in my imported PDF (a scanned document).
Looking at the job log I found this:
I don't think I ever saw this error when importing my ~1000 documents.