Ocrmypdf fails due to Tesseract failed to report available languages #2504

Closed
pschichtel opened this issue Feb 20, 2024 · 61 comments
Labels
docker All things regarding docker setup

Comments

@pschichtel
Contributor

I'm on version 0.41.0 and I just noticed that I can't select text in my imported PDF (a scanned document).

Looking at the job log I found this:

Tue, February 20th, 2024, 21:03: Running external command: ocrmypdf -l deu --skip-text --deskew -j 1 /tmp/docspell-convert/docspell-ocrmypdf17124542125895878539/infile /tmp/docspell-convert/docspell-ocrmypdf17124542125895878539/out.pdf
Tue, February 20th, 2024, 21:03: Command `ocrmypdf -l deu --skip-text --deskew -j 1 /tmp/docspell-convert/docspell-ocrmypdf17124542125895878539/infile /tmp/docspell-convert/docspell-ocrmypdf17124542125895878539/out.pdf` finished: 3
Tue, February 20th, 2024, 21:03: ocrmypdf stdout:
Tue, February 20th, 2024, 21:03: ocrmypdf stderr: Tesseract failed to report available languages. Output from Tesseract: ----------- [DS] Profile file not available (tesseract_opencl_profile_devices.dat); performing profiling. [DS] Device: "(null)" (Native) evaluation... Error in pixCloseBrick: pixs not 1 bpp Error in pixOpenBrick: pixs not defined Error in pixSubtract: pixs1 not defined Error in pixOpenBrick: pixs not defined Error in pixOpenBrick: pixs not defined [DS] Device: "(null)" (Native) evaluated [DS] composeRGBPixel: 0.017794 (w=1.2) [DS] HistogramRect: 0.015793 (w=2.4) [DS] ThresholdRectToPix: 0.025850 (w=4.5) [DS] getLineMasksMorph: 0.000040 (w=5.0) [DS] Score: 0.175782 [DS] Scores written to file (tesseract_opencl_profile_devices.dat). [DS] Device[1] 0:(null) score is 0.175782 [DS] Selected Device[1]: "(null)" (Native) List of available languages in "/usr/share/tessdata/" (23): ces dan deu est fin fra heb ita jpn jpn_vert khm lav lit nld nor pol por ron rus slk spa swe ukr
Tue, February 20th, 2024, 21:03: PDF conversion failed: Command result=3. No output file found.. Go without PDF file
Tue, February 20th, 2024, 21:03: Closing process: `ocrmypdf -l deu --skip-text --deskew -j 1 /tmp/docspell-convert/docspell-ocrmypdf17124542125895878539/infile /tmp/docspell-convert/docspell-ocrmypdf17124542125895878539/out.pdf`

I don't think I ever saw this error when importing my ~1000 documents.

@pschichtel
Contributor Author

Text extraction and auto tagging seems to work fine, it's just the pdf with selectable text is missing, which I like to use.

@eikek
Owner

eikek commented Feb 20, 2024

Thanks for reporting! Is this the docker setup? or how do you run docspell?

@tenpai-git
Contributor

tenpai-git commented Feb 21, 2024

Hi @pschichtel - according to your output, the PDF conversion failed, and you said that you've scanned many documents successfully before.

It sounds like this is an issue specifically with converting this document. I encountered a similar issue.

For scanning this PDF, let's try editing your configuration a bit. In the /etc/docspell-joex/docspell-joex.conf config, try adding "--output-type", "pdf", to the options (this should come after --skip-text) and then go ahead and restart docspell-joex.

    # The `--skip-text` option is necessary to not fail on "text" pdfs
    # (where ocr is not necessary). In this case, the pdf will be
    # converted to PDF/A.
    ocrmypdf = {
      enabled = true
      command = {
        program = "ocrmypdf"
        args = [
          "-l", "{{lang}}",
          "--skip-text",
          "--deskew",
          "--output-type", "pdf",
          "-j", "1",
          "{{infile}}",
          "{{outfile}}"
        ]

After editing so it appears similar to the excerpt above, restart docspell-joex.

sudo systemctl restart docspell-joex or use an equivalent command if on docker.

Try rescanning in the document and see if this line from your failed job disappears:
Tue, February 20th, 2024, 21:03: PDF conversion failed: Command result=3. No output file found.. Go without PDF file

Let us know if that worked. It would be good to know if using "--output-type", "pdf", was a better default than PDF/A. Similar to #2486
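
To check a saved job log for that failure line without reading it by eye, a small sketch like the following might help. It assumes the job log was exported as plain text in the format shown above; the function name `find_conversion_failures` is made up for illustration and is not part of docspell.

```python
import re

# Matches the failure line that appears in the job log quoted above.
FAILURE_RE = re.compile(r"PDF conversion failed: Command result=(\d+)\.")


def find_conversion_failures(job_log: str) -> list[int]:
    """Return the exit codes of all failed PDF conversions found in a job log."""
    return [int(match.group(1)) for match in FAILURE_RE.finditer(job_log)]
```

An empty list means the line is gone; any number in the list is the exit code ocrmypdf returned for a failed conversion.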

@pschichtel
Contributor Author

> Thanks for reporting! Is this the docker setup? or how do you run docspell?

I'm running it in kubernetes with my own helm chart.

@tenpai-git I can check that later today. Just one thing I want to add: I don't think it's an issue with this specific document for 2 reasons:

  1. all scanned documents in my docspell are from the same exact scanner with the same configuration, so the documents should be identical format-wise.
  2. all new scanned documents recently are affected (I think it might be correlated to the docspell upgrade, but I'm not entirely sure)

Can I safely downgrade 0.41.0 to 0.40.0 ?

@tenpai-git
Contributor

tenpai-git commented Feb 21, 2024

Thanks for getting back to me @pschichtel - given these new reasons, it may well be that another part of docspell is causing this. I'm not sure about the downgrade, but why don't you install ocrmypdf locally and give it a try, to see whether we can isolate the issue to ocrmypdf itself?

ocrmypdf -l deu ./input_pdf.pdf ./output.pdf --output-type pdf

Depending on the file you might need to add --skip-text flag to the above command as well.

Try on both a known working previous document and the new document and see if there's any difference.
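
The comparison could be scripted roughly like this. It is only a sketch: it assumes ocrmypdf is on the PATH, and the helper names and any file names passed in are hypothetical. The flags are the ones from the command above.

```python
import subprocess


def build_ocrmypdf_command(infile, outfile, lang="deu", skip_text=False):
    """Build the ocrmypdf command line suggested above as an argument list."""
    cmd = ["ocrmypdf", "-l", lang, "--output-type", "pdf"]
    if skip_text:
        # Needed for PDFs that already contain a text layer.
        cmd.append("--skip-text")
    cmd += [infile, outfile]
    return cmd


def try_conversion(infile, outfile, **kwargs):
    """Run ocrmypdf once and return its exit code (0 means success)."""
    return subprocess.run(build_ocrmypdf_command(infile, outfile, **kwargs)).returncode
```

Calling `try_conversion` once for a known working document and once for the new one shows whether the exit codes differ.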

@tenpai-git
Contributor

@pschichtel Rather than downgrade by using a previous database backup and the previous version, maybe try upgrading to nightly 0.4.2 version? I am using PostgreSQL and was actually noticing a similar problem with a couple pdfs as I was testing, but then upgrading resolved it for me.

Please let me know if the other test had any different results.

@pschichtel
Contributor Author

pschichtel commented Feb 22, 2024

@tenpai-git I tried upgrading to nightly, that didn't change anything sadly. I tried the command from the job log on my system and that worked without issues. The version of ocrmypdf from the arch AUR is 16.1.1 while the version in the joex container is 15.4.2. I tried the same with docker.io/jbarlow83/ocrmypdf:v15.4.2 which also worked without issue, so it really seems like something is off with the joex image. I'll play around with the joex images.

@pschichtel
Contributor Author

Sadly the joex container doesn't build for commits older than bb181f1 and that commit is already broken for me.

@pschichtel
Contributor Author

pschichtel commented Feb 23, 2024

Ok I worked around the build issue and bisected the problem to 90972a0, which is the alpine image update. When I build the image from master with the base image changed to the previous alpine:3 (or the more specific alpine:3.19.1) it also works again.

So I assume some dependencies are somehow incompatible in alpine:edge. It doesn't seem like any of the directly installed alpine packages have any major releases/changes between 3.19.1 and edge.

@eikek
Owner

eikek commented Feb 25, 2024

Hi @pschichtel and @tenpai-git thanks for taking a deep look here. I can tell that maintaining the docker images is a real pain for me. One mistake was to have alpine edge as the base image. Don't remember why that is, actually. I had many problems with ocrmypdf on alpine in general. I want now to pull that docker image stuff outside the repo, because I just don't have the time to hunt down these things so often. Another option I was thinking about is to provide docker images based on the nix setup. Anyways, perhaps as mentioned in #2502 (comment), it might be good to move kubernetes + docker to a separate repo, where people with better knowledge in that space can operate.

eikek added the docker label (All things regarding docker setup) on Feb 25, 2024
@nekrondev
Contributor

Ok I can ACK that the tesseract issue is happening within the current docker build because of error messages written to stdout starting with Error (see https://github.com/ocrmypdf/OCRmyPDF/blob/16ab4a8b4ec82175880f235953d99e9c5265b634/src/ocrmypdf/_exec/tesseract.py#L130).

Inside the joex container you get something like this calling the tesseract binary:

3486937fbd9c:~# tesseract --list-langs
[DS] Profile file not available (tesseract_opencl_profile_devices.dat); performing profiling.

[DS] Device: "(null)" (Native) evaluation...
Error in pixCloseBrick: pixs not 1 bpp
Error in pixOpenBrick: pixs not defined
Error in pixSubtract: pixs1 not defined
Error in pixOpenBrick: pixs not defined
Error in pixOpenBrick: pixs not defined
[DS] Device: "(null)" (Native) evaluated
[DS]          composeRGBPixel: 0.019891 (w=1.2)
[DS]            HistogramRect: 0.093792 (w=2.4)
[DS]       ThresholdRectToPix: 0.048494 (w=4.5)
[DS]        getLineMasksMorph: 0.000075 (w=5.0)
[DS]                    Score: 0.467569
[DS] Scores written to file (tesseract_opencl_profile_devices.dat).
[DS] Device[1] 0:(null) score is 0.467569
[DS] Selected Device[1]: "(null)" (Native)
List of available languages in "/usr/share/tessdata/" (23):
ces
dan
deu
est

Unfortunately there are lines starting with "Error", so ocrmypdf thinks there is serious trouble getting the installed languages and aborts the OCR process.
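
The check described here can be illustrated with a short sketch. This is an approximation of the behavior described above, not a copy of OCRmyPDF's code; the function name `parse_list_langs` is hypothetical.

```python
def parse_list_langs(output: str) -> set[str]:
    """Parse `tesseract --list-langs` output.

    Any line starting with "Error" is treated as a failure, which is why
    the OpenCL profiling noise shown above breaks the language lookup.
    """
    lines = output.splitlines()
    if any(line.startswith("Error") for line in lines):
        raise RuntimeError("Tesseract failed to report available languages")
    # Everything after the "List of available languages" header is a language code.
    for index, line in enumerate(lines):
        if line.startswith("List of available languages"):
            return {lang.strip() for lang in lines[index + 1:] if lang.strip()}
    return set()
```

With clean output the language set is returned; with the profiling output from the container the function raises, even though the language list itself is present and correct.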

@nekrondev
Contributor

nekrondev commented Feb 28, 2024

If you run the ocrmypdf CLI manually inside the container a second time, the tesseract_opencl_profile_devices.dat written during the first execution (the one with the Error lines on stdout) is found and processing finally succeeds.

Somehow the recent Alpine version of tesseract was compiled with opencl support. Now that feature first tries to locate GPU drivers and does some profiling. The result is written into the *.dat file (see https://github.com/tesseract-ocr/tesseract/blob/94bd98b7ef8e05319301a4879fbc10d11d68ebc7/src/opencl/openclwrapper.cpp#L2357).

However on a fresh Docspell run that profile is missing... tesseract starts looking for OpenCL... it finds some errors... but eventually writes that damn *.dat file aborting the processing... that leads to ocrmypdf returning exit code 3 and aborting the mess it created. Docspell receives exit code 3 and aborts PDF/A processing.

To solve this issue:

  • Docspell needs to CWD into /tmp folder before executing the ocrmypdf CLI (see my temporary workaround down below) and add or create the required *.dat file inside the Dockerfile when building the image OR
  • docker image must use non-OpenCL compiled tesseract binary

Workaround

Patch ocrmypdf to change the working directory to /tmp where the *.dat file will be stored (i.e. outside the volatile convert directory Docspell is removing automatically).

#!/usr/bin/python3
# -*- coding: utf-8 -*-
import re
import sys
from ocrmypdf.__main__ import run
if __name__ == "__main__":
    from os import chdir
    chdir("/tmp")
    sys.argv[0] = re.sub(r"(-script\.pyw|\.exe)?$", "", sys.argv[0])
    sys.exit(run())

Replace /usr/bin/ocrmypdf with the above patched version. I added the chdir() method to change to /tmp folder.
The first execution of ocrmypdf from Docspell process will fail, but you'll find the *.dat file inside /tmp folder. Alternatively you can cd /tmp && tesseract --list-langs inside the container to create the required profile file before processing your scans by Docspell preventing first time failure.

Any further Docspell PDF/A calls will now find that *.dat file and processing will work as expected.
Beware that the /tmp folder is volatile: if you re-create the vanilla container, the *.dat file is lost and needs to be re-created on the first PDF/A run (which will fail). So it is better to patch, add the *.dat file, and commit your changes as an updated local container image.
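
The manual `cd /tmp && tesseract --list-langs` step could also be scripted, for example at container startup. The following is a minimal sketch under the assumptions stated in this comment (tesseract writes the profile into its current working directory); the function name `warm_up_profile` and the `command` parameter are hypothetical and exist only to make the sketch easy to try out.

```python
import subprocess
from pathlib import Path

PROFILE_NAME = "tesseract_opencl_profile_devices.dat"


def warm_up_profile(workdir="/tmp", command=("tesseract", "--list-langs")):
    """Run tesseract once in `workdir` so the profile file is created there.

    Returns True if the profile file exists afterwards.
    """
    profile = Path(workdir) / PROFILE_NAME
    if not profile.exists():
        try:
            # The exit status is ignored on purpose: only the side effect
            # (the profile file written to the working directory) matters.
            subprocess.run(
                list(command),
                cwd=workdir,
                check=False,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except FileNotFoundError:
            # The binary is not installed; nothing could be created.
            return False
    return profile.exists()
```

This only helps together with the patched ocrmypdf above, because the later OCR runs must use the same working directory to find the file.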

@pschichtel
Contributor Author

A script executed from the entrypoint might be a reasonable place to put the initial tesseract --list-langs. On the other hand: wouldn't it be easier to just install the necessary driver components in the container? I have a GPU available that I could use with this, might be interesting to try.

@nekrondev
Contributor

Yea, maybe that would speed up the OCR stuff. However, as I understand the tesseract code, the profile *.dat file is loaded at startup and, if not found, a fresh one is created. This on the other hand is a problem for Docspell, as the converter temp folders (containing the initially created *.dat file) are removed, so the tesseract process will never start OCRing your file on the next run. In the end the *.dat file must be created beforehand (at container startup is a good idea) and kept outside the volatile folders removed by Docspell. The tesseract CLI simply looks inside the current working directory for the *.dat file, which is why the patched ocrmypdf changes it to /tmp. Docspell, I guess, is CWDing into the random volatile /tmp/docspell-converter/... folder where the scan and its OCRed output will be located.

@pschichtel
Contributor Author

setting ENV TESSERACT_OPENCL_DEVICE=1 could also fix the issue. tesseract will still do its device profiling every time, but since an explicit choice is given by env it will not fail.

@pschichtel
Contributor Author

opencl packages seem to be a mess on alpine, there is only really rusticl which doesn't seem to implement what tesseract requires.

@eikek If we don't know a reason for going with alpine:edge (given how old that change is, I assume whatever dependency update was desired is probably already released in alpine:3), can we just revert this commit until the whole community managed docker idea is implemented and "better" images are provided?

@nekrondev
Contributor

> setting ENV TESSERACT_OPENCL_DEVICE=1 could also fix the issue. tesseract will still do its device profiling every time, but since an explicit choice is given by env it will not fail.

I tried that and still the profile *.dat is required and aborts processing on first run...

114caa7b8fe6:/tmp# ocrmypdf -l deu --skip-text --deskew -j 1 test.pdf hello.pdf
Tesseract failed to report available languages.                                                                                                                                                                                __main__.py:69
Output from Tesseract:
-----------
[DS] Profile file not available (tesseract_opencl_profile_devices.dat); performing profiling.

[DS] Device: "(null)" (Native) evaluation...
Error in pixCloseBrick: pixs not 1 bpp
Error in pixOpenBrick: pixs not defined
Error in pixSubtract: pixs1 not defined
Error in pixOpenBrick: pixs not defined
Error in pixOpenBrick: pixs not defined
[DS] Device: "(null)" (Native) evaluated
[DS]          composeRGBPixel: 0.017175 (w=1.2)
[DS]            HistogramRect: 0.088656 (w=2.4)
[DS]       ThresholdRectToPix: 0.033539 (w=4.5)
[DS]        getLineMasksMorph: 0.000053 (w=5.0)
[DS]                    Score: 0.384571
[DS] Scores written to file (tesseract_opencl_profile_devices.dat).
[DS] Device[1] 0:(null) score is 0.384571
[DS] Selected Device[1]: "(null)" (Native)
[DS] Overriding Device Selection (TESSERACT_OPENCL_DEVICE=1, 1)
[DS] Overridden Device[1]: "(null)" (Native)
List of available languages in "/usr/share/tessdata/" (23):
ces

@eikek
Owner

eikek commented Feb 28, 2024

> @eikek If we don't know a reason for going with alpine:edge (given how old that change is, I assume whatever dependency update was desired is probably already released in alpine:3), can we just revert this commit until the whole community managed docker idea is implemented and "better" images are provided?

Sure! Whatever makes this part easier is a plus for me. I can't remember why it is set to alpine:edge. I can't think of a reason why I would do it. Perhaps it was some missing dependency/newer version.

If I understand your analysis, tesseract (now?) needs a separate file that it will create on a first run? If that is so, I think docspell needs to provide some kind of non-volatile cache place for such things. Of course, this is a bit unfortunate from docspells point of view, because it is now more tricky to maintain. Perhaps tesseract could be configured to use a specific directory.

@pschichtel
Contributor Author

> Sure! Whatever makes this part easier is a plus for me. I can't remember why it is set to alpine:edge. I can't think of a reason why I would do it. Perhaps it was some missing dependency/newer version.

The commit that switched to edge is from May last year. Alpine 3.19.1 is from the end of last month, so I don't see any risk here. A patch release with the revert would be very appreciated.

> If I understand your analysis, tesseract (now?) needs a separate file that it will create on a first run? If that is so, I think docspell needs to provide some kind of non-volatile cache place for such things. Of course, this is a bit unfortunate from docspells point of view, because it is now more tricky to maintain. Perhaps tesseract could be configured to use a specific directory.

It's weird. The alpine edge version of tesseract is 5.3.4 compared to 5.3.3 in stable, so just a patch version difference. Nothing in the diff seems related. Also the package build between stable and edge is basically identical (compare stable and edge). I took the 3.19.1 based container I built locally and just upgraded tesseract-ocr to edge and the problem started. I assume something in the build environment of tesseract changed with edge.

@nekrondev
Contributor

I think the culprit is here: #2066 so it's not your fault @eikek :)
Renovate bot somehow managed to automerge a major release change from 3 to 20230329 at the time of the PR. This change, be it a Renovate bug or image retagging whatsoever, switched from the 3 branch to edge branch with further major edge branch updates.

Renovate docs tell that major release changes for Dockerfiles must be explicitly activated but the PR summary created by Renovate shows a different picture.

As @pschichtel suggested, I think the best solution without any hacks would be to tag the docker base image using a major.minor.patch semantic version. In that case, when you use FROM alpine:3.19.1, Renovate should only upgrade minor and patch-level versions, if we believe the docs.

I compared some build logs for alpine 3 branch vs. edge branch related to the tesseract package, and it seems that the stable branch didn't use the --enable-opencl flag vs. edge enabling it. Tesseract sadly has no option to disable opencl support, so you have to decide at compile time if you want to use it or not. Once activated this becomes an issue for Docspell because tesseract creates that profile *.dat file in the current working directory. Again there is no option for the tesseract CLI to change that profile location, so ocrmypdf would need to change its CWD to allow tesseract to save its profile file inside the current working directory (that's what I patched with my ugly workaround).

At work my team also uses Renovate, but we additionally pin the docker image by hash, just in case someone pushes a new image using the same version tag. Renovate detects this and creates a PR with the updated docker image fingerprint. The idea behind pinning your image with a digest is to have immutable builds that use the exact same base image as previous builds, even if the registry image was updated behind the scenes.
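
To make the difference between a floating tag, a full version tag, and a digest pin concrete, here is a rough sketch of a check one could run over a Dockerfile's FROM lines. The function name `classify_base_image` and the three category names are made up for this illustration; it is not something Renovate provides.

```python
import re

DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")
FULL_VERSION_RE = re.compile(r":\d+\.\d+\.\d+$")


def classify_base_image(from_line: str) -> str:
    """Classify a Dockerfile FROM line as 'digest', 'exact' or 'floating'."""
    parts = from_line.split()
    if len(parts) < 2 or parts[0].upper() != "FROM":
        raise ValueError(f"not a FROM line: {from_line!r}")
    # Skip options such as --platform=... that may precede the image name.
    candidates = [part for part in parts[1:] if not part.startswith("--")]
    if not candidates:
        raise ValueError(f"no image found in: {from_line!r}")
    image = candidates[0]
    if DIGEST_RE.search(image):
        return "digest"    # immutable: pinned to one exact image
    if FULL_VERSION_RE.search(image):
        return "exact"     # major.minor.patch tag, e.g. alpine:3.19.1
    return "floating"      # e.g. alpine:3 or alpine:edge
```

Under this scheme `FROM alpine:3` and `FROM alpine:edge` both count as floating, while `FROM alpine:3.19.1` counts as exact.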

@eikek
Owner

eikek commented Feb 29, 2024

> I think the culprit is here: #2066 so it's not your fault @eikek :)

Ah amazing thank you for digging this out :) Still should know how these tools work ;-)

I totally agree to use a stable tag for the image! I didn't know that renovate would still update minor and patch, I think this is perfect then. I would be also fine with pinning it using a hash.

Thanks a lot to both of you for analysing this so well! 💯

@eikek
Owner

eikek commented Feb 29, 2024

I'm thinking about creating a new docker image manually (0.41.0-1 or similar) with alpine:3 (or hash, whatever you prefer) and the other fix regarding tesseract language (#2479).

eikek added a commit that referenced this issue Feb 29, 2024
See #2504, alpine edge introduced a version of tesseract that is
problematic to use from within docspell
@eikek
Owner

eikek commented Feb 29, 2024

I now changed the base image as suggested to 3.19.1 here - do you think this is enough to fix this immediate problem?

Also, I assume when alpine edge becomes stable, we need to deal with this dat file somehow right?

@eikek
Owner

eikek commented Mar 2, 2024

I pushed new images, unfortunately my brain didn't work so well and I pushed under the same tags... I wanted to create different version, of course. Curious to see if that helps now.

@pschichtel
Contributor Author

yep, it works again.

@pschichtel
Contributor Author

Should we close this or keep it open to discuss solutions for the issue once it comes back in the future?

@tiborrr
Contributor

tiborrr commented Sep 13, 2024

The images make choices for the user on which languages are installed. Do we want to add more languages or is the default ok?

#2779

We can continue the discussion here.

@pschichtel
Contributor Author

If no one has an issue with the resulting image size, we might as well install all languages automatically. I'd have to see how large the image would be. Alternatively, some option could be added to install additional languages during startup, but those would be installed on every startup.

@tiborrr
Contributor

tiborrr commented Sep 13, 2024

I don't have any issues with the image being a little bit bigger. If you update your docker image then docker compose will first pull the new image and will then replace the existing one with minimal to no downtime.

@eikek
Owner

eikek commented Sep 14, 2024

I'm also fine with a bigger image size. the main intention for the docker images was to have a convenient start with docspell.

@pschichtel
Contributor Author

Interesting finding: the ocrmypdf image is also broken once updated to alpine 3.20, which is not surprising given the issue is tesseract and not ocrmypdf. I'll create an issue over there to discuss this, because I assume this will eventually affect them too.

@pschichtel
Contributor Author

pschichtel commented Sep 15, 2024

a first version of the new image is available at https://github.com/docspell/docker/pkgs/container/joex

it's based on ubuntu:24.10 and docspell 0.42.0 and installs all available tesseract packages ubuntu provides.

@tiborrr
Contributor

tiborrr commented Sep 17, 2024

@pschichtel your link does not work (yet). At what branch did you make this change?

For future reference, this is what OCRmyPDF says in its alpine image.

Note: Alpine 3.20 builds tesseract with --enable-opencl, which is not
supported by anyone. OCRmyPDF is not compatible with Alpine 3.20.0
through 3.20.3. The Alpine issue should be fixed in 3.21.0. It is
not clear if 3.20.4+ will have the fix.

@pschichtel
Contributor Author

@tiborrr unfortunately the package has been set to private automatically and I can't change that.

I also received a response on my issue at ocrmypdf:

ocrmypdf/OCRmyPDF#1395 (comment)

So our options for a quickfix here would be: switch to 3.19.* or switch to edge. My container uses ubuntu.

@eikek
Owner

eikek commented Sep 21, 2024

For quickly fixing the docker images, I could do again a rebuild using alpine 3.19.1 (as this has been working) wdyt? maybe there is a new docker build for the next release then

@tiborrr
Contributor

tiborrr commented Sep 21, 2024

That would be the quickest and easiest fix I guess. You have my blessing to do so.

Then we can figure out a new strategy in the meantime.

@pschichtel
Contributor Author

@tiborrr the ubuntu-based image I referenced is accessible now

@tiborrr
Contributor

tiborrr commented Sep 21, 2024

@pschichtel I will test Monday, I only have access to test servers during office hours 😅

@nekrondev
Contributor

@pschichtel I tested your Ubuntu-based image today and can confirm that Tesseract works as expected (converting scans to text now works like a blitz).

@pschichtel
Contributor Author

I've now also successfully re-processed a bunch of documents with my ubuntu-based container, which worked without issues. I think the container is a drop-in replacement.

@eikek
Owner

eikek commented Oct 22, 2024

Oh I'm sorry, totally forgot about this issue 😞 Could still do the image update or just switch to the ubuntu based image?

@pschichtel
Contributor Author

@eikek what do you mean? I switched to the Ubuntu-based image from https://github.com/docspell/docker and that works fine. The question is: Could this image become the "official" one?

@eikek
Owner

eikek commented Oct 22, 2024

@pschichtel yes that is what I meant/asked. I'd like to remove the docker stuff from this repository in favor of docspell/docker. Do you think it is ready? Then we could update the docs etc.

@pschichtel
Contributor Author

I guess I should first also migrate the restserver image, so that both images could be migrated. I'm also not sure what your stance is on hosting the images only on ghcr.io or if you also want them to exist on dockerhub, but then you'd have to provide credentials for that.

@eikek
Owner

eikek commented Oct 22, 2024

I don't care at all. For me ghcr is totally fine. I would also give the dockerhub credentials, it doesn't matter to me. I think if you are not far from migrating the restserver image, then I'd wait and rather update the docs. But I'm also happy to generate another image for joex and upload it for the meantime.

@pschichtel
Contributor Author

I can probably do the restserver image later this evening, for the beginning it would basically be more of the same. I have ideas for some changes, but those can wait (and could use some discussion).

@eikek
Owner

eikek commented Oct 22, 2024

ok cool, but no rush, just take your time. it doesn't need to be this evening at all.

@pschichtel
Contributor Author

I published the restserver image and my deployment is running on it

@tiborrr
Contributor

tiborrr commented Nov 25, 2024

closing this issue as it works with the new image found in the docker repo at

https://github.com/docspell/docker

  • @pschichtel just saw you opened this issue. You can close it now :)

@pschichtel
Contributor Author

true

@Fredo70

Fredo70 commented Dec 24, 2024

@pschichtel, @eikek
Sorry if I reopen this issue. Is it possible to write a guide for a beginner like me on what I need to do to solve this?
Today (24.12.2024), I tried updating Docker Compose following the instructions. It is Docspell 0.42.0.

$ docker-compose down
$ docker-compose pull
$ docker-compose up --force-recreate --build -d

I am also getting this error. No idea what I need to do.

Many thanks

Di.. 24. Dezember 2024, 12:08: ============ Start processing BRW3C0AF361C657_000005.pdf ============
Di.. 24. Dezember 2024, 12:08: Checking for duplicate files
Di.. 24. Dezember 2024, 12:08: Creating new item with 1 attachment(s)
Di.. 24. Dezember 2024, 12:08: Creating item finished in 211 ms
Di.. 24. Dezember 2024, 12:08: Not an archive: application/pdf
Di.. 24. Dezember 2024, 12:08: Converting file Some(BRW3C0AF361C657_000005.pdf) (application/pdf) into a PDF
Di.. 24. Dezember 2024, 12:08: Storing input to file /tmp/docspell-convert/docspell-ocrmypdf3286238075987602331/infile for running ocrmypdf
Di.. 24. Dezember 2024, 12:08: Trying to read the PDF using 0 passwords
Di.. 24. Dezember 2024, 12:08: Running external command: ocrmypdf -l deu --skip-text --deskew -j 1 /tmp/docspell-convert/docspell-ocrmypdf3286238075987602331/infile /tmp/docspell-convert/docspell-ocrmypdf3286238075987602331/out.pdf
Di.. 24. Dezember 2024, 12:08: Waiting for command to terminate…
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: Tesseract failed to report available languages.
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: Output from Tesseract:
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: -----------
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: [DS] Profile file not available (tesseract_opencl_profile_devices.dat); performing profiling.
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]:
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: [DS] Device: "(null)" (Native) evaluation...
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: Error in pixCloseBrick: pixs not 1 bpp
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: Error in pixOpenBrick: pixs not defined
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: Error in pixSubtract: pixs1 not defined
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: Error in pixOpenBrick: pixs not defined
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: Error in pixOpenBrick: pixs not defined
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: [DS] Device: "(null)" (Native) evaluated
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: [DS] composeRGBPixel: 0.089021 (w=1.2)
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: [DS] HistogramRect: 0.132683 (w=2.4)
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: [DS] ThresholdRectToPix: 0.170560 (w=4.5)
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: [DS] getLineMasksMorph: 0.000145 (w=5.0)
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: [DS] Score: 1.193509
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: [DS] Scores written to file (tesseract_opencl_profile_devices.dat).
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: [DS] Device[1] 0:(null) score is 1.193509
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: [DS] Selected Device[1]: "(null)" (Native)
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: List of available languages in "/usr/share/tessdata/" (24):
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: ces
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: dan
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: deu
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: eng
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: est
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: fin
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: fra
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: heb
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: ita
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: jpn
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: jpn_vert
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: khm
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: lav
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: lit
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: nld
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: nor
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: pol
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: por
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: ron
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: rus
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: slk
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: spa
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: swe
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]: ukr
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]:
Di.. 24. Dezember 2024, 12:08: [ocrmypdf (err)]:
Di.. 24. Dezember 2024, 12:08: PDF conversion failed: Command result=3. No output file found.. Go without PDF file
Di.. 24. Dezember 2024, 12:08: Closing process: `ocrmypdf -l deu --skip-text --deskew -j 1 /tmp/docspell-convert/docspell-ocrmypdf3286238075987602331/infile /tmp/docspell-convert/docspell-ocrmypdf3286238075987602331/out.pdf`
Di.. 24. Dezember 2024, 12:08: Starting text extraction for 1 files
Di.. 24. Dezember 2024, 12:08: Extracting text for attachment BRW3C0AF361C657_000005.pdf
Di.. 24. Dezember 2024, 12:08: Trying to strip text from pdf using pdfbox.
Di.. 24. Dezember 2024, 12:08: Stripped text from PDF is small (0). Trying with OCR.
Di.. 24. Dezember 2024, 12:08: Running external command: gs -dLastPage=10 -dNOPAUSE -dBATCH -dSAFER -sDEVICE=tiffscaled8 -sOutputFile=%d.tif -
Di.. 24. Dezember 2024, 12:08: [gs (out)]: GPL Ghostscript 10.03.1 (2024-05-02)
Di.. 24. Dezember 2024, 12:08: [gs (out)]: Copyright (C) 2024 Artifex Software, Inc. All rights reserved.
Di.. 24. Dezember 2024, 12:08: [gs (out)]: This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
Di.. 24. Dezember 2024, 12:08: [gs (out)]: see the file COPYING for details.
Di.. 24. Dezember 2024, 12:08: [gs (out)]: Processing pages 1 through 1.
Di.. 24. Dezember 2024, 12:08: [gs (out)]: Page 1
Di.. 24. Dezember 2024, 12:08: Waiting for command to terminate…
Di.. 24. Dezember 2024, 12:09: [gs (out)]:
Di.. 24. Dezember 2024, 12:09: Running external command: unpaper /tmp/docspell-extraction/extractpdf4353302714550348038/1.tif /tmp/docspell-extraction/extractpdf4353302714550348038/u-1.tif
Di.. 24. Dezember 2024, 12:09: Waiting for command to terminate…
Di.. 24. Dezember 2024, 12:09: [unpaper (out)]: Processing sheet #1: /tmp/docspell-extraction/extractpdf4353302714550348038/1.tif -> /tmp/docspell-extraction/extractpdf4353302714550348038/u-1.tif
Di.. 24. Dezember 2024, 12:10: [unpaper (err)]: [image2 @ 0x7fae2a502200] Encoder did not produce proper pts, making some up.
Di.. 24. Dezember 2024, 12:10: [unpaper (err)]: [image2 @ 0x7fae2a502200] The specified filename '/tmp/docspell-extraction/extractpdf4353302714550348038/u-1.tif' does not contain an image sequence pattern or a pattern is invalid.
Di.. 24. Dezember 2024, 12:10: [unpaper (err)]: [image2 @ 0x7fae2a502200] Use a pattern such as %03d for an image sequence or use the -update option (with -frames:v 1 if needed) to write a single image.
Di.. 24. Dezember 2024, 12:10: [unpaper (out)]:
Di.. 24. Dezember 2024, 12:10: [unpaper (err)]:
Di.. 24. Dezember 2024, 12:10: Running external command: tesseract u-1.tif stdout -l deu
Di.. 24. Dezember 2024, 12:10: [tesseract (err)]: Stream(..)
Di.. 24. Dezember 2024, 12:10: Waiting for command to terminate…
Di.. 24. Dezember 2024, 12:11: Closing process: `tesseract u-1.tif stdout -l deu`
Di.. 24. Dezember 2024, 12:11: Closing process: `unpaper /tmp/docspell-extraction/extractpdf4353302714550348038/1.tif /tmp/docspell-extraction/extractpdf4353302714550348038/u-1.tif`
Di.. 24. Dezember 2024, 12:11: Closing process: `gs -dLastPage=10 -dNOPAUSE -dBATCH -dSAFER -sDEVICE=tiffscaled8 -sOutputFile=%d.tif -`
...

pschichtel changed the title from "Ocrmypdf fails due to tesseract" to "Ocrmypdf fails due to Tesseract failed to report available languages" on Dec 24, 2024
@pschichtel
Contributor Author

pschichtel commented Dec 24, 2024

@Fredo70 the docker-compose is still using the dockerhub images, did you switch to the ghcr.io/docspell/* images? If not, please do so.
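
To see which image entries in a compose file still need switching, a small check along these lines could be used. It is a sketch only: it treats the compose file as plain text rather than parsing YAML, the function name `images_not_on_ghcr` is hypothetical, and the exact image names and tags in your own file may differ.

```python
import re

# Captures the value of every `image:` entry, with or without quotes.
IMAGE_RE = re.compile(r"^\s*image:\s*['\"]?([^\s'\"]+)", re.MULTILINE)


def images_not_on_ghcr(compose_text: str) -> list[str]:
    """Return docspell image references that do not point at ghcr.io/docspell/."""
    images = IMAGE_RE.findall(compose_text)
    return [
        image
        for image in images
        if "docspell" in image and not image.startswith("ghcr.io/docspell/")
    ]
```

Every entry it returns is a docspell image reference that should be changed to its ghcr.io/docspell/ counterpart before pulling and re-creating the containers.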


7 participants