Error Generating Preview Images for PDF/A Files in docspell 0.41.0 #2486

Closed
ElektroCoder opened this issue Feb 4, 2024 · 12 comments
Labels
question Further information is requested stale

Comments

@ElektroCoder

Hello,

I encountered an issue with processing PDF/A files in docspell version 0.41.0 on Debian 12. Attempting to generate preview images results in an error specifically for colored PDF/A files, whereas standard PDF files are processed without any issues. Here are the relevant log entries:

[...]
Sun, 4 February 2024, 19:45: Creating preview images for 1 files…
Sun, 4 February 2024, 19:45: Creating preview images failed, continuing without it.: LCMS error 13: Mismatched alpha channels
Sun, 4 February 2024, 19:45: Retrieving page count for 1 files…
[...]

I would greatly appreciate any assistance or suggestions on how to resolve this issue.

@eikek
Owner

eikek commented Feb 4, 2024

Hi @ElektroCoder, I probably need such a PDF to check it on my side. Do you perhaps have a test file without sensitive content? Do you know if the same file works in a previous version?

@ElektroCoder
Author

ElektroCoder commented Feb 4, 2024 via email

@eikek
Owner

eikek commented Feb 5, 2024

Hi, oh sure, there is absolutely no rush. Just take your time - however long that may take.

@ElektroCoder
Author

ElektroCoder commented Feb 11, 2024

Hi,

sorry for the delay. I took some time to retest things after double-checking my AMD GPU drivers on Debian and reinstalling Docker and Docspell. I've got two PDFs for you, both scanned with a Brother ADS 2400N scanner. One is in (not working) PDF/A format and the other in standard PDF format. They were saved via a Samba share, which has been working smoothly.

I've never had any issues with Docspell 0.40.0 before. However, I recently upgraded my hardware from an old A3000 CPU to an AMD 5600G CPU, and I'm running everything on a Debian 12 terminal server.
The import process log has this entry:

[...]
Sun, February 11th, 2024, 10:32: Updating SOLR index
Sun, February 11th, 2024, 10:32: Text extraction finished in 46630 ms.
Sun, February 11th, 2024, 10:32: Creating preview images for 1 files…
Sun, February 11th, 2024, 10:32: Creating preview images failed, continuing without it.: LCMS error 13: Mismatched alpha channels
Sun, February 11th, 2024, 10:32: Retrieving page count for 1 files…
Sun, February 11th, 2024, 10:32: Found number of pages: 2
[...]

I'll include the log files as text files. I'm not sure what's causing the problem; everything seems to be functioning fine, and Portainer isn't showing any entries in the container logs.

Thanks for your help in advance.


failed_Scan_20240211_113131_004873.pdf
log_004873_failedPreview_Brother_ADS-2400N_PDF-A.txt
log_004875_workingPreview_Brother_ADS-2400N_PDF.txt
ok_Scan_20240211_113214_004875.pdf

@TheAnachronism
Contributor

I just got the same error in generating the preview for a file.
I'm running docspell inside Kubernetes, but I don't think that's the issue.

@TheAnachronism
Contributor

I also get this a bit before the preview fails:

Thu, February 15th, 2024, 16:45: PDF conversion failed: Command result=3. No output file found.. Go without PDF file

@tenpai-git
Contributor

tenpai-git commented Feb 21, 2024

Hi @ElektroCoder @TheAnachronism

I read your output and also noticed that in the log.
Sun, February 11th, 2024, 11:28: PDF conversion failed: Command result=3. No output file found.. Go without PDF file

I see that the file names of your working preview contain PDF, while those of the failed preview contain PDF/A.

This tells me that potentially PDF/A conversion is the culprit here.

Could you try the following? For scanning this PDF, let's try editing your ocrmypdf configuration a bit. In the /etc/docspell-joex/docspell-joex.conf config, try adding "--output-type", "pdf", to the options (this should come after --skip-text), then restart docspell-joex.

    # The `--skip-text` option is necessary to not fail on "text" pdfs
    # (where ocr is not necessary). In this case, the pdf will be
    # converted to PDF/A.
    ocrmypdf = {
      enabled = true
      command = {
        program = "ocrmypdf"
        args = [
          "-l", "{{lang}}",
          "--skip-text",
          "--deskew",
          "--output-type", "pdf",
          "-j", "1",
          "{{infile}}",
          "{{outfile}}"
        ]

After editing so it appears similar to the excerpt above, restart docspell-joex.

sudo systemctl restart docspell-joex or use equivalent commands on docker.

Then try reprocessing (delete the failed item and any intermediary or cached files created when the original document was scanned in) and send the log over?

It would be good to know if using "--output-type", "pdf", was a better default than PDF/A. @eikek potentially similar to issue #2504 for affected PDFs.

PDF/A is meant to be archived as is, so even though it's counterintuitive since we want to manage documents, converting to raw PDF for processing may be better for Docspell.
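
To check the effect of the flag outside Docspell, the same arguments can be run by hand. The following is only a minimal sketch: `try_ocr` is a hypothetical helper name, `eng` stands in for the `{{lang}}` placeholder, and the file names are whatever you pass in. With `DRY_RUN=1` it only prints the command instead of running it.

```shell
#!/bin/sh
# Sketch (not part of Docspell): run ocrmypdf with the same arguments as the
# joex config excerpt, so the "pdf" and "pdfa" output types can be compared.
# Assumption: "eng" replaces the {{lang}} placeholder from the config.

# try_ocr <output-type> <infile> <outfile>   (hypothetical helper)
try_ocr() {
    # Rebuild the argument list in the same order as the config excerpt.
    set -- ocrmypdf -l eng --skip-text --deskew --output-type "$1" -j 1 "$2" "$3"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"    # dry run: print the command only
    else
        "$@"         # real run: returns ocrmypdf's exit status
    fi
}
```

For example, `try_ocr pdf scan.pdf out-pdf.pdf; echo "exit: $?"` followed by the same call with `pdfa` shows whether only the PDF/A variant fails. OCRmyPDF documents exit status 3 as a missing dependency, so if the "Command result=3" in the logs above came from ocrmypdf, the exit status of a manual run may help narrow things down.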

@tenpai-git
Contributor

tenpai-git commented Feb 22, 2024

Hey guys, maybe try upgrading to the nightly 0.4.2 version? I don't use SOLR (I am using PostgreSQL), but my previews were not generating for certain files either.

I tried upgrading to nightly on a whim, and that resolved it for me. Perhaps there is a dependency issue of some kind.

Curious to see if the other test suggested works out for you as well. Adding "--output-type", "pdf", as previously described fixed things in a lot of pdfs I was working with, including previews.

@eikek
Owner

eikek commented Mar 2, 2024

Hi! I wonder if that issue is also related to #2504 (as mentioned already by @tenpai-git above). The docker images have been updated (sadly reusing the same tags as before) - maybe you could give them a try?

@eikek
Owner

eikek commented Mar 2, 2024

@ElektroCoder I quickly tested your "failed scan" document on my 0.39.0 installation. It was all good: I have a preview and can select text in the converted pdf. For now I would assume a tooling problem, because I don't recall any code changes in that area between that version and 0.41.0. (I'm not using the docker images.)

@eikek eikek added the question Further information is requested label Mar 11, 2024
Contributor

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. This only applies to 'question' issues. Always feel free to reopen or create new issues. Thank you!

@github-actions github-actions bot added the stale label Apr 11, 2024
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Apr 19, 2024
@vs49688

vs49688 commented May 19, 2024

I just hit the same problem, and this workaround [1] fixes it: simply add the JVM option -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true

Footnotes

  1. https://issues.apache.org/jira/browse/PDFBOX-5787
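
One possible way to pass that flag is through the environment before starting joex. This is a rough sketch under an assumption not confirmed in this thread: that the joex start script reads `JAVA_OPTS`, as launcher scripts generated by sbt-native-packager commonly do. The helper name `add_pdfbox_flag` is made up for illustration.

```shell
#!/bin/sh
# Sketch: append the PDFBox workaround flag to JAVA_OPTS exactly once.
# Assumption: the joex start script honours the JAVA_OPTS variable.
PDFBOX_FLAG='-Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true'

# add_pdfbox_flag <current-opts>   (hypothetical helper)
# Prints the given options with the flag appended, unless already present.
add_pdfbox_flag() {
    case " $1 " in
        *" $PDFBOX_FLAG "*) printf '%s\n' "$1" ;;
        *) printf '%s\n' "${1:+$1 }$PDFBOX_FLAG" ;;
    esac
}

JAVA_OPTS=$(add_pdfbox_flag "${JAVA_OPTS:-}")
export JAVA_OPTS
echo "JAVA_OPTS=$JAVA_OPTS"
```

After exporting the variable, joex has to be restarted from the same environment. For a systemd or docker setup the variable would need to go into the unit file or the container environment instead.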
