Upgrade to PDFBox 3.x/refactor our implementation #722

Open
DiegoPino opened this issue Jan 17, 2025 · 2 comments

@DiegoPino
Contributor

What?

PDFBox 2.x (we are on 2.0.29 right now; 2.0.33 seems to have some out-of-memory fixes!) works OK for smaller, not super complex PDF files. But because it keeps processing after a client closes the connection (timeouts, etc.), heap usage under parallel requests can be overwhelming, especially in the normal IIIF use case with a viewer: the viewer (Mirador/IABookReader) will probably request a ton of thumbnails at the same time.

We are getting tons of

500 Internal Server Error
Java heap space

And our heap is large. Basically, the issue is that the client cuts the request sooner on very large files, but also that the current implementation uses a scratch file (if enabled) and a memory limit. The old ImageMagick implementation used to at least free resources.
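
To illustrate the "keeps processing after the client is gone" problem described above, here is a minimal sketch of one possible mitigation: cap the number of concurrent renders and interrupt a render once the client has given up. It uses only `java.util.concurrent`. `RenderGate`, `maxConcurrent`, and `clientTimeoutMs` (a stand-in for detecting a closed connection) are hypothetical names, not part of Cantaloupe or PDFBox, and whether PDFBox rendering reacts promptly to an interrupt is an assumption that would need to be verified.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: caps concurrent renders and interrupts abandoned ones. */
final class RenderGate {
    private final ThreadPoolExecutor pool;

    RenderGate(int maxConcurrent) {
        // A SynchronousQueue never queues work: once all threads are busy,
        // submit() throws RejectedExecutionException instead of piling up tasks.
        this.pool = new ThreadPoolExecutor(
                maxConcurrent, maxConcurrent, 0L, TimeUnit.SECONDS,
                new SynchronousQueue<>());
    }

    /**
     * Runs the task and waits at most clientTimeoutMs for it. On timeout the
     * worker thread is interrupted so it can stop and release its memory.
     */
    <T> T render(Callable<T> task, long clientTimeoutMs) throws Exception {
        Future<T> future = pool.submit(task);
        try {
            return future.get(clientTimeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // sets the interrupt flag on the worker thread
            throw e;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) throws Exception {
        RenderGate gate = new RenderGate(2);
        try {
            String result = gate.render(() -> "rendered page placeholder", 1_000);
            System.out.println(result);
        } finally {
            gate.shutdown();
        }
    }
}
```

Interruption is cooperative: the render task only stops early if it checks `Thread.currentThread().isInterrupted()` or blocks in an interruptible call, so a long-running page render may still need explicit checks between steps.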

OutOfMemory issues are documented here: #198
Our current version in our project:

cantaloupe/pom.xml

Lines 171 to 175 in ac5af61

<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.29</version> <!-- v3 still in beta -->
</dependency>

Migration Guide to Apache PDFBox 3.x (current 3.0.3): https://pdfbox.apache.org/3.0/migration.html

I can't promise this will actually solve all problems, but at least it is a start. Assigning this to myself.

@DiegoPino DiegoPino added this to the 6.0 milestone Jan 17, 2025
@DiegoPino DiegoPino self-assigned this Jan 17, 2025
@DiegoPino
Contributor Author

@glenrobson I might have to discuss this in our next call. I am still getting OOM errors with large-ish PDFs in a concurrent production environment. I also think (maybe I am reading this incorrectly) that the memory settings we have right now don't really apply to the PDF loader and reader anymore.

From the Migration guide at https://pdfbox.apache.org/3.0/migration.html

read operations no longer use scratch files
...

Stream cache
PDFBox 3.0.x no longer uses a separate cache when reading a pdf, but still does for write operations. It introduces the interface org.apache.pdfbox.io.RandomAccessStreamCache to define a cache factory in a more flexible way.

Because the OOM exceptions are not caught, while debugging I can see they are basically logged at the topmost wrapper:
ERROR e.i.l.c.r.ImageRepresentation - write(): Java heap space

One thing that caught my attention is that for streamed resources (there is a note in the code) we consume the stream completely even when we are done using it, to avoid closing the connection (seems like the idea is to avoid some type of remote error?). From my understanding, that also means we don't close the input stream buffer, so even if PDFBox already has what it needs to process a page, the object is never released until it fills up. I don't know if that is good practice (might be nice for images?).
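
For comparison, draining a stream does not have to retain what it reads. The sketch below is an assumption about what an alternative could look like, not a description of the current Cantaloupe code: it reads the remainder into one small reusable buffer, discards it, and then closes the stream. `StreamDrain` and `drainAndClose` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: consumes the rest of a stream without keeping the bytes. */
final class StreamDrain {
    /**
     * Reads and discards everything left in the stream using a single
     * 8 KiB scratch buffer, then closes it. Returns the number of bytes discarded.
     */
    static long drainAndClose(InputStream in) throws IOException {
        byte[] scratch = new byte[8192]; // reused for every read, so memory use stays constant
        long discarded = 0;
        try (InputStream s = in) {
            int n;
            while ((n = s.read(scratch)) != -1) {
                discarded += n;
            }
        }
        return discarded;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[20_000]);
        in.read(new byte[100]); // pretend the reader already took what it needed
        System.out.println("discarded " + drainAndClose(in) + " bytes");
    }
}
```

This keeps the "read to the end so the remote side does not see an aborted connection" behaviour while holding only the scratch buffer in memory; whether the remote error it guards against is real for our sources is still an open question.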

What I will do next (without committing it) is put an OutOfMemoryError catch in every place where I feel we could have a problem, and figure out whether the culprit is actually PDFBox, the fact that we keep streaming (in a separate thread!), or the actual image buffer output.
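
A rough sketch of what such temporary instrumentation could look like, for debugging only: each suspect stage is wrapped so that an `OutOfMemoryError` is logged with the stage name and approximate heap usage before being rethrown. `OomProbe`, `StageOutOfMemory`, and the stage names are hypothetical, and catching `OutOfMemoryError` is not a reliable recovery strategy, only a way to see where it happens.

```java
import java.util.concurrent.Callable;

/** Hypothetical debugging aid: reports which pipeline stage ran out of heap. */
final class OomProbe {

    /** Wraps the original error so the failing stage shows up in the logs. */
    static final class StageOutOfMemory extends RuntimeException {
        final String stageName;

        StageOutOfMemory(String stageName, OutOfMemoryError cause) {
            super("Out of memory in stage '" + stageName + "': " + cause.getMessage(), cause);
            this.stageName = stageName;
        }
    }

    /** Runs one stage; on OutOfMemoryError logs heap numbers and rethrows tagged with the stage. */
    static <T> T stage(String name, Callable<T> body) throws Exception {
        try {
            return body.call();
        } catch (OutOfMemoryError e) {
            Runtime rt = Runtime.getRuntime();
            long usedMiB = (rt.totalMemory() - rt.freeMemory()) >> 20;
            long maxMiB = rt.maxMemory() >> 20;
            System.err.println("OOM in stage " + name + ": ~" + usedMiB
                    + " MiB used of " + maxMiB + " MiB max");
            throw new StageOutOfMemory(name, e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Stage names are placeholders for the PDF load, page render and image write steps.
        String result = OomProbe.stage("demo-stage", () -> "ok");
        System.out.println(result);
    }
}
```

Wrapping the PDF load, the page render, and the output write as separate stages would show which of the three suspects above actually throws.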

@adam-vessey you ok if I pick your brain later this week?

@adam-vessey
Contributor

@DiegoPino : PDFBox in Cantaloupe isn't something I'm terribly familiar with, but yeah, could probably find some cycles.


2 participants