Upgrade to PDFBox 3.x/refactor our implementation #722

DiegoPino · 2025-01-17T16:53:14Z

What?

PDFBox 2.x (we are on 2.0.29 right now; 2.0.33 seems to include some out-of-memory fixes!) works OK for smaller, not overly complex PDF files. But because it keeps processing after a client closes a connection (timeouts, etc.), heap usage under parallel requests can be overwhelming, especially in the normal IIIF use case combined with a viewer: the viewer (Mirador/IABookreader) will probably request a ton of thumbnails at the same time.

We are getting tons of

500 Internal Server Error
Java heap space

And our heap is large. Basically, the issue is that the client cuts the request early on very large files, but also that the current implementation uses a scratch file (if enabled) and a memory limit. Old ImageMagick used to at least free resources.

OutOfMemory issues are documented here: #198
The current version in our project:

cantaloupe/pom.xml

Lines 171 to 175 in ac5af61

    
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.29</version> <!-- v3 still in beta -->
    </dependency>

Migration Guide to Apache PDFBox 3.x (current 3.0.3): https://pdfbox.apache.org/3.0/migration.html

I can't promise this will actually solve all the problems, but at least it is a start. Assigning this to myself.

DiegoPino · 2025-01-21T15:01:29Z

@glenrobson I might have to discuss this in our next call. I am still getting OOM errors with large-ish PDFs in a concurrent production environment. I also think (maybe I am reading this incorrectly) that the memory settings we have right now don't really apply to the PDF loader and reader anymore.

From the Migration guide at https://pdfbox.apache.org/3.0/migration.html

read operations no longer use scratch files
...

Stream cache
PDFBox 3.0.x no longer uses a separate cache when reading a pdf, but still does for write operations. It introduces the interface org.apache.pdfbox.io.RandomAccessStreamCache to define a cache factory in a more flexible way.

Because the OOM errors are not caught, while debugging I can only see them logged at
ERROR e.i.l.c.r.ImageRepresentation - write(): Java heap space (which is the topmost wrapper).

One thing that caught my attention: for streamed resources (there is a note in the code) we consume the stream completely, even when we are done using it, to avoid closing the connection (the idea seems to be to avoid some type of remote error?). From my understanding, that also means we don't close the input stream buffer, and even if PDFBox already has what it needs to process a page, the object is never released until it fills up. I don't know if that is good practice (it might be nice for images?).
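
To make the cost of that full drain concrete, here is a minimal Java sketch. It is not Cantaloupe code: `StreamDrainDemo`, `CountingStream` and `bytesPulled` are hypothetical names, and an in-memory stream stands in for the remote resource. It only compares how many bytes are pulled from the source when a reader takes what it needs and closes, versus when it reads on to the end of the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical demo class; not part of Cantaloupe or PDFBox. */
final class StreamDrainDemo {

    /** Wrapper that counts every byte handed to the caller. */
    static final class CountingStream extends FilterInputStream {
        long bytesRead = 0;

        CountingStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) bytesRead++;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) bytesRead += n;
            return n;
        }
    }

    /** Reads up to n bytes, stopping early only at end of stream. */
    private static void readFully(InputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int off = 0;
        while (off < n) {
            int r = in.read(buf, off, n - off);
            if (r < 0) break;
            off += r;
        }
    }

    /**
     * Reads the first `needed` bytes, optionally drains the rest, then closes.
     * Returns how many bytes were pulled from the source in total.
     */
    static long bytesPulled(byte[] source, int needed, boolean drain) {
        try (CountingStream s = new CountingStream(new ByteArrayInputStream(source))) {
            readFully(s, needed);
            if (drain) {
                byte[] discard = new byte[8192];
                while (s.read(discard) != -1) {
                    // Nothing to do: the remaining bytes are read and thrown away.
                }
            }
            return s.bytesRead;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a 1000-byte source and 100 bytes needed, `bytesPulled(src, 100, false)` returns 100 and `bytesPulled(src, 100, true)` returns 1000. Whether closing early is safe against a real remote source is exactly the open question above; the sketch only shows the extra reading the drain implies.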

What I will do next (without committing it) is put an OutOfMemoryError catch in every part where I feel we could have a problem, and figure out whether the culprit is actually PDFBox, the fact that we keep streaming (in a separate thread!), or the actual image buffer output.
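
A rough sketch of what such a debugging-only catch could look like in Java. Everything here is an assumption for illustration: `OomProbe`, `Outcome` and `guard` are hypothetical names, not existing Cantaloupe or PDFBox APIs, and the stage labels are made up. The idea is just to wrap each suspect stage so the log says which one ran out of heap, instead of only the topmost wrapper.

```java
import java.util.function.Supplier;

/** Hypothetical debugging helper; not part of Cantaloupe or PDFBox. */
final class OomProbe {

    /** Result of one guarded stage: its value, or the name of the stage that failed. */
    static final class Outcome<T> {
        final T value;            // null when the stage ran out of heap
        final String failedStage; // null when the stage completed

        Outcome(T value, String failedStage) {
            this.value = value;
            this.failedStage = failedStage;
        }
    }

    /**
     * Runs one stage and reports whether it threw OutOfMemoryError.
     * OutOfMemoryError is an Error, not an Exception, so a plain
     * catch (Exception e) block would never see it.
     */
    static <T> Outcome<T> guard(String stage, Supplier<T> work) {
        try {
            return new Outcome<>(work.get(), null);
        } catch (OutOfMemoryError e) {
            // Keep this handler cheap: the heap may still be nearly full.
            System.err.println("OOM in stage '" + stage + "': " + e.getMessage());
            return new Outcome<>(null, stage);
        }
    }
}
```

Usage would be something like `OomProbe.guard("render-page", () -> renderPage(doc))`, where `renderPage` and `doc` stand for whatever the real stage is. Catching `OutOfMemoryError` like this is only meant for locating the failing stage while debugging; the JVM may be in a poor state afterwards, so it is not a recovery strategy.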

@adam-vessey you ok if I pick your brain later this week?

adam-vessey · 2025-01-21T16:59:46Z

@DiegoPino : PDFBox in Cantaloupe isn't something I'm terribly familiar with, but yeah, could probably find some cycles.

DiegoPino added the enhancement label Jan 17, 2025

DiegoPino added this to the 6.0 milestone Jan 17, 2025

DiegoPino self-assigned this Jan 17, 2025

DiegoPino mentioned this issue Jan 19, 2025

ISSUE-722: PDFBox 3.0.x #723

Draft
