PDFBox 2.x (we are on 2.0.29 right now; 2.0.33 seems to include some out-of-memory fixes!) works OK for smaller, less complex PDF files, but the fact that it keeps processing after a client closes a connection (timeouts, etc.) means heap usage on parallel requests can be overwhelming, especially in the normal IIIF use case in combination with a viewer: a viewer (Mirador/IABookReader) will probably request a ton of thumbnails at the same time.
We are getting tons of:
500 Internal Server Error
Java heap space
And our heap is large. Basically, the issue is that the client cuts the request off sooner on very large files, but also that the current implementation uses a scratch file (if enabled) plus a memory limit. The old ImageMagick processor at least used to free resources.
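For context, the scratch file and memory limit mentioned above are configured through PDFBox 2.x's MemoryUsageSetting. A minimal sketch (the 64 MB cap and the CLI file argument are placeholders, not Cantaloupe's actual configuration):

```java
// Sketch only: how PDFBox 2.x combines an in-heap limit with a disk scratch file.
import java.io.File;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;

public class Pdfbox2LoadSketch {
    public static void main(String[] args) throws Exception {
        // Mixed mode: keep at most 64 MB of document data in heap,
        // spill anything beyond that to a temporary scratch file.
        MemoryUsageSetting mus = MemoryUsageSetting.setupMixed(64L * 1024 * 1024);
        try (PDDocument doc = PDDocument.load(new File(args[0]), mus)) {
            System.out.println("pages: " + doc.getNumberOfPages());
        } // closing the document is what releases the scratch file
    }
}
```

The catch, as described above, is that nothing closes the document early when the client has already gone away.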
OutOfMemory issues are documented here: #198
@glenrobson I might have to discuss this in our next call. I am still getting OOM errors with large-ish PDFs in a concurrent production environment. I also think (maybe I am reading this incorrectly) that the memory settings we have right now don't really apply to the PDF loader and reader anymore ..
Stream cache
PDFBox 3.0.x no longer uses a separate cache when reading a PDF, but still does for write operations. It introduces the org.apache.pdfbox.io.RandomAccessStreamCache interface to define a cache factory in a more flexible way.
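A minimal write-side sketch of the new factory, assuming the 3.0.x API as described in the migration guide (IOUtils.createTempFileOnlyStreamCache() is one of the stock factories, if I read the 3.0 javadoc right; the temp-file choice here is purely illustrative):

```java
// Sketch, assuming the PDFBox 3.0.x API per the migration guide: write
// operations take a RandomAccessStreamCache.StreamCacheCreateFunction.
import java.io.File;

import org.apache.pdfbox.io.IOUtils;
import org.apache.pdfbox.io.RandomAccessStreamCache.StreamCacheCreateFunction;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

public class Pdfbox3CacheSketch {
    public static void main(String[] args) throws Exception {
        // Cache factory for the write path: buffer in a temp file, not in heap.
        StreamCacheCreateFunction cache = IOUtils.createTempFileOnlyStreamCache();
        try (PDDocument doc = new PDDocument(cache)) {
            doc.addPage(new PDPage());
            doc.save(new File(args[0])); // save goes through the chosen cache
        }
    }
}
```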
Because the OOM exceptions are not caught, while debugging I can see they are basically only logged at the top-level/outermost wrapper:
ERROR e.i.l.c.r.ImageRepresentation - write(): Java heap space
One thing that caught my attention: for streamed resources (there is a note in the code) we consume the stream completely even when we are done using it, to avoid closing the connection (the idea seems to be avoiding some type of remote error?). But from my understanding that also means we don't release the input stream's buffer, so even when PDFBox already has what it needs to process a page, the object is never released until the heap fills up. I don't know if that is good practice (it might be fine for images?)
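If the goal of draining is just connection reuse, the remaining bytes can be read into a small reusable buffer and discarded rather than retained. A hypothetical sketch in plain Java to illustrate the distinction (drainRemaining is an illustrative name, not Cantaloupe's actual code):

```java
// Illustrative only: drain a stream's tail without pinning it in heap.
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class DrainSketch {
    // Read and immediately discard whatever is left on the stream, so the
    // underlying connection can be reused but no extra heap is held.
    static long drainRemaining(InputStream in) throws IOException {
        long drained = 0;
        byte[] skipBuf = new byte[8192];       // fixed-size, reused buffer
        int n;
        while ((n = in.read(skipBuf)) != -1) { // bytes are overwritten, never kept
            drained += n;
        }
        return drained;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[100_000];
        try (InputStream in = new ByteArrayInputStream(payload)) {
            in.read(new byte[1_000]);          // pretend the parser read what it needed
            long rest = drainRemaining(in);    // discard the tail instead of buffering it
            System.out.println(rest);          // prints 99000
        }
    }
}
```

The suspicion above is that the current code buffers the tail instead of discarding it, which is the worst case for large PDFs.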
What I will do next (not committing it) is to put an OutOfMemoryError catch in every place where I feel we could have a problem, and figure out whether PDFBox is actually the culprit, or whether the problem is that we keep streaming (in a separate thread!) or the actual image buffer output.
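Worth noting: OutOfMemoryError is an Error, not an Exception, so the usual catch (Exception e) handlers let it fly past to the outermost logger, as seen above. A hypothetical sketch of catching it at a boundary (renderOrFail is an illustrative name; the example simulates the error rather than actually exhausting the heap):

```java
// Illustrative only: turn heap exhaustion in one step into a diagnosable result.
public class OomGuard {
    // Run one rendering step and map OutOfMemoryError to a failure string
    // instead of letting it escape uncaught.
    static String renderOrFail(Runnable step) {
        try {
            step.run();
            return "ok";
        } catch (OutOfMemoryError e) {
            // The failed allocation has been abandoned by this point, so a
            // small amount of work (logging, returning a 500) is usually safe.
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(renderOrFail(() -> {}));                 // prints ok
        System.out.println(renderOrFail(() -> {
            throw new OutOfMemoryError("Java heap space");          // simulated
        }));                                                        // prints failed: Java heap space
    }
}
```

Catching the error per request at least tells you which stage blew up, which is exactly what is missing right now.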
@adam-vessey you ok if I pick your brain later this week?
Our current version in our project:
cantaloupe/pom.xml, lines 171 to 175 in ac5af61
Migration Guide to Apache PDFBox 3.x (current 3.0.3): https://pdfbox.apache.org/3.0/migration.html
I can't promise this will actually solve all the problems, but at least it is a start. Assigning this to myself.