Move Tags Cache time into a parameter, handling of 50k+ documents #371

christianlouis · 2025-02-21T10:23:46Z

christianlouis
Feb 21, 2025
Sponsor

Hey, I really love the work you've done and I stumbled upon this project whilst being in the middle of building a similar solution myself to wrangle the existing library of scanned documents I've built over the last 10+ years - maybe it's a German thing ;) ?
One thing that I noticed is that the system slows down in processing once the number of tags grows. I allowed for custom tags to be generated and am now stuck with 6000 custom tags. Re-scanning the list of tags every three seconds seems to slow down the processing significantly.
I am currently experimenting with upping the tag cache time a little bit and would really appreciate some experience around that and how to speed up bulk processing.

Thanks a lot for everybody's input on this.

clusterzx · 2025-02-21T11:40:46Z

clusterzx
Feb 21, 2025
Maintainer

Yeah ... this is a huge amount of documents. Wir Deutschen lieben unsere Papiere und Ordnung 😆
Unfortunatly I can not directly address this "problem" as I do not have such an amount of data lying arround.
But just thinking about it, this is clearly a problem or improvment I have to address.
The thing is paperless-ai pulls the complete data strucuture for every document at once, so maybe it could be also a performance issue by the paperless-ngx database and api itself.

I will try some things out to reduce the data that is transferred on requests.

1 reply

christianlouis Feb 21, 2025
Author Sponsor

It seems that iterating over the tags to read them is slow for 6000 tags and yes, this is likely a Paperless performance issue...
Happy to share performance data or other ways to see how the system behaves against a larger body of documents :) Just let me know what to do to help you make this an even better solution.

Waly-de · 2025-02-21T13:00:27Z

Waly-de
Feb 21, 2025

Wouldn’t it be possible to fetch the tag cache only once at the beginning of the scan and then work with this “copy”?

For me, the time between
[DEBUG] Refreshing tag cache...
and
[DEBUG] Tag cache refreshed. Found 4579 tags.
now takes over 5 minutes… and consumes 30% CPU power during that time.
(Yes, the system is not a powerhouse.)

1 reply

clusterzx Feb 21, 2025
Maintainer

Wouldn’t it be possible to fetch the tag cache only once at the beginning of the scan and then work with this “copy”?

For me, the time between

[DEBUG] Refreshing tag cache...

and

[DEBUG] Tag cache refreshed. Found 4579 tags.

now takes over 5 minutes… and consumes 30% CPU power during that time.

(Yes, the system is not a powerhouse.)

No because paperless-ai would not know about newly created tags for further processing with new documents.

Waly-de · 2025-02-21T13:15:15Z

Waly-de
Feb 21, 2025

So in my case, the chances are higher that someone creates a new tag during the scanning process rather than during normal operation. And the TAGs your system generates you already know ;-)

In any case, this is quite a performance killer. I would be super happy if you could find a solution for this in the future.

And of course, a big thumbs up from me for your great work!

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move Tags Cache time into a parameter, handling of 50k+ documents #371

{{title}}

Replies: 3 comments 2 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Move Tags Cache time into a parameter, handling of 50k+ documents #371

christianlouis Feb 21, 2025 Sponsor

Replies: 3 comments · 2 replies

clusterzx Feb 21, 2025 Maintainer

christianlouis Feb 21, 2025 Author Sponsor

Waly-de Feb 21, 2025

clusterzx Feb 21, 2025 Maintainer

Waly-de Feb 21, 2025

christianlouis
Feb 21, 2025
Sponsor

Replies: 3 comments 2 replies

clusterzx
Feb 21, 2025
Maintainer

christianlouis Feb 21, 2025
Author Sponsor

Waly-de
Feb 21, 2025

clusterzx Feb 21, 2025
Maintainer

Waly-de
Feb 21, 2025