Significantly reduce per-frame memory allocations from the heap in the Mobile renderer #103794

clayjohn · 2025-03-08T06:53:31Z

The aim of this PR is to reduce memory allocations, not necessarily to increase performance, though it does bring a nice performance increase as well.

In last week's core meeting we agreed that we should work towards reducing per-frame memory allocations as much as possible, both for performance and stability.

Results using the Legend of the Nuku Warriors demo with tons of omnilights:

Before: ~95,000 allocations per second, 31-32 mspf

After: ~45,000 allocations per second, 29-30 mspf

The most important changes here are:

Use a thread_local LocalVector instead of a Vector. This removes the allocations caused by the frequent use of push_back(), since the buffer persists between calls.
Cache uniform sets instead of recreating them every frame. uniform_set_create() allocates a lot of memory.
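
For readers unfamiliar with the first pattern, here is a minimal sketch of the idea. It assumes std::vector as a stand-in for Godot's LocalVector, and the Uniform struct and build_uniforms() are hypothetical names for illustration, not code from this PR.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for RD::Uniform; the real struct has more fields.
struct Uniform {
	int binding = 0;
	int uniform_type = 0;
};

// Builds the uniform list for one draw call.
// The scratch vector is thread_local, so it is constructed once per thread
// and reused by every later call on that thread. clear() destroys the
// elements but keeps the capacity, so push_back() stops touching the heap
// once the buffer has grown to its steady-state size.
const std::vector<Uniform> &build_uniforms(int p_count) {
	thread_local std::vector<Uniform> uniforms;
	uniforms.clear();
	for (int i = 0; i < p_count; i++) {
		Uniform u;
		u.binding = i;
		uniforms.push_back(u);
	}
	return uniforms;
}
```

The returned reference is only valid until the next call on the same thread, which is the usual trade-off with this kind of reusable scratch buffer.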

There is a lot more that can be done, but this PR contains a very safe, very impactful set of optimizations. So I would prefer to merge this quickly and then move on to the other places.

lawnjelly · 2025-03-08T09:01:11Z

servers/rendering/renderer_rd/forward_mobile/render_forward_mobile.cpp

-		u.binding = 6;
-		u.uniform_type = RD::UNIFORM_TYPE_TEXTURE;
+		Vector<RID> textures;
+		textures.resize(scene_state.max_lightmaps * 2);


If this function is called a lot, you probably want to use a fixed array on the stack. Basically a fancy fixed C array with all the usual machinery for push_back etc.

I have a basic one I wrote for 3.x (core/fixed_array.h), but you can equally well modify the LocalVector template to be capable of storing on the stack (let me know if this is of interest, I recently did this for a third party module).
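
As a rough illustration of the fixed array idea, a sketch could look like the following. This is a hypothetical FixedArray written for this example, not the actual core/fixed_array.h from 3.x, and it assumes the element type is default-constructible.

```cpp
#include <cstddef>

// Hypothetical fixed-capacity array for illustration only.
// Storage lives inline in the object, so a local instance sits on the
// stack and push_back() never allocates from the heap.
// Assumes T is default-constructible and copy-assignable.
template <typename T, size_t CAPACITY>
class FixedArray {
	T elements[CAPACITY] = {};
	size_t count = 0;

public:
	// Returns false instead of growing when the array is full.
	bool push_back(const T &p_value) {
		if (count >= CAPACITY) {
			return false;
		}
		elements[count++] = p_value;
		return true;
	}

	void clear() { count = 0; }
	size_t size() const { return count; }
	T &operator[](size_t p_index) { return elements[p_index]; }
	const T &operator[](size_t p_index) const { return elements[p_index]; }
};
```

The catch is that the capacity must be known at compile time, and the contents still have to be copied if the consumer wants to own a heap-backed container.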

Ah I see, it's getting passed to RD::Uniform below and stored there, in which case maybe this allocation is unavoidable. 🙁

I think the better option here is just to track the textures that may have changed and only recreate the Uniforms array when we know there will be a change.

Basically this function recreates the array of Uniforms every frame and then indexes into a hash map to see if we have cached this uniform set or not. Instead, with minimal tracking, we can just check if any Uniform actually changed, then only run this code when it has. Realistically, this won't run most frames.

But that is a riskier change to make, so I'd like to do it in a follow-up PR, as this gives us 99% of the benefit and is totally safe.
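
The change tracking described above might be sketched roughly as follows. Everything here is hypothetical (UniformSetTracker, TextureId, and the integer standing in for a cached uniform set); it only illustrates the dirty-flag idea and is not the planned follow-up implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical handle type standing in for a texture RID.
using TextureId = uint64_t;

// Sketch of change tracking: the expensive rebuild only runs when a bound
// texture actually differs from what was bound before.
class UniformSetTracker {
	std::vector<TextureId> bound;
	bool dirty = true; // Force one build on first use.
	int cached_set = 0; // Stand-in for the cached uniform set.
	int rebuild_count = 0;

public:
	explicit UniformSetTracker(size_t p_slots) :
			bound(p_slots, 0) {}

	// Marks the set dirty only if the texture in this slot really changed.
	void set_texture(size_t p_slot, TextureId p_texture) {
		if (bound[p_slot] != p_texture) {
			bound[p_slot] = p_texture;
			dirty = true;
		}
	}

	// Returns the cached set, rebuilding it only when something changed.
	int get_uniform_set() {
		if (dirty) {
			// Placeholder for rebuilding the Uniforms array and
			// creating (or looking up) the uniform set.
			rebuild_count++;
			cached_set = rebuild_count;
			dirty = false;
		}
		return cached_set;
	}

	int get_rebuild_count() const { return rebuild_count; }
};
```

In a steady-state frame where no texture changes, get_uniform_set() in this sketch does no work beyond checking the flag.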

lawnjelly · 2025-03-08T09:04:57Z

I'm seeing a lot of thread_local LocalVector<RD::Uniform> uniforms; in multiple functions.

Is it possible to just create one and share it?

Yes, thinking about it, in the long term, I suspect that if such a Vector is thread_local, then by definition you only need one per type (unless you are using it in a nested fashion; recursive use wouldn't work at all for this).

This suggests, long term, having a file somewhere in e.g. the renderer with a bunch of these thread_local vectors and reusing them. Otherwise we are just wasting memory and perhaps thrashing the cache unnecessarily.
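
A sketch of what sharing one buffer per type could look like, assuming std::vector in place of LocalVector. The helper get_scratch() and the two example functions are hypothetical names for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: one thread_local scratch buffer per element type,
// shared by every function that needs temporary storage of that type.
template <typename T>
std::vector<T> &get_scratch() {
	thread_local std::vector<T> scratch;
	return scratch;
}

// Sums 1..p_count using the shared int scratch buffer.
int sum_first(int p_count) {
	std::vector<int> &values = get_scratch<int>();
	values.clear();
	for (int i = 1; i <= p_count; i++) {
		values.push_back(i);
	}
	int total = 0;
	for (int v : values) {
		total += v;
	}
	return total;
}

// Counts the even numbers in 1..p_count using the same shared buffer.
int count_even(int p_count) {
	std::vector<int> &values = get_scratch<int>();
	values.clear();
	for (int i = 1; i <= p_count; i++) {
		if (i % 2 == 0) {
			values.push_back(i);
		}
	}
	return static_cast<int>(values.size());
}
```

This only works because neither function calls the other while it is still using the buffer. If sum_first() called count_even() in the middle of its loop, the shared contents would be cleared underneath it, which is the nesting problem mentioned above.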

lawnjelly

I'm not super familiar with the renderer in 4.x, but this looks ok to me from a quick look.

Obviously as you say there are more improvements to come, but perfect is the enemy of good enough as they say, and it's an incremental thing.

clayjohn · 2025-03-08T17:40:46Z

Obviously as you say there are more improvements to come, but perfect is the enemy of good enough as they say, and it's an incremental thing.

I agree. Long term we need to weigh the costs of making more drastic changes against how much benefit we actually get. The renderer is now doing about 20,000 allocations per second with this test scene. That means our upper bound for improvement is half of what I just did. So a very drastic solution may not be warranted. Most of the remaining allocations come from one specific function too (draw_list_begin). Once we address that case it may not be beneficial to fix everything else.

That being said, I agree it's worth investigating how we can avoid using thread_local vectors everywhere. That comes with its own cost.

Repiteo · 2025-03-09T14:11:39Z

Thanks!

Reduce per-frame memory allocations from the heap in the Mobile renderer.

5efcd64

clayjohn added enhancement topic:rendering performance labels Mar 8, 2025

clayjohn added this to the 4.5 milestone Mar 8, 2025

clayjohn requested a review from a team as a code owner March 8, 2025 06:53

lawnjelly reviewed Mar 8, 2025

lawnjelly approved these changes Mar 8, 2025

Repiteo merged commit f42565c into godotengine:master Mar 9, 2025
20 checks passed
