Significantly reduce per-frame memory allocations from the heap in the Mobile renderer #103794

Merged: 1 commit, Mar 9, 2025

@clayjohn (Member) commented Mar 8, 2025

The aim of this PR is to reduce memory allocations, not necessarily to increase performance, but it does bring a nice performance increase as well.

In last week's core meeting we agreed that we should work towards reducing per-frame memory allocations as much as possible, both for performance and for stability.

Results using the Legend of the Nuku Warriors demo with tons of omnilights.

Before: ~95,000 allocations per second: 31-32 mspf

After: ~45,000 allocations per second: 29-30 mspf

The most important changes here are:

  1. Use a thread_local LocalVector instead of a Vector. This removes the allocations caused by the frequent use of push_back().
  2. Cache uniform sets instead of recreating them every frame. uniform_set_create() allocates a lot of memory.
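
A minimal sketch of the first change, assuming std::vector as a stand-in for LocalVector and a hypothetical Uniform struct (neither is the PR's actual code). It only illustrates why a thread_local buffer avoids repeated heap allocations: the storage survives between calls, so refilling it with the same number of elements does not allocate again.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for RD::Uniform; only what the sketch needs.
struct Uniform {
    int binding = 0;
};

// Returns this thread's scratch buffer.
// Assumption: std::vector stands in for Godot's LocalVector.
std::vector<Uniform> &scratch_uniforms() {
    thread_local std::vector<Uniform> uniforms;
    return uniforms;
}

// Simulates per-frame work: rebuild the uniform list in the reused buffer.
// Returns the buffer capacity so a caller can see whether it grew.
std::size_t build_uniforms(int count) {
    assert(count >= 0);
    std::vector<Uniform> &uniforms = scratch_uniforms();
    uniforms.clear(); // Drops the elements but keeps the allocated storage.
    for (int i = 0; i < count; i++) {
        Uniform u;
        u.binding = i;
        uniforms.push_back(u);
    }
    return uniforms.capacity();
}
```

Calling build_uniforms() twice with the same count should leave both the capacity and the data pointer unchanged on the second call, which is the effect the PR relies on.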

There is a lot more that can be done, but this PR contains a very safe, very impactful set of optimizations, so I would prefer to merge it quickly and then move on to the other places.

@clayjohn clayjohn added this to the 4.5 milestone Mar 8, 2025
@clayjohn clayjohn requested a review from a team as a code owner March 8, 2025 06:53
u.binding = 6;
u.uniform_type = RD::UNIFORM_TYPE_TEXTURE;
Vector<RID> textures;
textures.resize(scene_state.max_lightmaps * 2);

Review comment (Member):

If this function is called a lot, you probably want to use a fixed-size array on the stack. Basically a fancy fixed C array with all the usual machinery for push_back etc.

I have a basic one I wrote for 3.x (core/fixed_array.h), but you can equally well modify the LocalVector template to be capable of storing on the stack (let me know if this is of interest, I recently did this for a third-party module).
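
As a rough illustration of the idea, here is a minimal sketch of a fixed-capacity array with push_back. The FixedArray name and interface are hypothetical and are not taken from core/fixed_array.h; the sketch also assumes T is default-constructible.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-capacity array. The storage lives inside the object,
// so a local instance sits on the stack and never touches the heap.
// Assumption: T is default-constructible and copy-assignable.
template <typename T, std::size_t N>
class FixedArray {
    T items[N] = {};
    std::size_t count = 0;

public:
    // Appends a value; returns false instead of growing when full.
    bool push_back(const T &value) {
        if (count == N) {
            return false;
        }
        items[count++] = value;
        return true;
    }

    std::size_t size() const { return count; }
    constexpr std::size_t capacity() const { return N; }
    void clear() { count = 0; }

    T &operator[](std::size_t i) {
        assert(i < count);
        return items[i];
    }
    const T &operator[](std::size_t i) const {
        assert(i < count);
        return items[i];
    }
};
```

The trade-off is that the capacity must be known at compile time, and the contents cannot be handed to an API that stores a heap-backed Vector without copying.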

Review comment (Member):

Ah, I see it's getting passed to RD::Uniform below and stored there, in which case maybe this allocation is unavoidable. 🙁

Review comment (Member, Author):

I think the better option here is just to track the textures that may have changed and only recreate the Uniforms array when we know there will be a change.

Basically this function recreates the array of Uniforms every frame and then indexes into a hash map to see if we have cached this uniform set or not. Instead, with minimal tracking, we can just check if any Uniform actually changed, then only run this code when it has. Realistically, this won't run most frames.

But that is a riskier change to make, so I'd like to do it in a follow-up PR, as this gives us 99% of the benefit and is totally safe.
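
The tracking idea described above could look roughly like the following sketch. Everything here is hypothetical: UniformSetCache, TextureId, and the integer handle are illustrative stand-ins, not the renderer's types, and the rebuild step is a placeholder for building the Uniforms array and creating the set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using TextureId = std::uint64_t; // Hypothetical stand-in for RID.

// Hypothetical cache that rebuilds its uniform set only when an input changed.
class UniformSetCache {
    std::vector<TextureId> textures;
    std::uint64_t cached_set = 0; // Stand-in for a uniform set handle.
    bool dirty = true;            // Nothing is cached yet.
    int rebuild_count = 0;

public:
    explicit UniformSetCache(std::size_t slots) : textures(slots, 0) {}

    // Marks the cache dirty only if the texture in this slot really changed.
    void set_texture(std::size_t slot, TextureId id) {
        assert(slot < textures.size());
        if (textures[slot] != id) {
            textures[slot] = id;
            dirty = true;
        }
    }

    // Returns the cached handle, rebuilding only when something changed.
    std::uint64_t get_uniform_set() {
        if (dirty) {
            // Placeholder for the expensive path (build uniforms, create set).
            rebuild_count++;
            cached_set = static_cast<std::uint64_t>(rebuild_count);
            dirty = false;
        }
        return cached_set;
    }

    int rebuilds() const { return rebuild_count; }
};
```

With this shape, frames in which no texture changes skip the expensive path entirely, which matches the expectation that the rebuild would not run on most frames.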

@lawnjelly (Member) commented Mar 8, 2025

I'm seeing a lot of thread_local LocalVector<RD::Uniform> uniforms; in multiple functions.

Is it possible to just create one and share it?

Yes, thinking about it, in the long term, I suspect that if such a Vector is thread_local, then by definition you only need one per type (unless you are using it in a nested fashion; recursive use wouldn't work at all for this).

This suggests, long term, having a file somewhere in e.g. the renderer with a bunch of these thread_local vectors and reusing them. Otherwise we are just wasting memory and perhaps thrashing the cache unnecessarily.
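
A minimal sketch of that sharing idea, and of the nesting hazard it carries. The shared_scratch helper is hypothetical, and std::vector is used as a stand-in for LocalVector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: one scratch buffer per element type, per thread.
template <typename T>
std::vector<T> &shared_scratch() {
    thread_local std::vector<T> buffer;
    return buffer;
}

// Inner user of the shared int buffer: sums 1..n.
int sum_first(int n) {
    std::vector<int> &v = shared_scratch<int>();
    v.clear();
    for (int i = 1; i <= n; i++) {
        v.push_back(i);
    }
    int total = 0;
    for (int x : v) {
        total += x;
    }
    return total;
}

// Outer user: fills the buffer, then calls sum_first(), which reuses the
// very same buffer. Returns the size the outer code sees afterwards.
std::size_t nested_use() {
    std::vector<int> &v = shared_scratch<int>();
    v.assign(10, 7); // The outer code expects 10 elements from here on.
    sum_first(3);    // Clobbers v: same thread, same type, same buffer.
    return v.size();
}
```

In this sketch nested_use() ends up seeing the inner function's 3 elements rather than its own 10, which is why a shared buffer is only safe when its users never overlap on the same thread.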

@lawnjelly (Member) left a review comment:

I'm not super familiar with the renderer in 4.x, but this looks OK to me from a quick look.

Obviously as you say there are more improvements to come, but perfect is the enemy of good enough as they say, and it's an incremental thing.

@clayjohn (Member, Author) commented Mar 8, 2025

> Obviously as you say there are more improvements to come, but perfect is the enemy of good enough as they say, and it's an incremental thing.

I agree. Long term we need to weigh the cost of making more drastic changes against how much benefit we actually get. The renderer is now doing about 20,000 allocations per second with this test scene. That means our upper bound for further improvement is half of what I just did, so a very drastic solution may not be warranted. Most of the remaining allocations come from one specific function too (draw_list_begin). Once we address that case it may not be beneficial to fix everything else.

That being said, I agree it's worth investigating how we can avoid using thread_local vectors everywhere. That comes with its own cost.

@Repiteo Repiteo merged commit f42565c into godotengine:master Mar 9, 2025
20 checks passed
@Repiteo (Contributor) commented Mar 9, 2025

Thanks!
