
tests: dynamic_thread_stack: Fix multicore and incoherent cache #68311

Merged (2 commits) Feb 9, 2024

Conversation

@ceolin (Member) commented Jan 30, 2024

When running in a multicore environment with an incoherent cache, it is
possible that the thread allocating dynamic stacks is switched to a
different CPU. In that situation, further access to that memory (such as
when releasing resources) will be invalid.

Fixes #67515

dcpleung previously approved these changes Jan 30, 2024

@andyross (Contributor) left a comment

Hm. How is the dynamic stack incoherent? The only incoherent writable memory should be static thread stacks.

Really the issue should be the opposite: we should want the dynamic stacks to be cached (to avoid a performance disaster), but they aren't because nothing in the system heap understands the intel_adsp caching weirdness. Didn't I remember that there was a patch a while back that made sure that we don't enable both KERNEL_COHERENCE and DYNAMIC_THREAD at the same time? While in theory they can be made to work, they really can't in a way that a practical system could ship.
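The mutual-exclusion patch Andy recalls would amount to a Kconfig dependency along these lines. This is an illustrative sketch only, not the actual Zephyr Kconfig tree; the real symbol definitions may differ:

```kconfig
# Hypothetical guard, for illustration: forbid selecting dynamic thread
# stacks together with KERNEL_COHERENCE (incoherent-cache SMP platforms).
config DYNAMIC_THREAD
	bool "Support for dynamic threads"
	depends on !KERNEL_COHERENCE
```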

@@ -111,6 +112,8 @@ ZTEST(dynamic_thread_stack, test_dynamic_thread_stack_alloc)
}
}

sys_cache_data_flush_all();
Contributor

Should add a comment explaining the specific coherence problems on intel_adsp, otherwise this looks like voodoo

@ceolin (Member, Author) Jan 31, 2024

That is fair. Btw, I just checked that this is not enough if the thread allocating these stacks switches to a different CPU before flushing the cache.

@andyross (Contributor) left a comment

Actually thinking more on this, I think I'm a -1. Workarounds like this simply aren't supposed to be needed. On incoherent xtensa platforms, you get uncached/coherent memory everywhere, for correctness.[1] You can get to cached memory if you insist, but that becomes an app design decision[2], and the app becomes responsible for validation.

I really don't think this test is doing any of that, not having been written with an eye to the xtensa cache madness. If it managed to get its hands on incoherent memory then it's either breaking rules about the use of its own stack in a way that KERNEL_COHERENCE can't catch, or we have a bug somewhere else in the platform layer. Either way we should fix that elsewhere, not patch the test.

[1] With the sole exception of thread stack memory, for which there's a mildly complicated validation framework in place to catch mistakes like e.g. putting shared kernel data on the stack.

[2] e.g. the SOF application heap allows you to specify whether you want cached memory or not.

@ceolin (Member, Author) commented Jan 31, 2024

> Hm. How is the dynamic stack incoherent? The only incoherent writable memory should be static thread stacks.
>
> Really the issue should be the opposite: we should want the dynamic stacks to be cached (to avoid a performance disaster), but they aren't because nothing in the system heap understands the intel_adsp caching weirdness. Didn't I remember that there was a patch a while back that made sure that we don't enable both KERNEL_COHERENCE and DYNAMIC_THREAD at the same time? While in theory they can be made to work, they really can't in a way that a practical system could ship.

Why not? That is not the system heap, but a heap defined in the application for this sole purpose.

> Actually thinking more on this, I think I'm a -1. Workarounds like this simply aren't supposed to be needed. On incoherent xtensa platforms, you get uncached/coherent memory everywhere, for correctness.[1] You can get to cached memory if you insist, but that becomes an app design decision[2], and the app becomes responsible for validation.

Agreed, the application is responsible. In this case the application is the test itself, isn't it? That is why the test defines a heap in the cached region.

The change, though, is not enough: if the thread that is allocating stacks changes CPU before flushing the cache, we have a problem. One way to handle this is to pin the thread that allocates the stacks, spawns the threads, and releases these resources.

> I really don't think this test is doing any of that, not having been written with an eye to the xtensa cache madness. If it managed to get its hands on incoherent memory then it's either breaking rules about the use of its own stack in a way that KERNEL_COHERENCE can't catch, or we have a bug somewhere else in the platform layer. Either way we should fix that elsewhere, not patch the test.
>
> [1] With the sole exception of thread stack memory, for which there's a mildly complicated validation framework in place to catch mistakes like e.g. putting shared kernel data on the stack.
>
> [2] e.g. the SOF application heap allows you to specify whether you want cached memory or not.

Isn't that what we are doing here? Defining a heap in a cached area and being responsible for it?

@andyross (Contributor)

Again though: how is the test getting the cached memory to begin with? Everything the default linker gives you (except statically defined thread stacks) should be uncached. That's where the bug is.

@ceolin (Member, Author) commented Jan 31, 2024

> Again though: how is the test getting the cached memory to begin with? Everything the default linker gives you (except statically defined thread stacks) should be uncached. That's where the bug is.

The heap is being defined in the __incoherent section: https://github.com/zephyrproject-rtos/zephyr/blob/main/tests/kernel/threads/dynamic_thread_stack/src/main.c#L22

Though, I have a different approach: move the heap to coherent memory and get the cached pointer for the allocated stack. I know, that is taking advantage of the memory being mapped twice, but the alternative is to really put the heap in incoherent memory and ensure that whoever manipulates that heap is aware of it. Threads just using the allocated stack do not need any further change, since the kernel expects stacks to be cached.

Please take a look at the new version.

@ceolin (Member, Author) commented Jan 31, 2024

@andyross that is an alternative to ..

https://gist.github.com/ceolin/83009ab1938ca6134a218cdab99aa441

What do you think? Do you see a problem with the one in this PR? I don't see a reason to disable dynamic thread stacks when there is incoherent memory; it seems to me that whoever uses it should be responsible for ensuring that cached memory is used properly.

@ceolin ceolin changed the title from "tests: dynamic_thread_stack: Flush cache" to "tests: dynamic_thread_stack: Fix multicore and inocherent cache" Jan 31, 2024
@@ -114,10 +114,27 @@ ZTEST(dynamic_thread_stack, test_dynamic_thread_stack_alloc)
/* spwan our threads */
for (size_t i = 0; i < N; ++i) {
tflag[i] = false;
#ifdef CONFIG_XTENSA
Member

with #68140, this will not be needed :)

Member Author

was hoping for that :)

@andyross (Contributor)

> The heap is being defined in __incoherent section. https://github.com/zephyrproject-rtos/zephyr/blob/main/tests/kernel/threads/dynamic_thread_stack/src/main.c#L22

Ah, bingo. OK, that's the root cause. That won't work, because the heap stores its metadata internally and needless to say that's going to break badly if it's allowed to skew between CPUs. Heaps on intel_adsp can only work over uncached memory, SOF makes caching work "inside" the blocks by padding/aligning them to cache lines, converting to a cached pointer, and invalidating on free. It's a lot more work than just putting it in the right section.

My preference would be to fix this by just removing that section assignment and eating the performance cost. Moving to a SOF-style heap would be possible but IMHO probably not worth it, especially given that the future of the platform is an MMU world where you wouldn't be using a sys_heap for stack allocation anyway (i.e. you'd just use a page or two and map them vs. trying to allocate the region contiguously).

@ceolin (Member, Author) commented Jan 31, 2024

> The heap is being defined in __incoherent section. https://github.com/zephyrproject-rtos/zephyr/blob/main/tests/kernel/threads/dynamic_thread_stack/src/main.c#L22
>
> Ah, bingo. OK, that's the root cause. That won't work, because the heap stores its metadata internally and needless to say that's going to break badly if it's allowed to skew between CPUs. Heaps on intel_adsp can only work over uncached memory, SOF makes caching work "inside" the blocks by padding/aligning them to cache lines, converting to a cached pointer, and invalidating on free. It's a lot more work than just putting it in the right section.

Yep, I learned this the hard way :/ Btw, that can work if you ensure that the thread manipulating the heap does not change CPU.

> My preference would be to fix this by just removing that section assignment and eating the performance cost. Moving to a SOF-style heap would be possible but IMHO probably not worth it, especially given that the future of the platform is an MMU world where you wouldn't be using a sys_heap for stack allocation anyway (i.e. you'd just use a page or two and map them vs. trying to allocate the region contiguously).

Do you see any problem with the current change? In the current implementation the heap is in uncached memory, but for the stack buffer we get a cached pointer (just because the memory is mapped twice). A target that does not map the memory twice will need to be more careful with the memory.

The buffer is page-aligned so we can map it properly for userspace; the metadata, AFAIU, is outside these pages and shouldn't be accessible from the user thread. Am I missing something?

@andyross (Contributor) commented Feb 1, 2024

Oh, sorry. I didn't look at the filename. The heap is in the test, not the subsystem code. Honestly can't you just revert the __incoherent bit? It's a test, we don't care about performance per se.

As far as whether this particular fix is enough, my guess is no, actually. It might fix the specific problem seen, but in the general case any mutation of the heap metadata (i.e. any alloc or free) needs to be coherent, which means it's not enough to flush in one spot: every time you're about to change the heap you need to get every other CPU to invalidate the region that is going to be changed (which isn't really well-bounded, it can touch other chunks than just the ones you pass) and then flush that region from your own core. That's even harder than a SOF-style cached-blocks-inside-uncached-heap trick.

Just not worth it, IMHO. Removing the __incoherent is the right thing.

@ceolin (Member, Author) commented Feb 1, 2024

> Oh, sorry. I didn't look at the filename. The heap is in the test, not the subsystem code. Honestly can't you just revert the __incoherent bit? It's a test, we don't care about performance per se.
>
> As far as whether this particular fix is enough, my guess is no, actually. It might fix the specific problem seen, but in the general case any mutation of the heap metadata (i.e. any alloc or free) needs to be coherent, which means it's not enough to flush in one spot: every time you're about to change the heap you need to get every other CPU to invalidate the region that is going to be changed (which isn't really well-bounded, it can touch other chunks than just the ones you pass) and then flush that region from your own core. That's even harder than a SOF-style cached-blocks-inside-uncached-heap trick.
>
> Just not worth it, IMHO. Removing the __incoherent is the right thing.

It is already removed :) And there is no flush; I had figured out the problem. The other proposal I had was to have only one pinned thread manipulating the heap, but that is out of scope for this test. I put the heap in coherent memory, but I am using the cached region for the stack.

* region but the architecture maps addressable RAM twice in
* two different regions in a way that for any given pointer,
* it is possible to convert it to/from a cached version.
*
Contributor

This is closer but still not quite right: the trick only works if the stack size and alignment are padded to fit within an integer number of cache lines, with no overlap on either side that might be used (e.g. for heap metadata) from another core. Or alternatively (since the allocation is done in subsystem code) you can trim out the cache-aligned subrange of the stack memory and use that.

Member Author

That makes sense. Thinking about it a little more, isn't that a general requirement for targets with incoherent memory? I mean, even static stacks should be cache-aligned, right? Otherwise, when flushing them we may overwrite adjacent data on other cores. I don't remember having seen this; I will take a look. Or maybe I am just wrong about it :)

@ceolin ceolin force-pushed the tests/dynamic-thread branch from e6b7318 to 98e1169 on February 6, 2024 17:45
@ceolin (Member, Author) commented Feb 6, 2024

@andyross you were right, we should just pay the price for dynamic stacks. Keeping them in cached regions is overcomplicated: it is not only about cache alignment, but also that, when userspace is enabled, the stack address is used to track the kernel object. When we pass a cached pointer, the kernel is not able to find the kernel object associated with it, requiring a lot of additional checks in different places.

andyross previously approved these changes Feb 6, 2024

@andyross (Contributor) left a comment

Yeah. Cache incoherency is awful. I'll wear the scars from the APL DSP SMP bringup for the rest of my life.

@nashif (Member) commented Feb 6, 2024

@ceolin please fix that typo

Flavio Ceolin added 2 commits February 6, 2024 20:35
When allowing dynamic thread stack allocation, the stack may come from
the heap in coherent memory; trying to use cached memory is
overcomplicated because of heap metadata and cache line sizes.
Also, when userspace is enabled, stacks have to be page-aligned and the
address of the stack is used to track kernel objects.

Signed-off-by: Flavio Ceolin <flavio.ceolin@intel.com>
When running in a multicore environment with an incoherent cache, it is
possible that the thread allocating dynamic stacks is switched to a
different CPU. In that situation, further access to that memory (such as
when releasing resources) will be invalid.

Signed-off-by: Flavio Ceolin <flavio.ceolin@intel.com>
@ceolin ceolin force-pushed the tests/dynamic-thread branch from 98e1169 to 3ea82dd on February 6, 2024 20:36
@kartben kartben changed the title from "tests: dynamic_thread_stack: Fix multicore and inocherent cache" to "tests: dynamic_thread_stack: Fix multicore and incoherent cache" Feb 6, 2024
@fabiobaltieri fabiobaltieri added this to the v3.6.0 milestone Feb 7, 2024
@andyross (Contributor) left a comment

refresh +1

@nashif nashif merged commit 36a497b into zephyrproject-rtos:main Feb 9, 2024
32 checks passed

Successfully merging this pull request may close these issues.

intel_adsp: cavs25 tests/kernel/threads/dynamic_thread_stack/kernel.threads.dynamic_thread.stack* fails
7 participants