tests: dynamic_thread_stack: Fix multicore and incoherent cache #68311
Conversation
Hm. How is the dynamic stack incoherent? The only incoherent writable memory should be static thread stacks.
Really the issue should be the opposite: we should want the dynamic stacks to be cached (to avoid a performance disaster), but they aren't, because nothing in the system heap understands the intel_adsp caching weirdness. Don't I remember a patch a while back that made sure we don't enable both KERNEL_COHERENCE and DYNAMIC_THREAD at the same time? While in theory they can be made to work, they really can't in a way that a practical system could ship.
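(For illustration only: a mutual exclusion like that could be expressed as a build-time guard along these lines; the real gate, if one exists, would more likely live in Kconfig, and none of this is quoted from the tree.)

```c
/* Hypothetical sketch: reject configs that enable both options, since
 * dynamic stacks come from heap memory the coherence layer can't cover.
 */
#if defined(CONFIG_KERNEL_COHERENCE) && defined(CONFIG_DYNAMIC_THREAD)
#error "DYNAMIC_THREAD is not supported together with KERNEL_COHERENCE"
#endif
```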
@@ -111,6 +112,8 @@ ZTEST(dynamic_thread_stack, test_dynamic_thread_stack_alloc)
	}
}

	sys_cache_data_flush_all();
Should add a comment explaining the specific coherence problems on intel_adsp, otherwise this looks like voodoo
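(For illustration, the kind of comment being requested might read roughly like this; a sketch, not text from the PR:)

```c
/* intel_adsp data caches are not coherent across cores: stack memory
 * written through this core's cache may not be visible to another core
 * that later runs one of the spawned threads, so flush it all out
 * before they can be scheduled elsewhere.
 */
sys_cache_data_flush_all();
```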
That is fair. Btw, I just checked that this is not enough if the thread allocating these stacks switches to a different CPU before flushing the cache.
Actually thinking more on this, I think I'm a -1. Workarounds like this simply aren't supposed to be needed. On incoherent xtensa platforms, you get uncached/coherent memory everywhere, for correctness.[1] You can get to cached memory if you insist, but that becomes an app design decision[2], and the app becomes responsible for validation.
I really don't think this test is doing any of that, not having been written with an eye to the xtensa cache madness. If it managed to get its hands on incoherent memory, then either it's breaking rules about the use of its own stack in a way that KERNEL_COHERENCE can't catch, or we have a bug somewhere else in the platform layer. Either way we should fix that elsewhere, not patch the test.
[1] With the sole exception of thread stack memory, for which there's a mildly complicated validation framework in place to catch mistakes like e.g. putting shared kernel data on the stack.
[2] e.g. the SOF application heap allows you to specify whether you want cached memory or not.
Why not? That is not the system heap, but a heap defined in the application for this purpose only.

Agree, the application is responsible. In this case the application is the test itself, isn't it? That is why the test defines a heap in the cached region. The change, though, is not enough: if the thread that is allocating stacks changes CPU before flushing the cache, we have a problem. One way to handle this is pinning the thread that allocates the stacks, spawns the threads, and releases these resources (sketched below).

Isn't that what we are doing here? Defining a heap in a cached area and being responsible for it?
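(A minimal sketch of that pinning idea, assuming CONFIG_SCHED_CPU_PIN; the function name is hypothetical and this is not code from the PR:)

```c
#include <zephyr/cache.h>
#include <zephyr/kernel.h>

/* Keep the allocating thread on a single CPU so its cached view of the
 * heap stays self-consistent across alloc, spawn and free, then flush
 * once everything has been written.
 */
static void alloc_spawn_release(void)
{
	k_thread_cpu_pin(k_current_get(), 0); /* pin to CPU 0 for the test */

	/* ... k_thread_stack_alloc(), k_thread_create(), k_thread_join()
	 * and k_thread_stack_free() all run from this CPU ...
	 */

	sys_cache_data_flush_all();
}
```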
Again though: how is the test getting the cached memory to begin with? Everything the default linker gives you (except statically defined thread stacks) should be uncached. That's where the bug is.
The heap is being defined in the __incoherent section: https://github.com/zephyrproject-rtos/zephyr/blob/main/tests/kernel/threads/dynamic_thread_stack/src/main.c#L22 Though, I have a different approach: move the heap to coherent memory and get the cached pointer for the allocated stack. I know, that is taking advantage of the memory being mapped twice, but the alternative is really to put the heap in incoherent memory and ensure that whoever manipulates that heap is aware of it. Threads just using the allocated stack do not need any further change, since the kernel expects stacks to be cached. Please take a look at the new version.
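(A sketch of what that approach amounts to, assuming Zephyr's sys_heap and cache-pointer helpers; coherent_heap and the alignment argument are stand-ins for the test's actual definitions, not code from the PR:)

```c
#include <zephyr/cache.h>
#include <zephyr/kernel.h>
#include <zephyr/sys/sys_heap.h>

static struct sys_heap coherent_heap; /* assumed to sit over uncached RAM */

/* The heap metadata stays in coherent (uncached) memory, so it is safe
 * across CPUs; the caller gets the cached alias of the stack buffer,
 * which is what the kernel expects a stack to be.
 */
static k_thread_stack_t *stack_alloc_cached(size_t size, size_t align)
{
	void *buf = sys_heap_aligned_alloc(&coherent_heap, align, size);

	return (buf == NULL) ? NULL
			     : (k_thread_stack_t *)sys_cache_cached_ptr_get(buf);
}
```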
(force-pushed from d141d38 to e6b7318)
@andyross that is an alternative to https://gist.github.com/ceolin/83009ab1938ca6134a218cdab99aa441 What do you think? Do you see a problem with the one in this PR? I don't see a reason to disable dynamic thread stacks on targets with incoherent memory; it seems to me that whoever uses them should be responsible for ensuring cached memory is used properly.
@@ -114,10 +114,27 @@ ZTEST(dynamic_thread_stack, test_dynamic_thread_stack_alloc)
	/* spwan our threads */
	for (size_t i = 0; i < N; ++i) {
		tflag[i] = false;
#ifdef CONFIG_XTENSA
with #68140, this will not be needed :)
was hoping for that :)
Ah, bingo. OK, that's the root cause. That won't work, because the heap stores its metadata internally, and needless to say that's going to break badly if it's allowed to skew between CPUs. Heaps on intel_adsp can only work over uncached memory; SOF makes caching work "inside" the blocks by padding/aligning them to cache lines, converting to a cached pointer, and invalidating on free. It's a lot more work than just putting it in the right section.

My preference would be to fix this by just removing that section assignment and eating the performance cost. Moving to a SOF-style heap would be possible but IMHO probably not worth it, especially given that the future of the platform is an MMU world where you wouldn't be using a sys_heap for stack allocation anyway (i.e. you'd just use a page or two and map them, vs. trying to allocate the region contiguously).
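(A rough sketch of the SOF-style scheme described above, assuming Zephyr's sys_heap and cache APIs; unlike SOF, the free path here takes the block size from the caller, and both function names are hypothetical:)

```c
#include <zephyr/cache.h>
#include <zephyr/sys/sys_heap.h>
#include <zephyr/sys/util.h>

static struct sys_heap uncached_heap; /* metadata lives in coherent RAM */

/* Round every block out to whole cache lines so no line is shared with
 * heap metadata or a neighboring chunk, then hand out the cached alias.
 */
static void *cached_block_alloc(size_t bytes)
{
	size_t line = sys_cache_data_line_size_get();
	void *blk = sys_heap_aligned_alloc(&uncached_heap, line,
					   ROUND_UP(bytes, line));

	return (blk == NULL) ? NULL : sys_cache_cached_ptr_get(blk);
}

/* Invalidate the cached alias before freeing so stale dirty lines can't
 * be written back over a block that has since been reallocated.
 */
static void cached_block_free(void *ptr, size_t bytes)
{
	size_t line = sys_cache_data_line_size_get();

	sys_cache_data_invd_range(ptr, ROUND_UP(bytes, line));
	sys_heap_free(&uncached_heap, sys_cache_uncached_ptr_get(ptr));
}
```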
Yep, got this the hard way :/ Btw, that can work if you ensure that the thread manipulating the heap does not change CPU.
Did you see any problem with the current change? In the current implementation the heap is in uncached memory, but for the stack buffer we get a cached pointer (just because the memory is mapped twice). If we have a target that does not map the memory twice, it will need to be more careful with the memory. The buffer is page aligned so we can map it properly for userspace; the metadata, AFAIU, is outside these pages and shouldn't be accessible from the user thread. Am I missing something?
Oh, sorry. I didn't look at the filename. The heap is in the test, not the subsystem code. Honestly, can't you just revert the __incoherent bit? It's a test; we don't care about performance per se.

As far as whether this particular fix is enough, my guess is no, actually. It might fix the specific problem seen, but in the general case any mutation of the heap metadata (i.e. any alloc or free) needs to be coherent, which means it's not enough to flush in one spot: every time you're about to change the heap you need to get every other CPU to invalidate the region that is going to be changed (which isn't really well-bounded; it can touch chunks other than the ones you pass) and then flush that region from your own core. That's even harder than a SOF-style cached-blocks-inside-uncached-heap trick. Just not worth it, IMHO. Removing the __incoherent is the right thing.
It is already removed :) And there is no flush; I'd figured out the problem. The other proposal I had was to have only one pinned thread manipulating the heap, but that is out of scope for this test. I put the heap in coherent memory, but I am using the cached region for the stack.
 * region but the architecture maps addressable RAM twice in
 * two different regions in a way that for any given pointer,
 * it is possible to convert it to/from a cached version.
 *
This is closer but still not quite right: the trick only works if the stack size and alignment are padded to fit within an integer number of cache lines, with no overlap on either side that might be used (e.g. for heap metadata) from another core. Or alternatively (since the allocation is done in subsystem code) you can trim out the cache-aligned subrange of the stack memory and use that.
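(The "trim out the cache-aligned subrange" variant, as a minimal sketch with a hypothetical helper:)

```c
#include <stdint.h>
#include <zephyr/cache.h>
#include <zephyr/sys/util.h>

/* Shrink [buf, buf + size) to the largest inner range that starts and
 * ends on cache line boundaries, so no usable line is shared with
 * whatever sits on either side (e.g. heap metadata).
 */
static void *cache_line_subrange(void *buf, size_t size, size_t *out_size)
{
	size_t line = sys_cache_data_line_size_get();
	uintptr_t start = ROUND_UP((uintptr_t)buf, line);
	uintptr_t end = ROUND_DOWN((uintptr_t)buf + size, line);

	*out_size = (end > start) ? (end - start) : 0;
	return (void *)start;
}
```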
That makes sense. Thinking about it a little more, isn't that a general requirement for targets with incoherent memory? I mean, even static stacks should be cache aligned, right? Otherwise when flushing them we may be overwriting adjacent data used by other cores. I don't remember having seen this; I will take a look. Or I am just wrong about it :)
(force-pushed from e6b7318 to 98e1169)
@andyross you were right, we should just pay the price in dynamic stacks. Keeping them in cached regions is overcomplicated: it is not only about cache alignment, but when userspace is enabled the stack address is used to track the kernel object, and when we pass a cached pointer the kernel is not able to get the kernel object associated with it, requiring a lot of additional checks in different places.
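(The object-tracking problem in miniature, as a hypothetical check with assumed semantics: the cached alias is simply a different address from the one the kernel registered, so any address-keyed lookup misses:)

```c
#include <stdbool.h>
#include <zephyr/cache.h>
#include <zephyr/kernel.h>

/* On intel_adsp the same RAM is visible at two addresses. A stack
 * registered under its uncached address will not be found if a user
 * thread later presents the cached alias to a syscall.
 */
static bool lookup_would_miss(k_thread_stack_t *stack)
{
	void *alias = sys_cache_cached_ptr_get(stack);

	return alias != (void *)stack;
}
```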
Yeah. Cache incoherency is awful. I'll wear the scars from the APL DSP SMP bringup for the rest of my life.
@ceolin please fix that typo
When allowing dynamic thread stack allocation, the stack may come from the heap in coherent memory; trying to use cached memory is overcomplicated because of heap metadata and cache line sizes. Also, when userspace is enabled, stacks have to be page aligned and the address of the stack is used to track kernel objects. Signed-off-by: Flavio Ceolin <flavio.ceolin@intel.com>
When running it in a multicore environment with an incoherent cache, it is possible that the thread allocating dynamic stacks is switched to a different CPU. In this situation further accesses to that memory (like when releasing resources) will be invalid. Fixes #67515 Signed-off-by: Flavio Ceolin <flavio.ceolin@intel.com>
(force-pushed from 98e1169 to 3ea82dd)
refresh +1