Arm64: Add SVE/SVE2 support in .NET 9 #93095
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

Overview

As we did in the past for .NET 5, .NET 7 and .NET 8, we would like to continue improving Arm64 in .NET 9 as well. Here are the top-level themes we plan to address. Some of the issues are from past releases that we did not get time to work on, while others are about adding instructions from newer Arm versions or exposing Arm functionality at the .NET API level.

SVE2 support

Until Armv8, the NEON architecture enabled users to write vectorized code using SIMD instructions. The vector length for NEON instructions remains fixed at 128 bits. To accommodate High Performance Computing applications, newer Arm versions such as Armv9 have introduced the Scalable Vector Extension (SVE). SVE is a new programming model that provides SIMD instructions operating on a flexible vector length, ranging from 128 bits to 2048 bits. SVE2 extends SVE to enable domains like computer vision, multimedia, etc. More details about SVE can be found on Arm's website. In .NET 9, we want to start the foundational work in the .NET libraries and RyuJIT to add support for the SVE2 feature.
New instructions
Performance improvements
Stretch goals
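As a rough illustration of the length-agnostic model that `Vector<T>` already offers in .NET (and that SVE's flexible vector length maps onto naturally), here is a minimal sketch. `Vector<int>.Count` is fixed by the JIT for the underlying hardware rather than by the source code, so the same binary adapts to whatever vector width is available:

```csharp
using System;
using System.Numerics;

class VectorSumExample
{
    // Sums the span using whatever vector width the hardware provides
    // (e.g. 128-bit NEON today; wider if the JIT targets a wider ISA).
    static int Sum(ReadOnlySpan<int> values)
    {
        var acc = Vector<int>.Zero;
        int i = 0;

        for (; i <= values.Length - Vector<int>.Count; i += Vector<int>.Count)
        {
            acc += new Vector<int>(values.Slice(i));
        }

        int sum = Vector.Sum(acc); // horizontal add (.NET 6+)

        for (; i < values.Length; i++) // scalar tail
        {
            sum += values[i];
        }

        return sum;
    }

    static void Main()
    {
        Console.WriteLine(Sum(new[] { 1, 2, 3, 4, 5, 6, 7, 8, 9 })); // 45
    }
}
```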
We should think about how this is going to work with the new SVE streaming mode. Do we expect to support SVE streaming mode in .NET eventually? If yes, how is it going to affect the design?
I'd presume it would be desirable to support. Giving users the power of per-hardware light-up is a significant advantage for many of the backing frameworks that power .NET applications.
The streaming instructions are more abstract and I don't think the design for C++ has been finalized yet either. I at least don't see any of the new instructions under https://developer.arm.com/architectures/instruction-sets/intrinsics I expect they will be a completely separate consideration for how they're supported as compared to the more standard SIMD processing instructions, and we'll need to cross that bridge when we come to it. This is particularly true given how it allows the effective SVE length to be changed and requires explicit start/stop instructions. And so I imagine the runtime itself will require some complex changes to work with the concept and ensure it doesn't impact the ABI, field usage, inlining boundaries, etc.
My understanding is that the streaming mode uses the same instructions as regular SVE mode (a subset of them), except that the instructions operate on larger vector sizes. If we were to use the current This observation made me think that reusing
That is effectively how C describes its SVE support. They are similar to "incomplete" types in many aspects, but with a few less restrictions. They can't be used with At least for non-streaming, a lot of that is only restrictive because C/C++ has existing requirements for things like If we did define a new type, I think we'd functionally be defining a
This is my understanding as well. For reference, the Procedure Call Standard is here: https://github.com/ARM-software/abi-aa/releases/download/2023Q1/aapcs64.pdf While the
Then, when we're ready to take a deeper look at the feature, we can determine whether using |
Good point about AOT support for non-streaming SVE. How are we going to do that with |
It ultimately depends on the code the user writes and how much we want to invest. For recommended coding patterns, where the user treats If the user deviates from this and starts using We also have the option of generation |
The design that we choose has a large influence over the cost required to support it. If we choose a design that mirrors the C/C++ design, the cost of supporting SVE for AOT and for streaming mode is going to be very low. If we choose the current proposed design that uses existing
It is not unusual to see the
In the limit, this means creating a type loader that computes field layout at runtime and teaching codegen to use dynamic sizes and offsets produced by the type loader. We have done that in .NET Native for UWP, it was very complicated and produced poor results. I do not think we would ever want to do that again for form-factors without a JIT.
Yes, SVE is a nascent technology. It is reasonable to expect that there will be implementations that take advantage of the full range of allowed lengths, with both streaming and non-streaming modes. We should not take shortcuts based on what is available today.
The TL;DR is I think the evidence of reusing
I agree. It also dictates the ease at which it can be integrated into existing SIMD algorithms, used in shared coding patterns, light up implicitly around non-SVE based code, and the cost that is moved out of the runtime and onto other tools such as analyzers or the language, etc.
Possibly, but in turn I believe the cost of supporting the feature in general and the cost of integrating it into the rest of the ecosystem is going to be significantly higher. A net new type, with the level of restrictions being discussed, requires the VM and JIT to block and fail for any number of the problematic patterns. It likely also requires a complex analyzer or language-level feature (it is the latter in C/C++) to direct users towards "the right thing", since the wrong thing will cause true failures at runtime. That then extends to considerations beyond C# and impacts other languages where intrinsics can be used (such as F#), where they then need to do the same/similar work as well. A net new type means a significant increase to the existing API surface area and less integration with existing features or algorithms. Given the restrictions it has, it would be much like A net new type means that it is harder to implicitly light up existing code paths. It means that a computer with 512-bit SVE might not be able to successfully use All of this raises the complexity bar significantly higher, introduces more risk towards being able to successfully ship the feature, and reduces the chance of adoption. Particularly if it was a language feature, it would likewise take time and effort away from other more important features which would be more broadly used and applicable. On the other hand, reusing
I agree that supporting the full range of lengths is desirable. I disagree that only supporting a subset today is taking a shortcut. We have up to 2 sizes (128 and 256) that "need" support today because they have real world hardware that .NET will likely run on. We then have 1 additional size (512), with real hardware today, that could theoretically be supported, but only if we expect to run on a supercomputer. We then have 2 more sizes (1024 and 2048) that could theoretically appear in the next 3 years, but which are incredibly unlikely to be supported in the .NET 9 timeframe. Finally, we have 11 sizes (the other multiples of 128) which are technically supported by SVE, but which are disallowed by SVE2. Such support doesn't ever need to be provided, since it was optional and I'm unaware of any hardware that actually shipped it. The change in SVE2 has effectively deprecated it as well, which decreases the value further. It is then, in my mind, completely reasonable to prioritize the existing support in the first iteration of the implementation and to limit any apps produced for it to those sizes. After all, .NET 9 only needs to consider hardware that will reasonably be targeted in the next 30 months (12 months till ship + 18 months of support), including Android/iOS. Limiting the feature to just support 128-bit and 256-bit (for the one piece of hardware that currently supports that) should then be completely fine, and allows us to spread the more involved work out over future releases when and where it becomes relevant.
I don't see this as the case given the reasons above. The base AOT work required is effectively the same regardless. The main difference is that with a net new type we can simply choose to have the VM throw a type load exception if an SVE based vector is used "incorrectly", rather than emit the slower codegen required to support Even supporting things like locals functionally needs a lot of the same support around offset calculations and the like. For example, taking the pointer to an SVE based vector is still allowed, as is dereferencing it, simply not doing pointer arithmetic on it (that is for

```c
#include <arm_sve.h>

extern void M(int* px, svint32_t* py, svint32_t* pz, int* pw);

void N()
{
    int x;
    svint32_t y, z;
    int w;

    M(&x, &y, &z, &w);
}
```

This code, as can be seen on https://godbolt.org/z/6qdz71G8q, then requires you to effectively dynamically allocate the stack space in the prologue, to create space for The only real "additional" complexity around supporting There is likewise nothing requiring we do all the work at once. AOT is more limited than a JIT environment and we correspondingly already have features that don't work end to end or which may be more limited in the former. Finally, by having this support be around
Byref-like types provide the restrictions that we would need here. There may be a few additional ones around taking the size, depending on the exact design. It should be very straightforward to enforce at runtime. It should not require analyzers or language support; exceptions thrown at runtime should be enough.
I see implicit light-up of existing Vector paths as an independent problem from SVE-specific intrinsics. The light-up of existing architecture-neutral Vector paths can work the same way as AVX512 light-up; I do not see a problem with that. We have been designing the AVX512 light up for existing My concern is specifically about the type used with architecture-specific SVE instructions. I do not think that it is appropriate for this type to be configurable. This type should always match the SVE bitness of the underlying platform (and current streaming mode). It should not be user configurable.
I do not see why the new type for SVE-specific instructions significantly increases the existing API surface. We are talking about adding a ton of new SVE-specific intrinsics. A new supporting type for them sounds like a drop in the bucket. I agree that a new type means that you need to convert from/to the new type if you mix and match platform-specific and platform-neutral methods. It is a problem today as well if you mix and match
This is only true if we have good understanding of what we are going to do in the future releases. We are discussing many options here. We should have a firm plan that we agree on as viable.
This is not the interesting complex case. This can all be handled by codegen, as you have said, and it does not require the runtime type loader. The interesting complex cases are classes, statics and generics like
We have the option of making the new type behave the same as
I think it would be good if we had a meeting to discuss this more in depth. I want to make sure we are at least on the same page with regards to each other's concerns and the impact they would have on the ecosystem. From my perspective, we have something today that works and makes .NET one of the best places to write SIMD code, particularly when it comes to supporting multiple platforms. Regular SVE is a new feature that meshes very well with the existing conventions and is the mainline scenario for Arm64 moving forward. There exists similar functionality designed for other platforms that also makes the approach viable. Streaming SVE is a more niche feature for even more specialized contexts. It is namely designed for complex matrix operations (which is why it is introduced as part of SME); this is much like Intel AMX, which also supports matrix operations but does so in its own unique mechanism. Streaming SVE deviates from any of the existing approaches and so doesn't mesh cleanly. This is namely because it is a switch that allows user-mode to dynamically change the size of a vector. It works similarly to other dynamic CPU configuration that .NET has historically opted not to support. Some examples include IEEE 754 exceptions, changing the floating-point rounding mode, setting strict alignment, etc. Because of the scenarios it's designed around and because of how it operates, it should be designed and considered separately. We may even determine it's not desirable to support at all, or should only be supported in a limited fashion. But we should not be restricting or hindering the more mainline customer scenario, nor should we be making it harder for them to utilize and integrate the new mainline functionality into their existing code and algorithms because this feature exists.
I don't think that's viable. If we're looking at matching what C/C++ requires, then there are a lot of restrictions put in place. That includes restricting things like structs with If we aren't looking at matching, then whether or not these objects can be declared on the heap has little impact on the JIT support. It may have some minimal impact on the GC support, but that's much less impacted with regions. Preventing their use in generics and their ability to implement interfaces is effectively going against everything we're trying to give users and ourselves around SIMD vectorization, and will significantly raise the maintenance burden and complexity of supporting these types in the BCL, so much so that we may simply opt to not provide such paths.
I don't see them as independent. This also isn't just about AVX-512 light up works because it allows implicit use of new instructions with existing V128/V256 code paths and with existing patterns that developers are using with such code. Users can then additionally opt into providing a V512 code path if that is beneficial for their scenario and they can take explicit advantage of AVX-512 intrinsics, seamlessly, with the relevant IsSupported checks. This then further interplays with new prototypes, like By creating a new type and particularly by restricting it to be a
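As a rough sketch of the tiered light-up pattern being described: the `IsHardwareAccelerated` properties below are real .NET 8 APIs, but the helper itself is hypothetical and only illustrates how wider ISAs are chosen with seamless fallback:

```csharp
using System;
using System.Runtime.Intrinsics;

class LightUpExample
{
    // Hypothetical helper: each check is a JIT-time constant, so dead
    // branches are eliminated and only the best supported path remains.
    static string DescribeBestVectorWidth()
    {
        if (Vector512.IsHardwareAccelerated)
        {
            return "512-bit path (e.g. AVX-512)";
        }
        if (Vector256.IsHardwareAccelerated)
        {
            return "256-bit path (e.g. AVX2)";
        }
        if (Vector128.IsHardwareAccelerated)
        {
            return "128-bit path (e.g. SSE2 or NEON)";
        }
        return "scalar fallback";
    }

    static void Main() => Console.WriteLine(DescribeBestVectorWidth());
}
```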
This is an explicit feature of SVE, one which operating systems expose and allow to be configured on a per-app basis. For example, on Linux: https://www.kernel.org/doc/html/v5.8/arm64/sve.html#prctl-extensions It is completely appropriate and designed to allow this, so that a given application can default to the system-configured size or can otherwise opt for a different size. .NET can and should take advantage of this, preferring this to be set once at startup and then leaving it as UB if changed after (which is how C/C++ works as well).
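For illustration, the kernel interface linked above could be reached from managed code with a P/Invoke along these lines. This is a hedged sketch, not a proposal: the `prctl` option values (50/51) and the length mask come from `linux/prctl.h`, it only applies to Linux arm64 with SVE, and the runtime would more likely configure this internally at startup than expose it to users:

```csharp
using System;
using System.Runtime.InteropServices;

static class SveVectorLength
{
    private const int PR_SVE_SET_VL = 50;        // from linux/prctl.h
    private const int PR_SVE_GET_VL = 51;        // from linux/prctl.h
    private const int PR_SVE_VL_LEN_MASK = 0xffff;

    [DllImport("libc", SetLastError = true)]
    private static extern int prctl(int option, ulong arg2, ulong arg3, ulong arg4, ulong arg5);

    // Returns the current SVE vector length in bytes, or -1 if unsupported.
    public static int GetVectorLengthBytes()
    {
        int result = prctl(PR_SVE_GET_VL, 0, 0, 0, 0);
        return result < 0 ? -1 : result & PR_SVE_VL_LEN_MASK;
    }

    // Requests a new vector length in bytes; the kernel may round it down.
    public static bool TrySetVectorLengthBytes(int bytes)
        => prctl(PR_SVE_SET_VL, (ulong)bytes, 0, 0, 0) >= 0;
}
```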
It is one primary new type, which effectively involves taking
I don't understand this sentiment. We know what hardware exists today and is likely to exist in the next 3 years. We have existing plans and guidance documentation on how vectorized code works and how we are handling the needs of allowing both platform specific and cross platform code to exist and to allow seamless transition between them.
Making the new type a There's also no reason we can't block those for AOT or JIT scenarios. We are allowed to make breaking changes across major versions. Using vectorization is already decently niche compared to most things, and the user doing something like this is effectively an anti-pattern that goes against how SIMD/vectorization is meant to be done. However, it likewise shouldn't really be overly complex to handle such cases, particularly if you aren't needing to consider
That seems like an overall worse world, for all the reasons listed above. It would, in my belief, strongly go against making SVE viable in .NET. It would likewise hinder any other platforms that have support for a length agnostic vector type. |
Please link #76047 to this issue too. |
Done. |
@kunalspathak, @jkotas, and I got together to sync up on the The conclusion we came to is that we recognize supporting all of these modes is important, and we recognize the potential conflict between what might make one easier to use vs. making them all easier to use. We then believe we can sufficiently write an analyzer to help push users towards writing code that works well in AOT scenarios and, by extension, Streaming SVE scenarios for the JIT. It is then desirable to continue on the path of utilizing However, we realize that there are still some unknowns in this area and more investigatory work and discussion needs to be done to ensure that this approach is sound. We plan on doing that investigatory work and revisiting this in a few months' time (likely around March-April), at which point we should have a better understanding of how problematic providing that around
Once the full list of proposals for SVE is up, I can schedule them for API review. I can do the same for SVE2 and the other extensions when those issues are up. |
I chatted with @a74nh today about how to expose Streaming APIs and SVE APIs that are streaming compatible, and here are some raw notes that came out about exposing streaming behavior. @a74nh - please add if I missed anything.

Streaming instructions (and hence the .NET SME APIs) are executed when streaming mode is ON. In this mode, some of the SVE instructions do not work and are hence "incompatible". We talked about how we should surface that information to the user level. We could force the C# compiler to give compilation errors to the user if they try to use non-streaming APIs in a method that is marked as "streaming" (option 1 below), or we can hide this abstraction and let the JIT handle turning streaming on/off and saving/restoring streaming state (option 2 below), or we can have something in between (option 3).

Option 1. C# support

Add support in C# to give an error explicitly to the user, and possibly a new syntax of

Advantages:
Disadvantages:
C++ compilers might end up with this option (except without a keyword).

Option 2. JIT support

Expose

```
prev_streaming_status = get_streaming_status();  // JIT inserted
turn_streaming_on();                             // JIT inserted
call streaming_method();
set_streaming_status(prev_streaming_status);     // JIT inserted
```

However, this needs to happen even for non-streaming methods, i.e. all the other regular NEON methods:

```
prev_streaming_status = get_streaming_status();  // JIT inserted
turn_streaming_off();                            // JIT inserted
call non_streaming_method();
set_streaming_status(prev_streaming_status);     // JIT inserted
```

Advantages:
Disadvantages:
Option 3. Libraries support

In addition to
Advantages:
Disadvantages:
Testing:
Usage

```
[Local_Streaming]
MyFunc1()
{
    A(); // streaming
    B(); // streaming
}

[Streaming]
MyFunc2()
{
    D(); // streaming

    previous_state = SME.IsStreamingOn();
    SME.TurnOnStreaming();
    E(); // non-streaming
    SME.SetStreamingMode(previous_state);

    F(); // streaming
}

// Since "Local_Streaming", no need to save/restore
MyFunc1();

// Since "Streaming", save and restore
previous_state = SME.IsStreamingOn();
SME.TurnOnStreaming();
MyFunc2();
SME.SetStreamingMode(previous_state);
```

References
TODO
The streaming mode changes the size of
I think that the best shape for the API to turn streaming on/off would be a method that takes a delegate to be executed in streaming mode. The analyzer can enforce that the target method is streaming compatible, among other things.
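A minimal sketch of that delegate-based shape, purely hypothetical: none of these names are approved APIs, the real mode switching would be done by the runtime rather than the stub fields shown here, and an analyzer would check the delegate body for streaming compatibility:

```csharp
using System;

// Hypothetical API surface for scoped streaming-mode execution.
public static class Sme
{
    // Runs 'body' with streaming mode turned on, restoring the previous
    // mode afterwards, even if the delegate throws.
    public static void RunStreaming(Action body)
    {
        bool previous = IsStreamingOn();
        TurnOnStreaming();
        try
        {
            body();
        }
        finally
        {
            SetStreamingMode(previous);
        }
    }

    // Stub state standing in for runtime-managed mode, so the sketch compiles.
    private static bool s_streaming;
    public static bool IsStreamingOn() => s_streaming;
    public static void TurnOnStreaming() => s_streaming = true;
    public static void SetStreamingMode(bool on) => s_streaming = on;
}
```

Usage would look like `Sme.RunStreaming(() => /* streaming-compatible code */);`, which gives the analyzer a clear lexical scope to check.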
It's worth explicitly stating that the same general premise applies whether we use -- I just want to reiterate this, since a separate type will only solve a small subset of the overall considerations, namely what happens when you encounter the underlying vector type on the GC heap. So it isn't necessarily viable for us to use a separate type or duplicate the API surface between Streaming and Non-streaming modes.
I expect we need a little bit of a balance here given the two related, but somewhat differing considerations. We really have both For intrinsics themselves, most SVE instructions are streaming compatible and when For user-defined methods, I agree that we need some way to attribute them as compatible. This one is a little more difficult as what is compatible or incompatible depends on a few factors, including what the Given the limitations and the potential for unsafety, I almost want to say a viable way to handle this would be to introduce a new |
Nit: byref-like types cannot show up on the GC heap. byref-like type would address this particular concern by construction. I agree that byref-like type would not address all streaming safety issues by construction.
We do not know what this set is. I expect that the BCL is going to be streaming incompatible by default. It would be prohibitively expensive to audit the whole BCL and annotate it as streaming compatible/incompatible. We may choose to audit and annotate a small part of the BCL as streaming compatible, and this set can grow over time.
I do not see the benefit of an unmanaged calling convention compared to an attribute. An analyzer that enforces the streaming-compatible rules is the main gatekeeper. An attribute is a more natural way to mark the methods that the analyzer should enforce the rules for.
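To make the attribute-plus-analyzer idea concrete, here is a hypothetical sketch; the attribute name and shape are assumptions for discussion, not a proposed API, and the "OK per the analyzer" behavior describes a rule the analyzer would enforce rather than anything the compiler checks today:

```csharp
using System;

// Hypothetical marker the analyzer would look for: methods carrying it
// may only call other [StreamingCompatible] methods and intrinsics that
// are legal in streaming mode.
[AttributeUsage(AttributeTargets.Method, Inherited = false)]
public sealed class StreamingCompatibleAttribute : Attribute
{
}

public static class Kernels
{
    [StreamingCompatible]
    public static int AddScalar(int a, int b) => a + b; // trivially compatible

    [StreamingCompatible]
    public static void CallerInStreamingContext()
    {
        // OK per the analyzer: the callee is annotated as compatible.
        int r = AddScalar(2, 3);
        Console.WriteLine(r);
    }
}
```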
Right, sorry. That's what I meant. That is, I meant
👍. My only real concern is about the eventual scope creep of "SVE compatible" and that it may explode into a much larger set of annotations. In practice, the only code that should be "incompatible" is code using some of the SVE instructions that are incompatible without
Using just an attribute works, but it doesn't provide a clear boundary for users and will come with additional implications that the VM and JIT need to handle. For example, they may have to special case what can happen with inlining or optimizations across the boundary. They may also require additional tracking to know that we're in a method context that has SVE enabled. If a method is SVE compatible and can be used from both SVE and non-SVE code, then we may need to "specialize" for each to ensure good/efficient codegen. I'm concerned that this additional handling may not be "pay for play" in the JIT. However, for It's really much the same thing, just trying to play off the existing support and functionality the JIT has. |
I am not sure what kind of special tracking for UnmanagedCallersOnly you have in mind. The main parts of UnmanagedCallersOnly are:
The JIT is not able to inline through
There is a large spectrum of how the streaming support can look: from a simple explicit approach that enables it and requires users to do more work, all the way to streaming-aware auto-vectorization on the other end. I think it would be reasonable to start with the simple explicit approach. Something like:
|
That's what I had in mind in option 3 above. The only reason I introduced |
If we were to do automatically inserted mode switches, I think we would want to do them in both directions: turn on streaming around streaming required method calls and turn off streaming around streaming incompatible method calls. It gets more complicated with optimizations - for example, if there are two calls in a row or in a loop, should the JIT be smart enough to turn the streaming on/off just once for the two calls or around the whole loop? I would wait to see the evidence that the automatic mode switches are really needed. Starting with the fully explicit scheme should not prevent us from introducing automatically inserted mode switches later. |
I don't think this is ideal:
My suggestion would be to use a method attribute which contains the name of the feature that disables the method. I hope that's something that's not too tricky to retrospectively add to an existing API, as for most use cases it's not changing the API. That would enable us to continue the SVE design, and then retrospectively add the attributes to SVE and AdvSimd. |
Do you have this inversed? Based on the architecture manual it looks like The manual states that
Users must be aware when an instruction might be supported or not. Any temporary state changes, like SME, would have to be accounted for as part of context switching to ensure other threads don't suddenly fail. Using attributes + an analyzer to determine if a given intrinsic API is supported is just a different approach to the same problem, but is inconsistent with how we've handled it so far. It also forces reliance on an analyzer for correctness, rather than a more implicit understanding that if Based on what the manual covers, we have the following separate checks:
Thus, it logically tracks that having 4 classes here would also work, each one exactly corresponding to a documented operation check around the support:
This ensures we have a very clean and clear separation of support that can be constant in most contexts. The only thing that really changes is in an SME enabled context and based on whether
Going into an SME enabled context requires |
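The four-way split of support checks described above could plausibly take a shape like the following. This is a hypothetical sketch for discussion only, loosely mirroring existing `System.Runtime.Intrinsics.Arm` conventions (such as `Aes : ArmBase`); none of these class names or members are approved APIs, and the `IsSupported` bodies are stubs where the real JIT would substitute constants:

```csharp
using System;

namespace Hypothetical.Intrinsics.Arm
{
    // Non-streaming SVE instructions; IsSupported would be a JIT-time constant.
    public abstract class Sve
    {
        public static bool IsSupported => false; // stub for the sketch
    }

    // SVE2 extends SVE, so it nests under it like existing ISA classes do.
    public abstract class Sve2 : Sve
    {
        public static new bool IsSupported => false;
    }

    // SME and the instructions only available in streaming mode.
    public abstract class Sme
    {
        public static bool IsSupported => false;
    }

    // The subset of SVE instructions that remain legal while streaming
    // mode is active.
    public abstract class SveStreamingCompatible
    {
        public static bool IsSupported => false;
    }
}
```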
Yes, you're right I had this inverted.
In a world without Quoting DDI0616B_a_SME_Supplement.pdf: For the avoidance of doubt, A64 scalar floating-point instructions which match the following encoding patterns remain legal when the PE is in Streaming SVE mode: You plan on not making these available in C# when in streaming mode?
I will setup a meeting with @tannergooding @a74nh @TamarChristinaArm to discuss this. |
Thanks for working on this. Very excited about future SVE2 support!
First step might be having GitHub hosted arm64 runners and images at all:
While useful for many developers, it might help if a request came from an internal Microsoft team. As for SVE and SVE2, a CPU with Neoverse V1 (SVE) or Neoverse N2, V2, N3 or V3 cores is needed. Since Ampere hasn't released processors with these cores, the Microsoft Azure Cobalt 100 CPU looks like the best option, which uses N2 cores. It seems all pieces of the puzzle are already in-house at Microsoft!
Closing as completed for .NET 9 effort. We will create a new user story on Arm64 performance for .NET 10 and move some of the remaining items there. |
- `Vector<T>` represents a flexible vector length depending on the underlying hardware, and the idea is to expose SVE2's "flexible vector length" functionality using `Vector<T>`. We need to think about the common instructions and validate whether they can be represented by APIs that just take `Vector<T>` as a parameter. Additionally, SVE2 introduces a "predicate register" to mark vector lanes active/inactive. We do not want to expose this concept to our consumers through the .NET libraries either, for the reason mentioned above. Hence, for every API, we need to come up with pseudo code for how the API should be implemented internally in the JIT such that the "predicate register" concept is created and consumed inside the JIT implicitly. There is a good discussion that happened in the past about the API proposal in [API Proposal]: Example usages of a VectorSVE API #88140.
- `System.Runtime.Intrinsic.Arm.SVE2`: Once the design is finalized, we need to add all the APIs in a new `SVE2` class under `System.Runtime.Intrinsic.Arm`. They need to be plumbed through the JIT (by transforming tree nodes to represent the "predicate register" concept, if needed) all the way to generating code. Arm64: Implement SVE APIs #99957
- `emitIns_*()` methods: add support for the new instructions
- `emitfmtsarm64.h`: needs the new instruction formatting added. Arm64: SVE/SVE2 encodings #94285
- `instrsarm64.h`: needs the new instruction-to-instruction-formatting mapping added. Arm64: SVE/SVE2 encodings #94285
- `emitOutputInstr()`
- `Zx` registers and predicate registers: LSRA: Add support to track more than 64 registers #99658
- `genArm64EmitterUnitTests()`
- `formatEncode*` data
- `insSveIsLslN`/`insSveGetLslOrModN` into a table-driven lookup.
- `SVE_simd` and `SVE_mask` are towards the end of the frame so that stack offsets take the variable VL into account.

Edit: A working prototype of the tool that generates 2 C++ files for encoding is in Arm64: SVE/SVE2 encodings #94285.

- `instrsarm64.h`, containing the existing as well as the new formats, along with the encoding's binary and hexadecimal representation.
- `emitfmtsarm64.h` file.
- For `mmmm`, `dddd`, etc. in the binary representation of the encoding, have the tool write the logic to produce the instruction bytes. In other words, this will generate code that can be pasted into the `emitOutputInstr()` function.
- `INST*` entries like `INST9` or `INST8`, etc., regenerated in sorted order.

Note that if we need to regenerate existing files like `instrsarm64.h` and `emitfmtsarm64.h`, the existing instructions' encodings also need to be generated by the tool.

- For `movprfx`, verify it follows all the rules with regards to destination registers, size, etc. Arm64/Sve: Validation for movprfx in emitter #105514
- `SVE.IsSupported` should return `false` for it.
- `Vector<T>` shows the right number of elements.
- `ICorDebug` needs to be updated to know which SVE instructions have relative read/write/jumps - walker.cpp is specifically the place that needs to be updated