Improve support for 512-bit Vector<T> #111472
base: main
Conversation
CC @tannergooding. This is ready for review.

Would it be all right to add the alt-sized Vector<T> configs to one of the jitstress-isas pipelines?
```diff
@@ -17,6 +17,6 @@
   </PropertyGroup>

   <ItemGroup>
-    <Compile Include="CpuId.cs" />
+    <Compile Include="../../../JIT/HardwareIntrinsics/X86/X86Base/CpuId.cs" />
```
The local copy of these tests had been bit rotting for a while. I could bring it up to date, but it seems reasonable to just share it with JIT.
```diff
-            succeeded &= VectorByteCount == 32;
+            // MaxVectorTBitWidth env variable can be used to change Vector<T> size.
+            // We can only assume it is at least 16 bytes.
+            succeeded &= VectorByteCount >= 16;
```
I kept changes to the regression tests minimal -- this is handled more robustly in the CpuId tests.
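As an illustration of the more robust, size-agnostic style of check (a sketch only, not the actual CpuId test code):

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

int byteCount = Vector<byte>.Count;

// Vector<T> must be at least 16 bytes and a power of two, whatever MaxVectorTBitWidth is set to.
bool succeeded = byteCount >= 16 && (byteCount & (byteCount - 1)) == 0;

// It should also never exceed the widest fixed-width vector reported as hardware accelerated.
int widestAccelerated = Vector512.IsHardwareAccelerated ? 64
                      : Vector256.IsHardwareAccelerated ? 32
                      : 16;
succeeded &= byteCount <= widestAccelerated;

Console.WriteLine(succeeded ? "Pass" : "Fail");
```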
/azp run runtime-coreclr jitstress-isas-x86

Azure Pipelines successfully started running 1 pipeline(s).
We do not guarantee alignment of writes in …

That's an issue with the code using it. It's not a blocker for this PR, which enables 512-bit Vector<T>.
```diff
@@ -686,6 +686,9 @@ RETAIL_CONFIG_DWORD_INFO(INTERNAL_GDBJitEmitDebugFrame, W("GDBJitEmitDebugFrame"
 #endif

 RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_MaxVectorTBitWidth, W("MaxVectorTBitWidth"), 0, "The maximum decimal width, in bits, that Vector<T> is allowed to be. A value less than 128 is treated as the system default.", CLRConfig::LookupOptions::ParseIntegerAsBase10)
+#if defined(TARGET_AMD64) || defined(TARGET_X86)
+RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_PreferredVectorBitWidth, W("PreferredVectorBitWidth"), 0, "The maximum decimal width, in bits, of fixed-width vectors that may be considered hardware accelerated. A value less than 128 is treated as the system default.", CLRConfig::LookupOptions::ParseIntegerAsBase10)
```
Should we avoid reading the config setting in both the VM and the JIT, so that there is just one source of truth?

I assume that the JIT should be able to get the PreferredVectorBitWidth from the VM's Vector<T> size, now that the Vector<T> size is set based on it.
There's still a disconnect, because we'll continue to default Vector<T> to 256-bit even when PreferredVectorBitWidth is 512 (or 0 with fully-accelerated AVX-512). i.e. 512-bit Vector<T> will continue to be strictly opt-in for the foreseeable future.
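For concreteness, a minimal sketch of what that opt-in looks like from user code; the counts shown assume an AVX-512 capable machine and the defaults described above, and are illustrative rather than guaranteed:

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Assumed behavior on an AVX-512 machine under this PR's defaults:
//   defaults                      -> Vector512.IsHardwareAccelerated == true, Vector<byte>.Count == 32
//   DOTNET_MaxVectorTBitWidth=512 -> Vector512.IsHardwareAccelerated == true, Vector<byte>.Count == 64
Console.WriteLine($"Vector512 accelerated: {Vector512.IsHardwareAccelerated}");
Console.WriteLine($"Vector<T> size in bytes: {Vector<byte>.Count}");
```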
The fact that the fixed-width vectors are known only JIT-side is odd, and I've been thinking through how it could be cleaned up, but I haven't come up with anything that markedly improves what we have now. If you have ideas, I could give it a shot.
we'll continue to default Vector to 256-bit even when PreferredVectorBitWidth is 512
Is this change a UX improvement then? It seems to make the explanation of how the switches work and interact more complicated.
It's early enough in the cycle that I think we can allow the default size of Vector<T> to be 512 if the hardware supports it.

It would go out in P2 at the latest and give us an early indicator if too many users are broken or regress in performance because of it. Such users can always do MaxVectorTBitWidth=256 as a workaround, particularly if the number of breaks is "small", and we can revert and do the more complex configuration knobs if the signal is high enough.
I think what we will end up conceptually having in the "worst" case is the following:

```
// Simplified a bit as it doesn't show the handling for invalid user-specified values
// and doesn't show the values used if no user-specified value is given, but which
// are MaxVectorBitWidth and 256, respectively
DOTNET_PreferredVectorBitWidth = Min(UserSpecified, MaxVectorBitWidth)
DOTNET_MaxVectorTBitWidth = Min(Max(UserSpecified, 256), DOTNET_PreferredVectorBitWidth)
```

Where MaxVectorBitWidth is the conceptual value the hardware supports and the "worst" case is the need to default Vector<T> to no more than 256 bits in size.
This ensures that sizeof(Vector<T>) is not greater than the maximum size that we report as hardware accelerated. For example, you could not on x64 have Vector512.IsHardwareAccelerated == false and Vector<byte>.Count == 64, as that would add needless complexity to the user consideration.
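A minimal C# sketch of that clamping, with invented variable names (the real selection happens in native VM code, and the concrete defaults here are only for illustration):

```csharp
using System;

// Invented placeholder inputs; in the runtime these come from DOTNET_* config and CPUID.
int maxHardwareVectorBitWidth = 512; // e.g. an AVX-512 capable machine
int userPreferredBitWidth     = 0;   // DOTNET_PreferredVectorBitWidth, 0 = not specified
int userMaxVectorTBitWidth    = 0;   // DOTNET_MaxVectorTBitWidth, 0 = not specified

// "Worst case" scheme from above: Vector<T> defaults to no more than 256 bits.
int preferredVectorBitWidth = Math.Min(
    userPreferredBitWidth != 0 ? userPreferredBitWidth : maxHardwareVectorBitWidth,
    maxHardwareVectorBitWidth);

int maxVectorTBitWidth = Math.Min(
    Math.Max(userMaxVectorTBitWidth, 256),
    preferredVectorBitWidth);

Console.WriteLine($"PreferredVectorBitWidth = {preferredVectorBitWidth}"); // 512
Console.WriteLine($"Vector<T> bit width     = {maxVectorTBitWidth}");      // 256
```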
Ideally though we can simplify it to the following:

```
DOTNET_PreferredVectorBitWidth = Min(UserSpecified, MaxVectorBitWidth)
DOTNET_MaxVectorTBitWidth = Min(UserSpecified, DOTNET_PreferredVectorBitWidth)
```

I think in both cases we don't need the JIT to handle anything, however. We can pick the sizeof Vector<T> entirely on the VM side and then have the JIT simply get it via the compExactlyDependsOn(InstructionSet_VectorT512) checks that currently propagate that size information to the JIT: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/compiler.h#L9285-L9330
Right, but then how does the JIT decide which fixed-width classes to report as hardware accelerated? It can't use the VectorT ISA, because we have to continue to support Vector512.IsHardwareAccelerated == true with sizeof(Vector<T>) == 32.
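To make the constraint concrete, here is a hedged sketch of the guard pattern user code commonly relies on (a hypothetical helper, not code from this PR), where Vector512.IsHardwareAccelerated == true with Vector<byte>.Count == 32 must remain a valid combination:

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

static class VectorWidthProbe // hypothetical helper, for illustration only
{
    public static int WidestAcceleratedByteWidth()
    {
        // User code branches on the fixed-width flags independently of sizeof(Vector<T>),
        // so the fixed-width acceleration report and the Vector<T> size must stay coherent.
        if (Vector512.IsHardwareAccelerated)
            return Vector512<byte>.Count; // 64, even while Vector<byte>.Count may still be 32
        if (Vector256.IsHardwareAccelerated)
            return Vector256<byte>.Count; // 32
        if (Vector128.IsHardwareAccelerated)
            return Vector128<byte>.Count; // 16
        return Vector<byte>.Count;        // software fallback size
    }
}
```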
It seems to make the explanation of how the switches work and interact more complicated.

I agree with this either way. The name PreferredVectorBitWidth isn't helping. Perhaps that could be better named. Something like MaxAcceleratedVectorBitWidth?

Then the largest fixed-width vector considered accelerated is Min(MaxAcceleratedVectorBitWidth, [Max hardware SIMD size]), and Vector<T> size is Min(MaxVectorTBitWidth, MaxAcceleratedVectorBitWidth, [Max hardware SIMD size]), where we can still default MaxVectorTBitWidth to 256 for back compat.
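A quick numeric walk-through of that proposal, as a sketch (MaxAcceleratedVectorBitWidth is only the suggested rename above, not an existing knob, and the variable names are invented):

```csharp
using System;

// Example: AVX-512 machine, user sets only DOTNET_MaxVectorTBitWidth=512.
int maxHardwareSimdBits          = 512;
int maxAcceleratedVectorBitWidth = 512; // suggested rename of PreferredVectorBitWidth; defaults to the hardware max
int maxVectorTBitWidth           = 512; // user opt-in; would still default to 256 for back compat

int largestAcceleratedBits = Math.Min(maxAcceleratedVectorBitWidth, maxHardwareSimdBits);                     // 512
int vectorTBits = Math.Min(Math.Min(maxVectorTBitWidth, maxAcceleratedVectorBitWidth), maxHardwareSimdBits);  // 512

Console.WriteLine($"Largest accelerated fixed-width vector: {largestAcceleratedBits} bits");
Console.WriteLine($"Vector<T>: {vectorTBits} bits");
```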
That is, of course, the same thing I've implemented here, but the naming is easier to follow. I wouldn't mind taking on a cleanup task to deal with the 'virtual' ISAs that today only the JIT knows about. Removing this duplication (if it stays) should be one of the goals of that effort.
Right, but then how does the JIT decide which fixed-width classes to report as hardware accelerated?

In the same way the VM currently sets one of InstructionSet_VectorT### to indicate its size, it should likely be using InstructionSet_Vector### to indicate acceleration.

That is, the JIT doesn't need access to PreferredVectorBitWidth nor MaxVectorTBitWidth. Rather, it knows sizeof(Vector<T>) based on InstructionSet_VectorT### and it knows opts.preferredVectorByteLength based on the largest InstructionSet_Vector###.

It can then continue internally setting InstructionSet_Vector### itself (after it's computed the opts.preferredVectorByteLength value) based on the InstructionSet_Isa that are supported, so that we still get the expected codegen, just without reporting itself as accelerated.
InstructionSet_Vector### is exactly the problem. Today, only the JIT knows about those ISAs. Moving them to the VM would be a breaking change for R2R and NAOT, because they would also have to pass those to the JIT, and they don't.

Fixing that (and I do agree it needs to be fixed) is a much bigger change that is better done separately.
Talked through this on Discord by explaining the difference between readytoruninstructionset and corinfoinstructionset. The latter isn't tracked by R2R but is used by both the JIT (codeman.cpp) and NAOT (CorInfoImpl.cs) VMs. Accordingly, we should be able to do all the selection on the VM side. The nuance was that we can't get the JIT info from just the Vector<T> size unless we agree that Vector<T> implicitly grows up to PreferredVectorBitWidth. There are concerns with that due to it historically being no larger than 32 bytes and the likelihood that code may break (bad assumptions were hardcoded) or silently regress (code paths no longer accelerate but were guarded against larger sizes).

The safest thing for us, given that concern, is to have the VM (https://github.com/dotnet/runtime/blob/main/src/coreclr/vm/codeman.cpp#L1225 and https://github.com/dotnet/runtime/blob/main/src/coreclr/tools/Common/JitInterface/CorInfoImpl.cs#L4197) start setting InstructionSet_Vector### based on DOTNET_PreferredVectorBitWidth. This ensures we have a single "source of truth" and avoids mirroring config knob access between the JIT/VM. We can then, in a separate PR, do a one-line change in the VM to experiment with defaulting Vector<T> to be the same as PreferredVectorBitWidth, rather than restricting it to no more than 256 by default. This makes it easy to back out if the concerns are founded and decently widespread breaks are found.
This keeps things simple with one source of truth and doesn't require more comprehensive refactoring to achieve.
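A hedged sketch of what that could look like on the managed (NAOT) side, in the spirit of CorInfoImpl.cs; every name below is invented for illustration and is not the actual implementation:

```csharp
using System;
using System.Collections.Generic;

static class VectorWidthPolicy // invented; not the real CorInfoImpl.cs code
{
    public static void Apply(
        HashSet<string> instructionSets,   // e.g. { "Avx2", "Avx512f", ... }
        int preferredVectorBitWidth,       // DOTNET_PreferredVectorBitWidth, 0 = default
        int maxVectorTBitWidth)            // DOTNET_MaxVectorTBitWidth, 0 = default
    {
        // Largest fixed-width vector the hardware supports.
        int maxHardwareBits = instructionSets.Contains("Avx512f") ? 512
                            : instructionSets.Contains("Avx2")    ? 256
                            : 128;

        // Single source of truth: the VM clamps acceleration to PreferredVectorBitWidth...
        int acceleratedBits = preferredVectorBitWidth >= 128
            ? Math.Min(preferredVectorBitWidth, maxHardwareBits)
            : maxHardwareBits;

        // ...and derives the Vector<T> size from it (default capped at 256 for compatibility).
        int vectorTBits = Math.Min(
            maxVectorTBitWidth >= 128 ? maxVectorTBitWidth : 256,
            acceleratedBits);

        // Communicate both decisions to the JIT via instruction sets, not raw config values.
        if (acceleratedBits >= 256) instructionSets.Add("Vector256");
        if (acceleratedBits >= 512) instructionSets.Add("Vector512");

        instructionSets.Add(vectorTBits == 512 ? "VectorT512"
                          : vectorTBits == 256 ? "VectorT256"
                          : "VectorT128");
    }
}
```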
OK, I believe I have done what we discussed. It did allow removing a JIT flag and got the throttling logic completely on the VM side, but I couldn't move the Vector### ISAs completely over because the meanings are still overloaded. e.g. InstructionSet_Vector256 can't mean we want fully-accelerated Vector256 (AVX2 + PreferredVectorBitWidth >= 256), because it still means partially-accelerated Vector256 (AVX) is actually supported.

I also found that we were doing the Vector512 throttling on the AOT side, so I've duplicated the logic to clamp Vector<T> size there as well. So we still have duplicated logic, but I suppose it's a bit cleaner now.
Fixes #104978

- Fixes code that hardcoded the size of Vector<T> or assumed AVX2 support meant Vector<T> was 256-bit
- Enables --max-vectort-bitwidth:512 in AOT and adds a smoke test for it
- Ensures Vector<T> is no larger than PreferredVectorBitWidth

Tests pass locally (AVX-512 machine) with DOTNET_MaxVectorTBitWidth=512 and DOTNET_MaxVectorTBitWidth=128.
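As a sketch only, the kind of assertion such a smoke test can make (the actual test added by this PR may differ):

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// When published with --max-vectort-bitwidth:512 (or run with DOTNET_MaxVectorTBitWidth=512),
// Vector<T> is expected to match the widest accelerated fixed-width vector.
int expected = Vector512.IsHardwareAccelerated ? 64
             : Vector256.IsHardwareAccelerated ? 32
             : 16;

int actual = Vector<byte>.Count;
Console.WriteLine($"Vector<byte>.Count = {actual}, expected {expected}");
return actual == expected ? 100 : 1; // 100 is the conventional success exit code in runtime tests
```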
I believe this also addresses any problems that prevented opportunistic enablement of AVX2 on AOT when AVX is enabled, but I'll do some more testing around that and make a separate PR for that change if all is well.