Improve support for 512-bit Vector<T> #111472

Open · saucecontrol wants to merge 8 commits into base: main
Conversation

@saucecontrol (Member) commented Jan 15, 2025

Fixes #104978

  • Fixes tests that had bogus failures (e.g. not enough test data)
  • Adds tests to cover some existing broken paths
  • Fixes JIT and CoreLib paths that did not support 512-bit Vector<T> or assumed AVX2 support meant Vector<T> was 256-bit
  • Enables support for --max-vectort-bitwidth:512 in AOT and adds a smoke test for it
  • Ensures Vector<T> is no larger than PreferredVectorBitWidth

Tests pass locally (on an AVX-512 machine) with both DOTNET_MaxVectorTBitWidth=512 and DOTNET_MaxVectorTBitWidth=128.

I believe this also addresses any problems that prevented opportunistic enablement of AVX2 on AOT when AVX is enabled, but I'll do some more testing around that and make a separate PR for that change if all is well.
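
For context, a minimal program along these lines (illustrative only, not part of the PR) can be used to observe the knobs this change affects; run it with DOTNET_MaxVectorTBitWidth set to 128, 256, or 512 on AVX-512 hardware:

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// DOTNET_MaxVectorTBitWidth changes the variable-width Vector<T>, while the
// fixed-width acceleration flags reflect hardware and PreferredVectorBitWidth.
Console.WriteLine($"Vector<byte>.Count: {Vector<byte>.Count} ({Vector<byte>.Count * 8} bits)");
Console.WriteLine($"Vector256.IsHardwareAccelerated: {Vector256.IsHardwareAccelerated}");
Console.WriteLine($"Vector512.IsHardwareAccelerated: {Vector512.IsHardwareAccelerated}");
```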

@dotnet-issue-labeler (bot) added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Jan 15, 2025
@dotnet-policy-service (bot) added the community-contribution label (indicates that the PR has been added by a community member) on Jan 15, 2025
@saucecontrol (Member, Author) left a comment

CC @tannergooding. This is ready for review.

Would it be all right to add the alt-sized Vector<T> configs to one of the jitstress-isas pipelines?

```diff
@@ -17,6 +17,6 @@
   </PropertyGroup>

   <ItemGroup>
-    <Compile Include="CpuId.cs" />
+    <Compile Include="../../../JIT/HardwareIntrinsics/X86/X86Base/CpuId.cs" />
```
@saucecontrol (Member, Author) commented

The local copy of these tests had been bit rotting for a while. I could bring it up to date, but it seems reasonable to just share it with JIT.

```diff
-            succeeded &= VectorByteCount == 32;
+            // MaxVectorTBitWidth env variable can be used to change Vector<T> size.
+            // We can only assume it is at least 16 bytes.
+            succeeded &= VectorByteCount >= 16;
```
@saucecontrol (Member, Author) commented

I kept changes to the regression tests minimal -- this is handled more robustly in the CpuId tests.
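
For illustration, here is a hedged sketch of what the more robust check can look like (hypothetical test code, not the actual CpuId test; it assumes only the guarantees stated in this thread: values below 128 mean the system default, Vector&lt;T&gt; is at least 16 bytes, and it is never enlarged past a valid requested width):

```csharp
using System;
using System.Numerics;

// Hypothetical validation mirroring the guarantees discussed here.
string? env = Environment.GetEnvironmentVariable("DOTNET_MaxVectorTBitWidth");
int requestedBits = int.TryParse(env, out int parsed) ? parsed : 0;

// The only hard lower bound is the 128-bit (16-byte) minimum.
bool succeeded = Vector<byte>.Count >= 16;
if (requestedBits >= 128) // values below 128 mean "system default"
{
    // Hardware support or PreferredVectorBitWidth may clamp Vector<T>
    // below the request, but must never enlarge it past the request.
    succeeded &= Vector<byte>.Count <= requestedBits / 8;
}
```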

@tannergooding (Member) commented

/azp run runtime-coreclr jitstress-isas-x86

Azure Pipelines successfully started running 1 pipeline(s).

@xtqqczze (Contributor) commented Jan 17, 2025

We do not guarantee alignment of writes in SpanHelpers.Fill<T>, which means an AVX-512 path could regress performance due to stores that cross a cache-line boundary (64 bytes). Since AVX-512 stores are 64 bytes wide, any misaligned write is guaranteed to split a cache line.

@tannergooding (Member) commented

> which means an AVX-512 path could regress performance due to stores that cross a cache-line boundary (64 bytes)

That's an issue with code using Vector<T> without either pinning or attempting opportunistic alignment. The issue already exists due to the variable-sized nature of Vector<T> and us not guaranteeing a maximum size.

It's not a blocker for this PR, which enables opt-in usage of Vector<T> as Vector512<T>, nor would it be an issue for, say, Arm64 SVE, where Vector<T> can be up to 2048 bits (although no such hardware exists today).
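
As an aside, here is a minimal sketch of the opportunistic-alignment pattern referenced above (hypothetical code, not the actual SpanHelpers.Fill<T> implementation): do scalar stores until the destination is aligned to the vector width, after which the vector stores cannot split a 64-byte cache line (assuming the vector is no wider than the line):

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;

public static unsafe class FillSketch
{
    // Hypothetical fill with opportunistic alignment; not runtime code.
    public static void Fill(byte* dest, nuint length, byte value)
    {
        nuint i = 0;
        nuint vecBytes = (nuint)Vector<byte>.Count;

        // Scalar prefix: store byte-by-byte until dest + i is vector-aligned.
        while (i < length && ((nuint)(dest + i) % vecBytes) != 0)
        {
            dest[i++] = value;
        }

        // Vector body: every store is now aligned to the vector width, so it
        // stays within one cache line whenever the width divides 64 bytes.
        Vector<byte> v = new Vector<byte>(value);
        for (; i + vecBytes <= length; i += vecBytes)
        {
            Unsafe.Write(dest + i, v);
        }

        // Scalar tail for whatever remains.
        for (; i < length; i++)
        {
            dest[i] = value;
        }
    }
}
```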

```cpp
@@ -686,6 +686,9 @@ RETAIL_CONFIG_DWORD_INFO(INTERNAL_GDBJitEmitDebugFrame, W("GDBJitEmitDebugFrame"
#endif

RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_MaxVectorTBitWidth, W("MaxVectorTBitWidth"), 0, "The maximum decimal width, in bits, that Vector<T> is allowed to be. A value less than 128 is treated as the system default.", CLRConfig::LookupOptions::ParseIntegerAsBase10)
#if defined(TARGET_AMD64) || defined(TARGET_X86)
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_PreferredVectorBitWidth, W("PreferredVectorBitWidth"), 0, "The maximum decimal width, in bits, of fixed-width vectors that may be considered hardware accelerated. A value less than 128 is treated as the system default.", CLRConfig::LookupOptions::ParseIntegerAsBase10)
```
@jkotas (Member) commented Jan 18, 2025

Should we avoid reading the config setting in both the VM and the JIT, so that there is just one source of truth?

I assume that the JIT should be able to get the PreferredVectorBitWidth from the VM's Vector<T> now that Vector<T> size is set based on it.

@saucecontrol (Member, Author) commented Jan 18, 2025

There's still a disconnect, because we'll continue to default Vector<T> to 256-bit even when PreferredVectorBitWidth is 512 (or 0 with fully-accelerated AVX-512). That is, 512-bit Vector<T> will continue to be strictly opt-in for the foreseeable future.

The fact that the fixed-size vectors are known only JIT-side is odd, and I've been thinking through how it could be cleaned up, but I haven't come up with anything that markedly improves what we have now. If you have ideas, I could give it a shot.

A member commented

> we'll continue to default Vector<T> to 256-bit even when PreferredVectorBitWidth is 512

Is this change a UX improvement then? It seems to make the explanation of how the switches work and interact more complicated.

A member commented

It's early enough in the cycle that I think we can allow the default size of Vector<T> to be 512 if the hardware supports it.

It would go out in P2 at the latest and give us an early indicator of whether too many users are broken or regress in performance because of it. Such users can always set MaxVectorTBitWidth=256 as a workaround, particularly if the number of breaks is "small", and we can revert and add the more complex configuration knobs if the signal is strong enough.

@tannergooding (Member) commented Jan 18, 2025

I think what we will end up conceptually having in the "worst" case is the following:

```
// Simplified a bit as it doesn't show the handling for invalid user-specified values
// and doesn't show the values used if no user-specified value is given, but which
// are MaxVectorBitWidth and 256, respectively

DOTNET_PreferredVectorBitWidth = Min(UserSpecified, MaxVectorBitWidth)
DOTNET_MaxVectorTBitWidth = Min(Max(UserSpecified, 256), DOTNET_PreferredVectorBitWidth)
```

Where MaxVectorBitWidth is the conceptual value the hardware supports, and the "worst" case is the need to default Vector<T> to no more than 256 bits in size.

This ensures that sizeof(Vector<T>) is not greater than the maximum size we report as hardware accelerated. For example, you could not, on x64, have Vector512.IsHardwareAccelerated == false and Vector<byte>.Count == 64, as that would add needless complexity for users.

Ideally though we can simplify it to the following:

```
DOTNET_PreferredVectorBitWidth = Min(UserSpecified, MaxVectorBitWidth)
DOTNET_MaxVectorTBitWidth = Min(UserSpecified, DOTNET_PreferredVectorBitWidth)
```

I think in both cases we don't need the JIT to handle anything, however. We can pick the sizeof(Vector<T>) entirely on the VM side and then have the JIT simply get it via the compExactlyDependsOn(InstructionSet_VectorT512) checks that currently propagate that size information to the JIT: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/compiler.h#L9285-L9330
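
To make the interaction concrete, here is a hedged C# rendering of the two clamping schemes above (the names are illustrative and mirror the pseudo-config, not actual VM code):

```csharp
using System;

static class VectorWidthSketch
{
    // Both schemes clamp the preferred width to what the hardware supports.
    public static int PreferredVectorBitWidth(int userSpecified, int maxHardwareBitWidth)
        => Math.Min(userSpecified, maxHardwareBitWidth);

    // "Worst" case: Vector<T> is kept at no more than 256 bits by default.
    public static int MaxVectorTBitWidthConservative(int userSpecified, int preferred)
        => Math.Min(Math.Max(userSpecified, 256), preferred);

    // Ideal case: Vector<T> simply follows the preferred width.
    public static int MaxVectorTBitWidthSimple(int userSpecified, int preferred)
        => Math.Min(userSpecified, preferred);
}
```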

@saucecontrol (Member, Author) commented

Right, but then how does the JIT decide which fixed-width classes to report as hardware accelerated? It can't use the VectorT ISA, because we have to continue to support Vector512.IsHardwareAccelerated == true with sizeof(Vector<T>) == 32.

> It seems to make the explanation of how the switches work and interact more complicated.

I agree with this either way. The name PreferredVectorBitWidth isn't helping. Perhaps that could be better named. Something like MaxAcceleratedVectorBitWidth?

Then the largest fixed-width vector considered accelerated is Min(MaxAcceleratedVectorBitWidth, [Max hardware SIMD size])

And Vector<T> size is Min(MaxVectorTBitWidth, MaxAcceleratedVectorBitWidth, [Max hardware SIMD size]), where we can still default MaxVectorTBitWidth to 256 for back compat.

That is, of course, the same thing I've implemented here, but the naming is easier to follow. I wouldn't mind taking on a cleanup task to deal with the 'virtual' ISAs that today only the JIT knows about. Removing this duplication (if it stays) should be one of the goals of that effort.

@tannergooding (Member) commented Jan 18, 2025

> Right, but then how does the JIT decide which fixed-width classes to report as hardware accelerated?

In the same way the VM currently sets one of InstructionSet_VectorT### to indicate its size, it should likely be using InstructionSet_Vector### to indicate acceleration.

That is, the JIT doesn't need access to either PreferredVectorBitWidth or MaxVectorTBitWidth. Instead, it knows sizeof(Vector<T>) based on InstructionSet_VectorT###, and it knows opts.preferredVectorByteLength based on the largest InstructionSet_Vector###.

It can then continue internally setting InstructionSet_Vector### itself (after it has computed the opts.preferredVectorByteLength value) based on the InstructionSet_Isa that are supported, so that we still get the expected codegen, just without reporting it as accelerated.

@saucecontrol (Member, Author) commented

InstructionSet_Vector### is exactly the problem. Today, only the JIT knows about those ISAs. Moving them to the VM would be a breaking change for R2R and NAOT, because they would also have to pass those to the JIT, and they don't.

Fixing that (and I do agree it needs to be fixed) is a much bigger change that is better done separately.

A member commented

Talked through this on Discord by explaining the difference between readytoruninstructionset and corinfoinstructionset. The latter isn't tracked by R2R but is used by both the JIT (codeman.cpp) and NAOT (CorInfoImpl.cs) VMs. Accordingly, we should be able to do all the selection on the VM side. The nuance was that we can't get the JIT info from just the Vector<T> size unless we agree that Vector<T> implicitly grows up to PreferredVectorBitWidth. There are concerns with that, due to it historically being no larger than 32 bytes and the likelihood that code may break (bad assumptions were hardcoded) or silently regress (code paths that were guarded against larger sizes no longer accelerate).

The safest thing for us, given that concern, is to have the VM (https://github.com/dotnet/runtime/blob/main/src/coreclr/vm/codeman.cpp#L1225 and https://github.com/dotnet/runtime/blob/main/src/coreclr/tools/Common/JitInterface/CorInfoImpl.cs#L4197) start setting InstructionSet_Vector### based on DOTNET_PreferredVectorBitWidth. This ensures we have a single "source of truth" and avoids mirroring config-knob access between the JIT and VM. We can then, in a separate PR, make a one-line change in the VM to experiment with defaulting Vector<T> to the same size as PreferredVectorBitWidth, rather than restricting it to no more than 256 by default. This makes it easy to back out if the concerns are founded and decently widespread breaks are found.

This keeps things simple with one source of truth and doesn't require more comprehensive refactoring to achieve.

@saucecontrol (Member, Author) commented Jan 19, 2025

OK, I believe I have done what we discussed. It did allow removing a JIT flag and got the throttling logic completely onto the VM side, but I couldn't move the Vector### ISAs completely over, because their meanings are still overloaded. For example, InstructionSet_Vector256 can't mean we want fully-accelerated Vector256 (AVX2 + PreferredVectorBitWidth >= 256), because it still means partially-accelerated Vector256 (AVX) is actually supported.

I also found that we were doing the Vector512 throttling on the AOT side, so I've duplicated the logic to clamp Vector<T> size there as well. So we still have duplicated logic, but I suppose it's a bit cleaner now.

Successfully merging this pull request may close this issue: DOTNET_PreferredVectorBitWidth is not respected by Vector<T>