Improve support for 512-bit Vector<T> #111472

Open · saucecontrol wants to merge 8 commits into base: main
Conversation

@saucecontrol (Member) commented Jan 15, 2025

Fixes #104978

  • Fixes tests that had bogus failures (e.g. not enough test data)
  • Adds tests to cover some existing broken paths
  • Fixes JIT and CoreLib paths that did not support 512-bit Vector<T> or assumed AVX2 support meant Vector<T> was 256-bit
  • Enables support for --max-vectort-bitwidth:512 in AOT and adds a smoke test for it
  • Ensures Vector<T> is no larger than PreferredVectorBitWidth

Tests pass locally (on an AVX-512 machine) with both DOTNET_MaxVectorTBitWidth=512 and DOTNET_MaxVectorTBitWidth=128.

I believe this also addresses any problems that prevented opportunistic enablement of AVX2 on AOT when AVX is enabled, but I'll do some more testing around that and make a separate PR for that change if all is well.
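
For context, a minimal program along these lines (illustrative only, not part of the PR) can be used to observe the knobs this change affects; run it with DOTNET_MaxVectorTBitWidth set to 128, 256, or 512 on AVX-512 hardware:

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// DOTNET_MaxVectorTBitWidth changes the variable-width Vector<T>, while the
// fixed-width acceleration flags reflect hardware and PreferredVectorBitWidth.
Console.WriteLine($"Vector<byte>.Count: {Vector<byte>.Count} ({Vector<byte>.Count * 8} bits)");
Console.WriteLine($"Vector256.IsHardwareAccelerated: {Vector256.IsHardwareAccelerated}");
Console.WriteLine($"Vector512.IsHardwareAccelerated: {Vector512.IsHardwareAccelerated}");
```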

@dotnet-issue-labeler (bot) added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Jan 15, 2025
@dotnet-policy-service (bot) added the community-contribution label (indicates that the PR has been added by a community member) on Jan 15, 2025
@saucecontrol (Member, Author) left a comment

CC @tannergooding. This is ready for review.

Would it be all right to add the alt-sized Vector<T> configs to one of the jitstress-isas pipelines?

```diff
@@ -17,6 +17,6 @@
   </PropertyGroup>

   <ItemGroup>
-    <Compile Include="CpuId.cs" />
+    <Compile Include="../../../JIT/HardwareIntrinsics/X86/X86Base/CpuId.cs" />
```
@saucecontrol (Member, Author) commented

The local copy of these tests had been bit rotting for a while. I could bring it up to date, but it seems reasonable to just share it with JIT.

```diff
-            succeeded &= VectorByteCount == 32;
+            // MaxVectorTBitWidth env variable can be used to change Vector<T> size.
+            // We can only assume it is at least 16 bytes.
+            succeeded &= VectorByteCount >= 16;
```
@saucecontrol (Member, Author) commented

I kept changes to the regression tests minimal -- this is handled more robustly in the CpuId tests.
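
For illustration, here is a hedged sketch of what the more robust check can look like (hypothetical test code, not the actual CpuId test; it assumes only the guarantees stated in this thread: values below 128 mean the system default, Vector&lt;T&gt; is at least 16 bytes, and it is never enlarged past a valid requested width):

```csharp
using System;
using System.Numerics;

// Hypothetical validation mirroring the guarantees discussed here.
string? env = Environment.GetEnvironmentVariable("DOTNET_MaxVectorTBitWidth");
int requestedBits = int.TryParse(env, out int parsed) ? parsed : 0;

// The only hard lower bound is the 128-bit (16-byte) minimum.
bool succeeded = Vector<byte>.Count >= 16;
if (requestedBits >= 128) // values below 128 mean "system default"
{
    // Hardware support or PreferredVectorBitWidth may clamp Vector<T>
    // below the request, but must never enlarge it past the request.
    succeeded &= Vector<byte>.Count <= requestedBits / 8;
}
```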

@tannergooding (Member) commented

/azp run runtime-coreclr jitstress-isas-x86

Azure Pipelines successfully started running 1 pipeline(s).

@xtqqczze (Contributor) commented Jan 17, 2025

We do not guarantee alignment of writes in SpanHelpers.Fill<T>, which means an AVX-512 path could regress performance due to stores that cross a cache-line boundary (64 bytes). Since AVX-512 stores are 64 bytes wide, any misaligned write is guaranteed to split a cache line.

@tannergooding (Member) commented

> which means an AVX-512 path could regress performance due to stores that cross a cache-line boundary (64 bytes)

That's an issue with code using Vector<T> without either pinning or attempting opportunistic alignment. The issue already exists due to the variable-sized nature of Vector<T> and us not guaranteeing a maximum size.

It's not a blocker for this PR, which enables opt-in usage of Vector<T> as Vector512<T>, nor would it be an issue for, say, Arm64 SVE, where Vector<T> can be up to 2048 bits (although no such hardware exists today).
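
As an aside, here is a minimal sketch of the opportunistic-alignment pattern referenced above (hypothetical code, not the actual SpanHelpers.Fill<T> implementation): do scalar stores until the destination is aligned to the vector width, after which the vector stores cannot split a 64-byte cache line (assuming the vector is no wider than the line):

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;

public static unsafe class FillSketch
{
    // Hypothetical fill with opportunistic alignment; not runtime code.
    public static void Fill(byte* dest, nuint length, byte value)
    {
        nuint i = 0;
        nuint vecBytes = (nuint)Vector<byte>.Count;

        // Scalar prefix: store byte-by-byte until dest + i is vector-aligned.
        while (i < length && ((nuint)(dest + i) % vecBytes) != 0)
        {
            dest[i++] = value;
        }

        // Vector body: every store is now aligned to the vector width, so it
        // stays within one cache line whenever the width divides 64 bytes.
        Vector<byte> v = new Vector<byte>(value);
        for (; i + vecBytes <= length; i += vecBytes)
        {
            Unsafe.Write(dest + i, v);
        }

        // Scalar tail for whatever remains.
        for (; i < length; i++)
        {
            dest[i] = value;
        }
    }
}
```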

```cpp
@@ -686,6 +686,9 @@ RETAIL_CONFIG_DWORD_INFO(INTERNAL_GDBJitEmitDebugFrame, W("GDBJitEmitDebugFrame"
#endif

RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_MaxVectorTBitWidth, W("MaxVectorTBitWidth"), 0, "The maximum decimal width, in bits, that Vector<T> is allowed to be. A value less than 128 is treated as the system default.", CLRConfig::LookupOptions::ParseIntegerAsBase10)
#if defined(TARGET_AMD64) || defined(TARGET_X86)
RETAIL_CONFIG_DWORD_INFO_EX(EXTERNAL_PreferredVectorBitWidth, W("PreferredVectorBitWidth"), 0, "The maximum decimal width, in bits, of fixed-width vectors that may be considered hardware accelerated. A value less than 128 is treated as the system default.", CLRConfig::LookupOptions::ParseIntegerAsBase10)
```
@jkotas (Member) commented Jan 18, 2025

Should we avoid reading the config setting in both the VM and the JIT, so that there is just one source of truth?

I assume that the JIT should be able to get the PreferredVectorBitWidth from the VM's Vector<T> now that Vector<T> size is set based on it.

@saucecontrol (Member, Author) commented Jan 18, 2025

There's still a disconnect, because we'll continue to default Vector<T> to 256-bit even when PreferredVectorBitWidth is 512 (or 0 with fully-accelerated AVX-512). That is, 512-bit Vector<T> will continue to be strictly opt-in for the foreseeable future.

The fact that the fixed-size vectors are known only JIT-side is odd, and I've been thinking through how it could be cleaned up, but I haven't come up with anything that markedly improves what we have now. If you have ideas, I could give it a shot.

A member commented

> we'll continue to default Vector<T> to 256-bit even when PreferredVectorBitWidth is 512

Is this change a UX improvement then? It seems to make the explanation of how the switches work and interact more complicated.

A member commented

It's early enough in the cycle that I think we can allow the default size of Vector<T> to be 512 if the hardware supports it.

It would go out in P2 at the latest and give us an early indicator of whether too many users are broken or regress in performance because of it. Such users can always set MaxVectorTBitWidth=256 as a workaround, particularly if the number of breaks is "small", and we can revert and add the more complex configuration knobs if the signal is strong enough.

@tannergooding (Member) commented Jan 18, 2025

I think what we will end up conceptually having in the "worst" case is the following:

```
// Simplified a bit as it doesn't show the handling for invalid user-specified values
// and doesn't show the values used if no user-specified value is given, but which
// are MaxVectorBitWidth and 256, respectively

DOTNET_PreferredVectorBitWidth = Min(UserSpecified, MaxVectorBitWidth)
DOTNET_MaxVectorTBitWidth = Min(Max(UserSpecified, 256), DOTNET_PreferredVectorBitWidth)
```

Where MaxVectorBitWidth is the conceptual value the hardware supports, and the "worst" case is the need to default Vector<T> to no more than 256 bits in size.

This ensures that sizeof(Vector<T>) is not greater than the maximum size we report as hardware accelerated. For example, you could not, on x64, have Vector512.IsHardwareAccelerated == false and Vector<byte>.Count == 64, as that would add needless complexity for users.

Ideally though we can simplify it to the following:

```
DOTNET_PreferredVectorBitWidth = Min(UserSpecified, MaxVectorBitWidth)
DOTNET_MaxVectorTBitWidth = Min(UserSpecified, DOTNET_PreferredVectorBitWidth)
```

I think in both cases we don't need the JIT to handle anything, however. We can pick the sizeof(Vector<T>) entirely on the VM side and then have the JIT simply get it via the compExactlyDependsOn(InstructionSet_VectorT512) checks that currently propagate that size information to the JIT: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/compiler.h#L9285-L9330
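
To make the interaction concrete, here is a hedged C# rendering of the two clamping schemes above (the names are illustrative and mirror the pseudo-config, not actual VM code):

```csharp
using System;

static class VectorWidthSketch
{
    // Both schemes clamp the preferred width to what the hardware supports.
    public static int PreferredVectorBitWidth(int userSpecified, int maxHardwareBitWidth)
        => Math.Min(userSpecified, maxHardwareBitWidth);

    // "Worst" case: Vector<T> is kept at no more than 256 bits by default.
    public static int MaxVectorTBitWidthConservative(int userSpecified, int preferred)
        => Math.Min(Math.Max(userSpecified, 256), preferred);

    // Ideal case: Vector<T> simply follows the preferred width.
    public static int MaxVectorTBitWidthSimple(int userSpecified, int preferred)
        => Math.Min(userSpecified, preferred);
}
```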

@saucecontrol (Member, Author) commented

Right, but then how does the JIT decide which fixed-width classes to report as hardware accelerated? It can't use the VectorT ISA, because we have to continue to support Vector512.IsHardwareAccelerated == true with sizeof(Vector<T>) == 32.

> It seems to make the explanation of how the switches work and interact more complicated.

I agree with this either way. The name PreferredVectorBitWidth isn't helping. Perhaps that could be better named. Something like MaxAcceleratedVectorBitWidth?

Then the largest fixed-width vector considered accelerated is Min(MaxAcceleratedVectorBitWidth, [Max hardware SIMD size])

And Vector<T> size is Min(MaxVectorTBitWidth, MaxAcceleratedVectorBitWidth, [Max hardware SIMD size]), where we can still default MaxVectorTBitWidth to 256 for back compat.

That is, of course, the same thing I've implemented here, but the naming is easier to follow. I wouldn't mind taking on a cleanup task to deal with the 'virtual' ISAs that today only the JIT knows about. Removing this duplication (if it stays) should be one of the goals of that effort.

@tannergooding (Member) commented Jan 18, 2025

> Right, but then how does the JIT decide which fixed-width classes to report as hardware accelerated?

In the same way the VM currently sets one of InstructionSet_VectorT### to indicate its size, it should likely be using InstructionSet_Vector### to indicate acceleration.

That is, the JIT doesn't need access to either PreferredVectorBitWidth or MaxVectorTBitWidth. Instead, it knows sizeof(Vector<T>) based on InstructionSet_VectorT###, and it knows opts.preferredVectorByteLength based on the largest InstructionSet_Vector###.

It can then continue internally setting InstructionSet_Vector### itself (after it has computed the opts.preferredVectorByteLength value) based on the InstructionSet_Isa that are supported, so that we still get the expected codegen, just without reporting it as accelerated.

@saucecontrol (Member, Author) commented

InstructionSet_Vector### is exactly the problem. Today, only the JIT knows about those ISAs. Moving them to the VM would be a breaking change for R2R and NAOT, because they would also have to pass those to the JIT, and they don't.

Fixing that (and I do agree it needs to be fixed) is a much bigger change that is better done separately.

A member commented

Talked through this on Discord by explaining the difference between readytoruninstructionset and corinfoinstructionset. The latter isn't tracked by R2R but is used by both the JIT (codeman.cpp) and NAOT (CorInfoImpl.cs) VMs. Accordingly, we should be able to do all the selection on the VM side. The nuance was that we can't get the JIT info from just the Vector<T> size unless we agree that Vector<T> implicitly grows up to PreferredVectorBitWidth. There are concerns with that, due to it historically being no larger than 32 bytes and the likelihood that code may break (bad assumptions were hardcoded) or silently regress (code paths that were guarded against larger sizes no longer accelerate).

The safest thing for us, given that concern, is to have the VM (https://github.com/dotnet/runtime/blob/main/src/coreclr/vm/codeman.cpp#L1225 and https://github.com/dotnet/runtime/blob/main/src/coreclr/tools/Common/JitInterface/CorInfoImpl.cs#L4197) start setting InstructionSet_Vector### based on DOTNET_PreferredVectorBitWidth. This ensures we have a single "source of truth" and avoids mirroring config-knob access between the JIT and VM. We can then, in a separate PR, make a one-line change in the VM to experiment with defaulting Vector<T> to the same size as PreferredVectorBitWidth, rather than restricting it to no more than 256 by default. This makes it easy to back out if the concerns are founded and decently widespread breaks are found.

This keeps things simple with one source of truth and doesn't require more comprehensive refactoring to achieve.

@saucecontrol (Member, Author) commented Jan 19, 2025

OK, I believe I have done what we discussed. It did allow removing a JIT flag and got the throttling logic completely onto the VM side, but I couldn't move the Vector### ISAs completely over, because their meanings are still overloaded. For example, InstructionSet_Vector256 can't mean we want fully-accelerated Vector256 (AVX2 + PreferredVectorBitWidth >= 256), because it still means partially-accelerated Vector256 (AVX) is actually supported.

I also found that we were doing the Vector512 throttling on the AOT side, so I've duplicated the logic to clamp Vector<T> size there as well. So we still have duplicated logic, but I suppose it's a bit cleaner now.

Successfully merging this pull request may close this issue: DOTNET_PreferredVectorBitWidth is not respected by Vector<T>