Implement ShuffleNative methods and optimise Shuffle for non-constant indices #99596
base: main
Conversation
Note regarding the benchmark results of my AVX2 code: yes, this is a very micro benchmark, but the results are fairly reproducible on my machine (usually within ~10%) and are probably close to reality, since the operation itself should be quick. (Obviously this doesn't measure the overhead of interaction with surrounding code due to extra pipeline usage, etc.) (Edit: there was likely an issue with this benchmark turning into a no-op.)
Thank you for looking into this again. We're already using the so-far-internal Vector128.ShuffleUnsafe in a bunch of places. Should we be using Vector256.ShuffleUnsafe somewhere?
It seems to me that all the current uses of
Can someone please check I won't accidentally regress mono :)

Mono changes look good to me. Thanks for your contribution.
Re: 9868e73

I don't think there's any issue with the runtime relying on specific behaviour. For external libraries, I think one of the following approaches makes sense:

I think the approach needs to be consistent for all of them, so I removed the … Another option, which I briefly mentioned in a comment somewhere, is to expose a variant like …
I'm fine with only documenting "anything above 15 is UB".
Yes, I've been careful to not use the AVX-512 one for this method for this reason. I will add a comment at some point to explain this in the method (assuming I don't forget).
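For context on why the AVX-512 instruction can't simply be substituted here: `vpshufb` zeroes an element whenever bit 7 of its index is set, whereas the AVX-512 full-register byte permute (`vpermb`) ignores the unused upper index bits instead of zeroing, so the two disagree exactly on the out-of-range indices whose behaviour is being discussed above. A minimal scalar sketch of the two per-element behaviours (illustrative helper names, not runtime APIs):

```cpp
#include <array>
#include <cstdint>

// Per-element index handling of SSE/AVX vpshufb for a 16-byte shuffle:
// if bit 7 of the index is set the element is zeroed, otherwise only the
// low 4 bits select a source byte.
uint8_t PshufbElem(const std::array<uint8_t, 16>& src, uint8_t idx)
{
    return (idx & 0x80) ? 0 : src[idx & 0x0F];
}

// Per-element index handling of AVX-512 vpermb (128-bit form): the unused
// upper index bits are simply ignored, so out-of-range indices wrap around
// instead of producing zero.
uint8_t PermbElem(const std::array<uint8_t, 16>& src, uint8_t idx)
{
    return src[idx & 0x0F];
}
```

For indices 0..15 the two agree; for anything above 15 they can differ (e.g. index 0x80 yields 0 under `vpshufb` but the first source byte under `vpermb`), which is why code relying on one instruction's out-of-range behaviour can't silently switch to the other.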
(Title changed: "ShuffleUnsafe methods and optimise Shuffle for non-constant indices" → "ShuffleNative methods and optimise Shuffle for non-constant indices")
- And do a full review of the logic & comments of all my code
- Fix 3 bugs (2 pre-existing)
- Simplify/improve some code
// selected for every element. We can special-case this down to just returning a zero node
return gtNewZeroConNode(type);
@tannergooding for when you're reviewing: I believe the code here before was incorrect, as it didn't keep the side effects of op1. Consider something like this in the tree: V128.Create(/*something with side effects*/, V128<byte>.Zero). Please let me know if I'm missing something and should revert it.
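A tiny standalone model of the hazard being described (hypothetical names, not JIT code): when a fold replaces an expression with a constant, any discarded operand's side effects must still be evaluated, e.g. by wrapping the constant with them (a COMMA(op1, cns) in tree terms).

```cpp
#include <functional>

// Observable side-effect counter standing in for a call, null check, etc.
static int g_sideEffects = 0;

int SideEffectfulOperand()
{
    g_sideEffects++; // must run exactly once when the expression is evaluated
    return 42;
}

// Naive fold: the shuffle selects only zero elements, so return the constant 0
// and silently drop the other operand. Its side effect is lost.
int FoldNaive(std::function<int()> op1)
{
    (void)op1;
    return 0;
}

// Correct fold: still evaluate op1 for its side effects, then yield the constant.
int FoldKeepingSideEffects(std::function<int()> op1)
{
    (void)op1();
    return 0;
}
```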
CORINFO_TYPE_ULONG, simdSize);
    }
}
@tannergooding for when you're reviewing: I believe the code here before was incorrect, as it didn't check the element size (and thus I think it could fire for something like V128.Shuffle(a, V128.Create((byte)2, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0))). I've moved it later and checked vecCns instead; this way it should work for all cases of lane swapping. Please let me know if I'm missing something and should revert it.
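To illustrate the concern with a hedged sketch (not the JIT check itself): the same 16-byte index constant means different shuffles depending on the element size, so a pattern check on the raw constant bytes can misfire if it ignores the element type. Read as four little-endian uints, the constant above is {2, 3, 0, 1} (a swap of the two 8-byte halves), but as byte indices it selects completely different elements:

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// The raw index constant from the example above.
static const std::array<uint8_t, 16> kIndices = {2, 0, 0, 0, 3, 0, 0, 0,
                                                 0, 0, 0, 0, 1, 0, 0, 0};

// Shuffle interpreting the constant as 16 byte indices.
std::array<uint8_t, 16> ShuffleBytes(const std::array<uint8_t, 16>& src)
{
    std::array<uint8_t, 16> r{};
    for (int i = 0; i < 16; i++)
        r[i] = src[kIndices[i] & 0x0F];
    return r;
}

// Shuffle interpreting the same constant as 4 little-endian uint indices,
// permuting 4-byte elements.
std::array<uint8_t, 16> ShuffleUints(const std::array<uint8_t, 16>& src)
{
    std::array<uint8_t, 16> r{};
    for (int i = 0; i < 4; i++)
    {
        uint32_t idx;
        std::memcpy(&idx, &kIndices[i * 4], 4);
        std::memcpy(&r[i * 4], &src[(idx & 0x3) * 4], 4);
    }
    return r;
}
```

For a source holding bytes 0..15, ShuffleUints swaps the two 8-byte halves, while ShuffleBytes produces {2, 0, 0, 0, 3, ...}; so treating the byte-typed shuffle as a half-swap based only on the constant's bit pattern would be wrong.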
src/coreclr/jit/gentree.cpp
// swap the low and high 128-bit lanes
// Vector256<byte> swap = Avx2.Permute2x128(vector, vector, 0b00000001);
GenTree* swap;
if (!op1->IsCnsVec())
{
    GenTree* op1Dup1 = fgMakeMultiUse(&op1);
    GenTree* op1Dup2 = gtCloneExpr(op1Dup1);

    uint8_t control = 1;
    cnsNode          = gtNewIconNode(control, TYP_INT);
    swap = gtNewSimdHWIntrinsicNode(type, op1Dup1, op1Dup2, cnsNode, NI_AVX2_Permute2x128, simdBaseJitType,
                                    simdSize);
}
else
{
    // if we have a constant, keep it constant
    GenTree* op1Dup1 = fgMakeMultiUse(&op1);
    swap             = op1Dup1;

    simd_t* cnsPtr = &op1Dup1->AsVecCon()->gtSimdVal;
    std::swap(cnsPtr->u64[0], cnsPtr->u64[2]);
    std::swap(cnsPtr->u64[1], cnsPtr->u64[3]);
}

// shuffle with both the normal and swapped values
// Vector256<byte> shuf1 = Avx2.Shuffle(vector, indices);
// Vector256<byte> shuf2 = Avx2.Shuffle(swap, indices);
GenTree* shuf1 = gtNewSimdHWIntrinsicNode(type, op1, op2, NI_AVX2_Shuffle, simdBaseJitType, simdSize);
GenTree* shuf2 = gtNewSimdHWIntrinsicNode(type, swap, op2Dup1, NI_AVX2_Shuffle, simdBaseJitType, simdSize);

// get the indices, and xor the cross-lane bit on the high 128-bit lane part of indices.
// V256<byte> indicesXord = indices ^ V256.Create(V128.Create((byte)0), V128.Create((byte)0x10));
simd_t xorCns = {};
xorCns.u64[0] = 0;
xorCns.u64[1] = 0;
xorCns.u64[2] = 0x1010101010101010;
xorCns.u64[3] = 0x1010101010101010;
cnsNode       = gtNewVconNode(type);
cnsNode->AsVecCon()->gtSimdVal = xorCns;

GenTree* indicesXord = gtNewSimdBinOpNode(GT_XOR, type, op2Dup2, cnsNode, simdBaseJitType, simdSize);

// compare our modified indices to 0x0F (the highest value not swapping lane); we get 0xFF when we are
// swapping lane and 0x00 otherwise. We will also get "swapping lane" when the index is 32 or more (but has
// no high bit), but this is normalised later for Shuffle, and acceptable for ShuffleNative.
// V256<byte> selection = Avx2.CompareGreaterThan(indicesXord.AsSByte(), V256.Create((sbyte)0x0F)).AsByte();
simd_t comparandCnd = {};
comparandCnd.u64[0] = 0x0F0F0F0F0F0F0F0F;
comparandCnd.u64[1] = 0x0F0F0F0F0F0F0F0F;
comparandCnd.u64[2] = 0x0F0F0F0F0F0F0F0F;
comparandCnd.u64[3] = 0x0F0F0F0F0F0F0F0F;
cnsNode             = gtNewVconNode(type);
cnsNode->AsVecCon()->gtSimdVal = comparandCnd;
GenTree* selection = gtNewSimdCmpOpNode(GT_GT, type, indicesXord, cnsNode, CORINFO_TYPE_BYTE, simdSize);

// blend our two shuffles based on whether each element swaps lanes or not
// return Avx2.BlendVariable(shuf1, shuf2, selection);
retNode = gtNewSimdHWIntrinsicNode(type, shuf1, shuf2, selection, NI_AVX2_BlendVariable, simdBaseJitType,
@tannergooding for when you review: when I was originally testing this, this code had better performance when it was ordered slightly differently to here (due to better pipelining). Ideally we start swap before shuf1 and indicesXord before shuf1, iirc. Please advise if this is possible and how to do it. Presumably it's something with gtWrapWithSideEffects (?), but I couldn't quite figure it out.

Edit: this is the order I had that performed the best. Ignore the And, that's from the old version.
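As a cross-check of the lowering above, here is a hedged scalar model (illustrative helper names, not runtime or JIT code) of the emulated Vector256<byte> shuffle: shuffle both the original vector and its lane-swapped copy with vpshufb semantics, xor 0x10 into the high-lane indices, treat any (signed) modified index above 0x0F as cross-lane, and blend. As the JIT comment notes, indices of 32 or more without the high bit also take the "swapped" path; that is normalised separately for Shuffle and acceptable for ShuffleNative, so this model is exact only for in-range or high-bit indices.

```cpp
#include <array>
#include <cstdint>

// Scalar model of the AVX2 Vector256<byte> shuffle emulation. vpshufb only
// shuffles within each 128-bit lane, so we shuffle both the original vector
// and a lane-swapped copy, then blend per element based on whether that
// element's index crosses lanes.
std::array<uint8_t, 32> EmulatedShuffle(const std::array<uint8_t, 32>& v,
                                        const std::array<uint8_t, 32>& idx)
{
    // swap = Avx2.Permute2x128(v, v, 1): exchange the low and high 128-bit lanes
    std::array<uint8_t, 32> swap;
    for (int i = 0; i < 16; i++)
    {
        swap[i]      = v[i + 16];
        swap[i + 16] = v[i];
    }

    // vpshufb semantics: a set high bit zeroes the element, otherwise the low
    // 4 bits select a byte within the destination element's own lane.
    auto pshufb = [](const std::array<uint8_t, 32>& src, const std::array<uint8_t, 32>& sel) {
        std::array<uint8_t, 32> r{};
        for (int i = 0; i < 32; i++)
        {
            int lane = (i < 16) ? 0 : 16;
            r[i]     = (sel[i] & 0x80) ? 0 : src[lane + (sel[i] & 0x0F)];
        }
        return r;
    };

    std::array<uint8_t, 32> shuf1 = pshufb(v, idx);
    std::array<uint8_t, 32> shuf2 = pshufb(swap, idx);

    std::array<uint8_t, 32> result;
    for (int i = 0; i < 32; i++)
    {
        // xor the cross-lane bit (0x10) into the high-lane indices, then a
        // signed compare against 0x0F decides whether the element comes from
        // the swapped copy (this is the CompareGreaterThan + BlendVariable).
        uint8_t x        = (uint8_t)(idx[i] ^ ((i < 16) ? 0x00 : 0x10));
        bool crossesLane = (int8_t)x > 0x0F;
        result[i]        = crossesLane ? shuf2[i] : shuf1[i];
    }
    return result;
}
```

For any indices in 0..31 this reproduces the full 32-byte shuffle, and a high-bit index still zeroes its element via the vpshufb path.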
- This is a nice-to-have, not a correctness issue: it matches what we're doing throughout the rest of the function, and keeps the asm more readable imo
- There are other places where they're not normalised. Could be done, but probably not worth the trouble
- Shuffle with variable indices on coreclr (for all types)
- Shuffle on Vector256 (with signed/unsigned bytes and shorts)
- Vector256 shuffle with Avx2.Shuffle (for signed/unsigned bytes and shorts)
- VectorXxx.Shuffle with constant indices when not crossing lanes, when zeroing, etc.

Todo tasks:
- VectorXXX.ShuffleNative for vectors of other element types
- Shuffle in ShuffleNative fallback

Up-to-date codegen (note: the amount of cases I tested is perhaps a bit over the top, so check the C# code to see what cases you'd like to look at before looking through all the assembly; the main ones for non-constant indices are the IndirectIndirect ones; should include most improved scenarios): Codegen

Codegen (outdated; some cases have gotten better, and it doesn't include codegen for all relevant platforms (e.g., AVX-512 and arm64) or all improved scenarios):
- Shuffle With AVX2
- ShuffleUnsafe With AVX2
- Shuffle With Sse4.2
- ShuffleUnsafe With Sse4.2