-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Block layout of partition loop not ideal #108794
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
You are right about the code produced when tiering disabled, however, if we run the tiering and let the JIT collect profile data, we make better decision of pushing out the cold block outside of loop. If you see below, from profile, we find out that Diffs comparison of As for the compactness/reuse portion of loop, that is something we can investigate further. |
Unfortunately, I've been noticing this and many similar block ordering issues too (with .NET 9 regressing some further where careful ordering of the code is no longer sufficient), and tiering does not apply to NativeAOT, which is an important target. I hope that the compiler gaps like these will not be relegated to DPGO cleaning them up. Thanks |
@dotnet/jit-contrib |
If you have any examples handy, I'd like to take a look. We have plans to address block layout more aggressively in .NET 10 (#107749), and it would be nice to have candidates like the above to improve upon. Our block layout plans are quite reliant on profile data, though the JIT frontend has some tricks to determine which blocks are important in the absence of profile data (for example, see |
Don't have anything on hand right now but will keep an eye out next time I open Ghidra 🙂. Some of them look like this one: #93536 except torn loops or "why does this even have jump threading" cases appear under different conditions now. In any case, thanks for looking into this. |
This is good to see - I didn't realise profile data was being used this way. However, this shouldn't need profile data to optimise.
Ok, great. If I see anything else, I'll make sure to raise issues or comment here. |
Consider:
All entries less than then
input[0]
go intoleft
, everything else goes intoright
.This is the main routine inside a quicksort.
On Arm64 this runs at half the performance of the equivalent C++ version compiled by GCC.
I suspect X86 has similar issues but I haven't tried yet.
The C# disassembly looks reasonable:
However the C++ version is a little more cunning:
Due to the block ordering there are fewer branches.
In coreclr, there are always two jumps per loop iteration.
In gcc, there are zero to two jumps per loop iteration.
This difference in block order has a huge impact on performance, halving the performance of the coreclr version
The text was updated successfully, but these errors were encountered: