[GPU] Enable GEMMs to first attempt LLVMGPUTileAndFuse with intrinsic by default #19520
base: main
Conversation
Force-pushed from 38f5a22 to 7d687d7
There are compiler failures in the regression-suite models; converting to draft while I debug.
Force-pushed from 7d687d7 to 7e2cdf8
The problem was missing functionality for GEMMs of type (f16, f16) -> f16. I filed this issue for it.
Force-pushed from e6aa895 to 3bc822c
Found another issue with accumulating GEMMs: #19546
Force-pushed from 2adc85d to 2111358
Also need to disable prefetching when using C promotion, due to this issue: #19612
Force-pushed from 210ef2a to 017e558
Signed-off-by: Nirvedh <[email protected]> Signed-off-by: Nirvedh Meshram <[email protected]>
Signed-off-by: Nirvedh Meshram <[email protected]>
Force-pushed from 017e558 to 982856b
Based on comparisons with iree-kernel-benchmark here, the performance of VectorDistribute vs. TileAndFuse when using intrinsics seems comparable. Note that none of the tests in the sheet used the padding extension available in TileAndFuse after #19484, so it is a fair comparison of the pipelines themselves. In some cases TileAndFuse had a speedup that seems beyond the noise level, and overall it averages out to 1.25x faster.
However, we will be looking at LLAMA and SDXL numbers before actually considering this PR for merging.
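As an aside on how an aggregate figure like the 1.25x above is typically derived from per-kernel benchmark times, here is a minimal sketch. The kernel names and timings below are made up for illustration; they are not from the actual iree-kernel-benchmark sheet. A geometric mean of per-kernel speedups is one common choice for averaging ratios.

```python
# Sketch: aggregate per-kernel speedups (baseline time / candidate time)
# into a single geometric-mean figure. All data here is hypothetical.
from math import prod

# (kernel, vector_distribute_us, tile_and_fuse_us) -- illustrative values only
results = [
    ("gemm_2048", 120.0, 95.0),
    ("gemm_4096", 480.0, 390.0),
    ("gemm_512", 30.0, 31.0),
]

# Speedup > 1 means TileAndFuse is faster than VectorDistribute.
speedups = [vd / taf for _, vd, taf in results]
geo_mean = prod(speedups) ** (1 / len(speedups))
print(f"geometric-mean speedup: {geo_mean:.2f}x")
```

The geometric mean is preferred over the arithmetic mean for ratios because it is symmetric: a 2x speedup and a 2x slowdown cancel out exactly.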
Fixes: #18858