[BACKEND] Fix `__builtin_clz` implementation on Windows (#5774)
This PR fixes a typo in the Windows implementation of `__builtin_clz` that was introduced in #5621. According to [this in-code comment](https://github.com/triton-lang/triton/blob/b3dcc32f387d1d54ccd6cbbbc087296c0539e703/lib/Conversion/TritonGPUToLLVM/Utility.cpp#L12), the Windows implementations were meant to be copied from [this gist snippet](https://gist.github.com/pps83/3210a2f980fd02bb2ba2e5a1fc4a2ef0). In the snippet, however, the `clz` implementation additionally [XORs the result of `_BitScanReverse`](https://gist.github.com/pps83/3210a2f980fd02bb2ba2e5a1fc4a2ef0#file-ctz_clz-cpp-L51-L53) to convert the <i>index of the most significant set bit</i> produced by `_BitScanReverse` into the expected <i>number of leading zeros</i>. I believe the implementation was copied into Triton without the final XOR by accident.

<b>What is affected by this error?</b>

This implementation of CLZ is used in [`pext_i32`](https://github.com/intel/intel-xpu-backend-for-triton/blob/4a9967137548f8fe9b1a93383e4fd12646352231/lib/Conversion/TritonGPUToLLVM/Utility.cpp#L635), which is used in [`delinearize`](https://github.com/intel/intel-xpu-backend-for-triton/blob/4a9967137548f8fe9b1a93383e4fd12646352231/lib/Conversion/TritonGPUToLLVM/Utility.cpp#L662), which in turn is used by the [`ReduceOpToLLVM`](https://github.com/intel/intel-xpu-backend-for-triton/blob/4a9967137548f8fe9b1a93383e4fd12646352231/lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp#L243-L247) pattern. This bug caused `tt.reduce()` ops to be lowered incorrectly on Windows in cases where shared memory is needed to store temporary reduced results.

Signed-off-by: dchigarev <[email protected]>
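For illustration, the difference between the bit index and the leading-zero count can be sketched as follows. This is not the code from the PR or the gist: `bit_scan_reverse`, `clz_without_xor`, `clz_with_xor`, and `self_check` are hypothetical names, and `bit_scan_reverse` is a portable stand-in assumed to behave like MSVC's `_BitScanReverse` for 32-bit inputs, so the sketch compiles on any platform.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical portable stand-in for MSVC's _BitScanReverse (assumption:
// same contract for 32-bit inputs). Writes the zero-based index of the most
// significant set bit to *index and returns non-zero; returns 0 if value == 0.
inline unsigned char bit_scan_reverse(unsigned long *index, std::uint32_t value) {
  if (value == 0)
    return 0;
  unsigned long pos = 31;
  while (((value >> pos) & 1u) == 0)
    --pos;
  *index = pos;
  return 1;
}

// Variant without the final XOR: returns the index of the highest set bit,
// which is not the number of leading zeros.
inline int clz_without_xor(std::uint32_t value) {
  unsigned long index = 0;
  bit_scan_reverse(&index, value);
  return static_cast<int>(index);
}

// Variant with the final XOR: for index in [0, 31], index ^ 31 == 31 - index,
// which is the leading-zero count of a 32-bit value.
// As with __builtin_clz, a zero input is not a meaningful argument.
inline int clz_with_xor(std::uint32_t value) {
  unsigned long index = 0;
  bit_scan_reverse(&index, value);
  return static_cast<int>(index ^ 31);
}

// Small self-check contrasting the two variants on 0x00010000 (bit 16 set).
inline void self_check() {
  assert(clz_without_xor(0x00010000u) == 16); // bit index
  assert(clz_with_xor(0x00010000u) == 15);    // leading zeros
}
```

The two variants agree on no input, since `index` and `index ^ 31` always differ, which is consistent with the description above of every CLZ-dependent computation being affected on Windows.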