[SYCL][COMPAT] Add helper function ternary_logic_op() to perform bitwise logical operations on three input values based on the specified 8-bit truth table #16509

tomflinda · 2025-01-03T01:26:13Z

Signed-off-by: chenwei.sun [email protected]
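
Editor's note on the "8-bit truth table" in the title: for lop3.b32, the PTX ISA describes the table value as the result of applying the desired three-input function to the constants 0xF0, 0xCC and 0xAA. The sketch below illustrates that convention. The helper make_lut and the lut_* functions are hypothetical names used only for illustration; they are not part of syclcompat or of this PR.

```cpp
#include <cstdint>

// Hypothetical helper (not part of syclcompat): builds the 8-bit truth table
// for a three-input bitwise function f by evaluating f on three fixed masks.
// Bit k of each mask holds the value that input takes in truth-table row k,
// where k = (a << 2) | (b << 1) | c.
template <typename F>
inline std::uint8_t make_lut(F f) {
  const std::uint32_t ta = 0xF0u; // a is 1 in rows 4, 5, 6, 7
  const std::uint32_t tb = 0xCCu; // b is 1 in rows 2, 3, 6, 7
  const std::uint32_t tc = 0xAAu; // c is 1 in rows 1, 3, 5, 7
  return static_cast<std::uint8_t>(f(ta, tb, tc) & 0xFFu);
}

// Truth table for a & b & c.
inline std::uint8_t lut_and3() {
  return make_lut([](std::uint32_t a, std::uint32_t b, std::uint32_t c) {
    return a & b & c;
  });
}

// Truth table for a ^ b ^ c.
inline std::uint8_t lut_xor3() {
  return make_lut([](std::uint32_t a, std::uint32_t b, std::uint32_t c) {
    return a ^ b ^ c;
  });
}

// Truth table for (a & b) | c.
inline std::uint8_t lut_and_or() {
  return make_lut([](std::uint32_t a, std::uint32_t b, std::uint32_t c) {
    return (a & b) | c;
  });
}
```

Under this convention, a & b & c corresponds to 0x80, a ^ b ^ c to 0x96, and (a & b) | c to 0xEA.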

JackAKirk · 2025-01-03T11:35:27Z

This is new functionality, so this PR is missing an e2e test.

JackAKirk · 2025-01-03T11:34:09Z

sycl/include/syclcompat/util.hpp

+/// \returns The result
+inline uint32_t lop3(uint32_t a, uint32_t b, uint32_t c, uint8_t lut) {
+  uint32_t result = 0;
+


It is better to use the optimized instructions for backends when available, so that translation does not reduce performance relative to CUDA.

Suggested change

#if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
  asm volatile("lop3.b32 %0, %1, %2, %3, %4;"
               : "=r"(result)
               : "r"(a), "r"(b), "r"(c), "r"(lut));
#else

See later corresponding #endif suggestion


I have refined the helper function. As for keeping the PTX asm code in the helper function for the SYCL CUDA backend, I don't think it is necessary, because SYCLomatic provides the option --optimize-migration. If this option is specified during migration, the PTX asm instruction is kept in the migrated code. Here is a demo case: https://github.com/oneapi-src/SYCLomatic/blob/821800fb720a82403a4488d90ea8233cca45b918/clang/test/dpct/asm/optimize.cu#L11

Doesn't this imply that:

- Users may have to run the SYCLomatic conversion twice when they want to support both an optimized CUDA path and e.g. L0 or other backends (without complex preprocessor directives directly in their source code)? Then they have either two non-portable codes to maintain, or one code with lots of #ifdefs for the CUDA path. Doesn't this go completely against the philosophy of oneAPI?
- Users (that want CUDA performance) won't ever write sycl::compat code themselves (i.e. by reading the documentation for the functions they need), since the above point implies that they won't know what to write for their particular target unless they use the automatic translation?
- sycl::compat will generally be able to maintain performance (non-portably) for at most two backends (CUDA and L0), since this approach dissuades further backend-specific optimized implementations (in a portable manner) for any other backend, e.g. HIP?

Isn't it better to provide the preprocessor #if/#else abstraction inside the sycl (compat) functions to enable simpler portable code?

If this really is the goal of SYCLomatic/sycl::compat, then, so as not to be disingenuous to users, it needs to be explained in appropriate documentation, e.g. somewhere in https://oneapi-src.github.io/SYCLomatic/get_started/index.html:

"SYCLomatic may considerably reduce the performance of translated CUDA code on Nvidia GPUs unless programmers use the option --optimize-migration. If programmers use --optimize-migration, the translation will include preprocessor directives in kernels for optimized CUDA backend paths (to run on Nvidia GPUs)."

It is worth clarifying that of course sometimes it won't make sense to write the asm (either within sycl::compat function or directly)! The example you gave:

// CHECK-NEXT: #if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
// CHECK-NEXT: asm("mov.s32 %0, %1;" : "=r"(a) : "r"(b));
// CHECK-NEXT: #else
// CHECK-NEXT: a = b;
// CHECK-NEXT: #endif

Is actually never going to be better than just a = b since the compiler takes care of the lowering to the ptx instruction in this case.

There is actually a third case, e.g. mov.b32, a PTX instruction that in some very specialized cases might be worth writing directly (since an appropriate lowering might not be available in the compiler). Note that in such a case I'm not sure an appropriate translation to Intel GPUs would generally exist (apart from at a much higher level involving lots of surrounding code, usually for packed types such as fp16x2), since this is usually for low-level hardware feature support. In general I would avoid attempting to translate such code. Such things are typically only used in library code, so I think it would be more sensible for SYCLomatic to emit a message saying this isn't translatable and to consider manual porting with some deeper thinking.

But these are different cases from the one in this PR, which is a very specialized/optimized (but high-level, functional) PTX instruction that apparently doesn't have a corresponding CUDA runtime/math library API. It therefore does not have a compiler lowering (and it probably doesn't make sense to add one), but it does map to a simple high-level sycl::compat function.

In all such cases it is appropriate to deal with these on a case-by-case basis within sycl::compat (or other SYCL headers): such has been the challenge of translators across the ages.

asm volatile("lop3.b32 %0, %1, %2, %3, %4;"
: "=r"(result)
: "r"(a), "r"(b), "r"(c), "r"(lut));

Okay, I accept your advice and will use PTX asm instructions for the SYCL CUDA backend. I have addressed your comments in the updated commit; please take a look.


JackAKirk · 2025-01-03T11:34:55Z

sycl/include/syclcompat/util.hpp

+    // Set the output bit in the result
+    result |= (output_bit << i);
+  }
+


Suggested change

#endif // defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)

Same as above.

Okay, I accept your advice and will use PTX asm instructions for the SYCL CUDA backend. I have addressed your comments in the updated commit; please take a look.
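
Editor's note: the diff context above shows only the tail of the generic loop (the "Set the output bit in the result" step). A minimal sketch of what such a per-bit fallback could look like follows, assuming the lop3 convention that the truth-table row is (a << 2) | (b << 1) | c. The function name ternary_logic_op_generic is hypothetical, and this is not claimed to be the merged implementation.

```cpp
#include <cstdint>

// Hypothetical stand-in for the generic (non-NVPTX) path; the name and exact
// structure are assumptions, not the merged syclcompat code.
inline std::uint32_t ternary_logic_op_generic(std::uint32_t a, std::uint32_t b,
                                              std::uint32_t c,
                                              std::uint8_t lut) {
  std::uint32_t result = 0;
  for (int i = 0; i < 32; ++i) {
    // Bit i of each input selects one of the eight truth-table rows.
    const std::uint32_t a_bit = (a >> i) & 1u;
    const std::uint32_t b_bit = (b >> i) & 1u;
    const std::uint32_t c_bit = (c >> i) & 1u;
    const std::uint32_t row = (a_bit << 2) | (b_bit << 1) | c_bit;
    // Look up the output for that row and place it at bit i of the result.
    const std::uint32_t output_bit =
        (static_cast<std::uint32_t>(lut) >> row) & 1u;
    result |= (output_bit << i);
  }
  return result;
}
```

Because lut is an ordinary function argument here, this path works with a value that is only known at run time, which is the situation discussed later in the review.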

tomflinda · 2025-01-13T03:51:17Z

This is new functionality, so this PR is missing an e2e test.

Added.


JackAKirk · 2025-01-14T11:12:30Z

sycl/include/syclcompat/util.hpp

+#if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
+  asm volatile("lop3.b32 %0, %1, %2, %3, %4;"
+               : "=r"(result)
+               : "r"(a), "r"(b), "r"(c), "n"(lut));


"n"(lut) means (see https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html)

The constraint "n" may be used for immediate integer operands with a known value. Example:

asm("add.u32 %0, %0, %1;" : "=r"(x) : "n"(42));

generates:

add.u32 r1, r1, 42;

As the function is currently written, D does not have to be an immediate integer operand with a known value. So, as you currently have it, I think this will break if uint8_t lut is not a compile-time known value? If you update your test to call syclcompat::ternary_logic_op(A, B, C, D); with a runtime D value directly, instead of using the switch statement in the test, then the test will probably break?

Did you mean to make this a templated function like

template <uint8_t lut> inline uint32_t ternary_logic_op(uint32_t a, uint32_t b, uint32_t c) {

similar to what you have here https://github.com/oneapi-src/SYCLomatic/pull/2592/files#diff-982ab0caadb86096f0fbd5ff5436717a83adf3feccfe149d3525f3725bff9af7R44
?

In which case you could update this function to a template as above and leave the ptx as it is. Alternatively you could replace "n" with "r", and then add new cases to your test to test runtime passing of syclcompat::ternary_logic_op(A, B, C, D);

Since the l0 path is the priority then the above recommended change can be considered a nit and I'll approve as is.

Hi @JackAKirk
The case (https://github.com/tomflinda/SYCLomatic/blob/821800fb720a82403a4488d90ea8233cca45b918/clang/test/dpct/asm/lop3.cu#L44) is a lit test that verifies SYCLomatic migration. For the helper function, we do not necessarily restrict the last parameter of ternary_logic_op to a compile-time known value.

In that case it will only be correct with the suggested changes described in #16509 (comment)
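
Editor's note: the templated alternative raised in this thread can be sketched as below. The table is a template argument, so it is a constant expression wherever the "n" constraint needs one, and the #if/#else lives inside the helper rather than in user code. The name ternary_logic_op_ct is hypothetical, the NVPTX branch is not exercised by this sketch, and the generic branch uses a sum-of-minterms form chosen for illustration rather than taken from the PR.

```cpp
#include <cstdint>

// Hypothetical name; a sketch of the templated alternative, not the merged code.
template <std::uint8_t Lut>
inline std::uint32_t ternary_logic_op_ct(std::uint32_t a, std::uint32_t b,
                                         std::uint32_t c) {
#if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
  // Lut is a template argument, so it is a compile-time constant and can be
  // passed through the "n" (immediate) constraint. Assumption: this branch is
  // untested in this sketch.
  std::uint32_t result;
  asm volatile("lop3.b32 %0, %1, %2, %3, %4;"
               : "=r"(result)
               : "r"(a), "r"(b), "r"(c), "n"(Lut));
  return result;
#else
  // Generic path: OR together the minterms whose truth-table bit is set.
  // Row index is (a << 2) | (b << 1) | c, so bit 2 of the row selects a or ~a.
  std::uint32_t result = 0;
  for (unsigned row = 0; row < 8; ++row) {
    if ((Lut >> row) & 1u) {
      const std::uint32_t ma = (row & 4u) ? a : ~a;
      const std::uint32_t mb = (row & 2u) ? b : ~b;
      const std::uint32_t mc = (row & 1u) ? c : ~c;
      result |= ma & mb & mc;
    }
  }
  return result;
#endif
}
```

A caller would write, for example, ternary_logic_op_ct<0x96>(a, b, c) for a three-way XOR. The trade-off is the one discussed above: a table value known only at run time cannot be passed this way.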

JackAKirk

Approved with recommended changes suggested.

Signed-off-by: chenwei.sun <[email protected]>

tomflinda · 2025-01-16T00:13:47Z

Approved with recommended changes suggested.

@JackAKirk thanks.

zhiweij1 · 2025-01-20T08:29:15Z

@intel/llvm-gatekeepers this is ready to merge

tomflinda requested a review from a team as a code owner January 3, 2025 01:26


tomflinda mentioned this pull request Jan 3, 2025

[SYCLomatic][PTX] Refine migration of PTX asm instruction "lop3.b32" oneapi-src/SYCLomatic#2592

Merged


JackAKirk requested changes Jan 3, 2025


tomflinda force-pushed the add_lop3_helper_function branch from 7cca4f5 to 04825c6 Compare January 13, 2025 03:50


tomflinda added 4 commits January 14, 2025 02:50

[SYCL][COMPAT] Add helper function lop3() to perform bitwise logical …

1321810

…operations on three input values based on the specified 8-bit truth table Signed-off-by: chenwei.sun <[email protected]>

Add e2e test

abfa035

Signed-off-by: chenwei.sun <[email protected]>

Refine helper function and update e2e test

04825c6

Signed-off-by: chenwei.sun <[email protected]>

Fix ci test failure

a4c2f49

Signed-off-by: chenwei.sun <[email protected]>


JackAKirk requested changes Jan 14, 2025


JackAKirk approved these changes Jan 14, 2025


Address review comments

e2a7c0f

Signed-off-by: chenwei.sun <[email protected]>

martygrant merged commit 160509b into intel:sycl Jan 20, 2025
17 checks passed
