Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][COMPAT] Add helper function ternary_logic_op() to perform bitwise logical operations on three input values based on the specified 8-bit truth table #16509

Merged
merged 5 commits into from
Jan 20, 2025

Conversation

tomflinda
Copy link
Contributor

Signed-off-by: chenwei.sun [email protected]

@JackAKirk
Copy link
Contributor

This is new functionality so this PR is missing a e2e test.

/// \returns The result
inline uint32_t lop3(uint32_t a, uint32_t b, uint32_t c, uint8_t lut) {
uint32_t result = 0;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to use the optimized instructions for backends when available, so that translation does not reduce performance wrt cuda.

Suggested change
#if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
asm volatile("lop3.b32 %0, %1, %2, %3, %4;"
: "=r"(result)
: "r"(a), "r"(b), "r"(c), "r"(lut));
#else

See later corresponding #endif suggestion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to use the optimized instructions for backends when available, so that translation does not reduce performance wrt cuda.

See later corresponding #endif suggestion

I have refined the helper function. As to keeping asm PTX code in the helper function for SYCL CUDA backend, I think it is not necessary, as SYCLomatic has provided the option “--optimize-migration.” If this option is specified during migration, the PTX asm instruction will be kept in the migrated code, here is the demo case https://github.com/oneapi-src/SYCLomatic/blob/821800fb720a82403a4488d90ea8233cca45b918/clang/test/dpct/asm/optimize.cu#L11

Copy link
Contributor

@JackAKirk JackAKirk Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this imply that

  • users may use syclomatic conversion potentially twice in cases where they want to support both cuda supported (optimized) and e.g. l0 or other backends (but without complex preprocessor directive directly in source code)? Then they have either two non-portable codes to maintain, or a code with lots of #ifdefs in for the cuda path: Doesn't this go completely against the philosophy of oneapi?
  • Users (that want cuda performance) won't ever write sycl::compat code themselves (i.e. via reading documentation for functions they need), since the above point implies that they won't know what to write for their particular target unless they use the automatic translation?

The above situation also implies that sycl::compat will be able to generally maintain performance (non portably) for at most two backends (cuda and l0), since it dissuades further backend specific optimized implementation (in a portable manner) for any other backend e.g. HIP.

Isn't it better to provide the preprocessor #if #else abstraction inside of the sycl (compat) functions to enable simpler portable code?

Copy link
Contributor

@JackAKirk JackAKirk Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is really the goal of syclomatic/sycl::compat: then in order to not be disingenuous to users this needs to be explained in appropriate documentation: e.g.

somewhere in https://oneapi-src.github.io/SYCLomatic/get_started/index.html

"

  • syclomatic may considerably reduce the performance of a translated cuda code on Nvidia GPUs unless programmers use the option --optimize-migration
  • If programmers use --optimize-migration then this translation will include preprocessor directives in kernels for optimized cuda backend paths (to run on Nvidia GPUs).
    "

Copy link
Contributor

@JackAKirk JackAKirk Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is worth clarifying that of course sometimes it won't make sense to write the asm (either within sycl::compat function or directly)! The example you gave:

// CHECK-NEXT: #if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
// CHECK-NEXT:   asm("mov.s32 %0, %1;" : "=r"(a) : "r"(b));
// CHECK-NEXT: #else
// CHECK-NEXT:   a = b;
// CHECK-NEXT: #endif

Is actually never going to be better than just a = b since the compiler takes care of the lowering to the ptx instruction in this case.

There is actually a third case e.g. mov.b32 which is a ptx instruction that in some very specialized cases might be good to write directly (since an appropriate lowering might not be available in the compiler). Note that in such a case, I'm not sure an appropriate translation to intel gpus would generally exist (apart from at a much higher level involving lots of surrounding code: in this case usually for packed types: e.g. fp16x2), since this is usually for low level hardware feature support. In general I would avoid attempting to translate such code: such things are only typically used in library codes: it would be I think more sensible for syclomatic to give a message saying this isn't translatable and to consider manual porting with some deeper thinking.

But these are different cases to the one in this PR: which is a very specialized/optimized (but high level functional) ptx instruction (that apparently doesn't have a corresponding cuda runtime/math lib api) that therefore does not have a compiler lowering (and probably it doesn't make sense to add one), but does map to a simple high level sycl::compat function.

But in all such cases it is appropriate to deal with these on a case by case basis within sycl::compat (or other sycl headers): such has been the challenge of translators across the ages.

Copy link
Contributor Author

@tomflinda tomflinda Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asm volatile("lop3.b32 %0, %1, %2, %3, %4;"
: "=r"(result)
: "r"(a), "r"(b), "r"(c), "r"(lut));

Okay, accept your advice and use ASM PTX instructions for the SYCL CUDA backend; I have addressed your comments in the updated commit; pls take a look.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this imply that

  • users may use syclomatic conversion potentially twice in cases where they want to support both cuda supported (optimized) and e.g. l0 or other backends (but without complex preprocessor directive directly in source code)? Then they have either two non-portable codes to maintain, or a code with lots of #ifdefs in for the cuda path: Doesn't this go completely against the philosophy of oneapi?
  • Users (that want cuda performance) won't ever write sycl::compat code themselves (i.e. via reading documentation for functions they need), since the above point implies that they won't know what to write for their particular target unless they use the automatic translation?

The above situation also implies that sycl::compat will be able to generally maintain performance (non portably) for at most two backends (cuda and l0), since it dissuades further backend specific optimized implementation (in a portable manner) for any other backend e.g. HIP.

Isn't it better to provide the preprocessor #if #else abstraction inside of the sycl (compat) functions to enable simpler portable code?

Okay, accept your advice and use ASM PTX instructions for the SYCL CUDA backend; I have addressed your comments in the updated commit; pls take a look.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this imply that

  • users may use syclomatic conversion potentially twice in cases where they want to support both cuda supported (optimized) and e.g. l0 or other backends (but without complex preprocessor directive directly in source code)? Then they have either two non-portable codes to maintain, or a code with lots of #ifdefs in for the cuda path: Doesn't this go completely against the philosophy of oneapi?
  • Users (that want cuda performance) won't ever write sycl::compat code themselves (i.e. via reading documentation for functions they need), since the above point implies that they won't know what to write for their particular target unless they use the automatic translation?

The above situation also implies that sycl::compat will be able to generally maintain performance (non portably) for at most two backends (cuda and l0), since it dissuades further backend specific optimized implementation (in a portable manner) for any other backend e.g. HIP.

Isn't it better to provide the preprocessor #if #else abstraction inside of the sycl (compat) functions to enable simpler portable code?

Okay, accept your advice and use ASM PTX instructions for the SYCL CUDA backend; I have addressed your comments in the updated commit; pls take a look.

// Set the output bit in the result
result |= (output_bit << i);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#endif // defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

Copy link
Contributor Author

@tomflinda tomflinda Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, accept your advice and use ASM PTX instructions for the SYCL CUDA backend; I have addressed your comments in the updated commit; pls take a look.

@tomflinda tomflinda force-pushed the add_lop3_helper_function branch from 7cca4f5 to 04825c6 Compare January 13, 2025 03:50
@tomflinda tomflinda changed the title [SYCL][COMPAT] Add helper function lop3() to perform bitwise logical operations on three input values based on the specified 8-bit truth table [SYCL][COMPAT] Add helper function ternary_logic_op() to perform bitwise logical operations on three input values based on the specified 8-bit truth table Jan 13, 2025
@tomflinda
Copy link
Contributor Author

This is new functionality so this PR is missing a e2e test.

Added.

…operations on three input values based on the specified 8-bit truth table

Signed-off-by: chenwei.sun <[email protected]>
Signed-off-by: chenwei.sun <[email protected]>
Signed-off-by: chenwei.sun <[email protected]>
#if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
asm volatile("lop3.b32 %0, %1, %2, %3, %4;"
: "=r"(result)
: "r"(a), "r"(b), "r"(c), "n"(lut));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"n"(lut) means (see https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html)

*The constraint "n" may be used for immediate integer operands with a known value. Example:

asm("add.u32 %0, %0, %1;" : "=r"(x) : "n"(42));

generates:

add.u32 r1, r1, 42;
*

As the feature is currently written, it allows that D doesn't have to be an immediate integer operand with a known value.
So as you currently have it, I think this will break if uint8_t lut is not a compile time known value? Currently if you update your test to call syclcompat::ternary_logic_op(A, B, C, D); with a runtime D value directly instead of using the switch statement in the test, then the test will probably break?

Did you mean to make this a templated function like

template <uint8_t lut>
inline uint32_t ternary_logic_op(uint32_t a, uint32_t b, uint32_t c,
                                 uint8_t lut) {

similar to what you have here https://github.com/oneapi-src/SYCLomatic/pull/2592/files#diff-982ab0caadb86096f0fbd5ff5436717a83adf3feccfe149d3525f3725bff9af7R44
?

In which case you could update this function to a template as above and leave the ptx as it is. Alternatively you could replace "n" with "r", and then add new cases to your test to test runtime passing of syclcompat::ternary_logic_op(A, B, C, D);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the l0 path is the priority then the above recommended change can be considered a nit and I'll approve as is.

Copy link
Contributor Author

@tomflinda tomflinda Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"n"(lut) means (see https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html)

*The constraint "n" may be used for immediate integer operands with a known value. Example:

asm("add.u32 %0, %0, %1;" : "=r"(x) : "n"(42));

generates:

add.u32 r1, r1, 42; *

As the feature is currently written, it allows that D doesn't have to be an immediate integer operand with a known value. So as you currently have it, I think this will break if uint8_t lut is not a compile time known value? Currently if you update your test to call syclcompat::ternary_logic_op(A, B, C, D); with a runtime D value directly instead of using the switch statement in the test, then the test will probably break?

Did you mean to make this a templated function like

template <uint8_t lut>
inline uint32_t ternary_logic_op(uint32_t a, uint32_t b, uint32_t c,
                                 uint8_t lut) {

similar to what you have here https://github.com/oneapi-src/SYCLomatic/pull/2592/files#diff-982ab0caadb86096f0fbd5ff5436717a83adf3feccfe149d3525f3725bff9af7R44 ?

In which case you could update this function to a template as above and leave the ptx as it is. Alternatively you could replace "n" with "r", and then add new cases to your test to test runtime passing of syclcompat::ternary_logic_op(A, B, C, D);

Hi @JackAKirk
The case(https://github.com/tomflinda/SYCLomatic/blob/821800fb720a82403a4488d90ea8233cca45b918/clang/test/dpct/asm/lop3.cu#L44) is a lit test to verify SYCLomatic migration. For the helper function, we do not necessarily limit the last parameter of ternary_logic_op as a compile-time known value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"n"(lut) means (see https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html)
*The constraint "n" may be used for immediate integer operands with a known value. Example:
asm("add.u32 %0, %0, %1;" : "=r"(x) : "n"(42));
generates:
add.u32 r1, r1, 42; *
As the feature is currently written, it allows that D doesn't have to be an immediate integer operand with a known value. So as you currently have it, I think this will break if uint8_t lut is not a compile time known value? Currently if you update your test to call syclcompat::ternary_logic_op(A, B, C, D); with a runtime D value directly instead of using the switch statement in the test, then the test will probably break?
Did you mean to make this a templated function like

template <uint8_t lut>
inline uint32_t ternary_logic_op(uint32_t a, uint32_t b, uint32_t c,
                                 uint8_t lut) {

similar to what you have here https://github.com/oneapi-src/SYCLomatic/pull/2592/files#diff-982ab0caadb86096f0fbd5ff5436717a83adf3feccfe149d3525f3725bff9af7R44 ?
In which case you could update this function to a template as above and leave the ptx as it is. Alternatively you could replace "n" with "r", and then add new cases to your test to test runtime passing of syclcompat::ternary_logic_op(A, B, C, D);

Hi @JackAKirk The case(https://github.com/tomflinda/SYCLomatic/blob/821800fb720a82403a4488d90ea8233cca45b918/clang/test/dpct/asm/lop3.cu#L44) is a lit test to verify SYCLomatic migration. For the helper function, we do not necessarily limit the last parameter of ternary_logic_op as a compile-time known value.

In that case it will only be correct with the suggested changes described in #16509 (comment)

Copy link
Contributor

@JackAKirk JackAKirk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with recommended changes suggested.

Signed-off-by: chenwei.sun <[email protected]>
@tomflinda
Copy link
Contributor Author

Approved with recommended changes suggested.

@JackAKirk thanks.

@zhiweij1
Copy link
Contributor

@intel/llvm-gatekeepers this is ready to merge

@martygrant martygrant merged commit 160509b into intel:sycl Jan 20, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants