-
Notifications
You must be signed in to change notification settings - Fork 754
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL][COMPAT] Add helper function ternary_logic_op() to perform bitwise logical operations on three input values based on the specified 8-bit truth table #16509
Conversation
This is new functionality so this PR is missing a e2e test. |
sycl/include/syclcompat/util.hpp
Outdated
/// \returns The result | ||
inline uint32_t lop3(uint32_t a, uint32_t b, uint32_t c, uint8_t lut) { | ||
uint32_t result = 0; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better to use the optimized instructions for backends when available, so that translation does not reduce performance wrt cuda.
#if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__) | |
asm volatile("lop3.b32 %0, %1, %2, %3, %4;" | |
: "=r"(result) | |
: "r"(a), "r"(b), "r"(c), "r"(lut)); | |
#else |
See later corresponding #endif suggestion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better to use the optimized instructions for backends when available, so that translation does not reduce performance wrt cuda.
See later corresponding #endif suggestion
I have refined the helper function. As to keeping asm PTX code in the helper function for SYCL CUDA backend, I think it is not necessary, as SYCLomatic has provided the option “--optimize-migration.” If this option is specified during migration, the PTX asm instruction will be kept in the migrated code, here is the demo case https://github.com/oneapi-src/SYCLomatic/blob/821800fb720a82403a4488d90ea8233cca45b918/clang/test/dpct/asm/optimize.cu#L11
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this imply that
- users may use syclomatic conversion potentially twice in cases where they want to support both cuda supported (optimized) and e.g. l0 or other backends (but without complex preprocessor directive directly in source code)? Then they have either two non-portable codes to maintain, or a code with lots of #ifdefs in for the cuda path: Doesn't this go completely against the philosophy of oneapi?
- Users (that want cuda performance) won't ever write sycl::compat code themselves (i.e. via reading documentation for functions they need), since the above point implies that they won't know what to write for their particular target unless they use the automatic translation?
The above situation also implies that sycl::compat will be able to generally maintain performance (non portably) for at most two backends (cuda and l0), since it dissuades further backend specific optimized implementation (in a portable manner) for any other backend e.g. HIP.
Isn't it better to provide the preprocessor #if #else abstraction inside of the sycl (compat) functions to enable simpler portable code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is really the goal of syclomatic/sycl::compat: then in order to not be disingenuous to users this needs to be explained in appropriate documentation: e.g.
somewhere in https://oneapi-src.github.io/SYCLomatic/get_started/index.html
"
- syclomatic may considerably reduce the performance of a translated cuda code on Nvidia GPUs unless programmers use the option
--optimize-migration
- If programmers use
--optimize-migration
then this translation will include preprocessor directives in kernels for optimized cuda backend paths (to run on Nvidia GPUs).
"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is worth clarifying that of course sometimes it won't make sense to write the asm (either within sycl::compat function or directly)! The example you gave:
// CHECK-NEXT: #if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__)
// CHECK-NEXT: asm("mov.s32 %0, %1;" : "=r"(a) : "r"(b));
// CHECK-NEXT: #else
// CHECK-NEXT: a = b;
// CHECK-NEXT: #endif
Is actually never going to be better than just a = b
since the compiler takes care of the lowering to the ptx instruction in this case.
There is actually a third case e.g. mov.b32
which is a ptx instruction that in some very specialized cases might be good to write directly (since an appropriate lowering might not be available in the compiler). Note that in such a case, I'm not sure an appropriate translation to intel gpus would generally exist (apart from at a much higher level involving lots of surrounding code: in this case usually for packed types: e.g. fp16x2), since this is usually for low level hardware feature support. In general I would avoid attempting to translate such code: such things are only typically used in library codes: it would be I think more sensible for syclomatic to give a message saying this isn't translatable and to consider manual porting with some deeper thinking.
But these are different cases to the one in this PR: which is a very specialized/optimized (but high level functional) ptx instruction (that apparently doesn't have a corresponding cuda runtime/math lib api) that therefore does not have a compiler lowering (and probably it doesn't make sense to add one), but does map to a simple high level sycl::compat function.
But in all such cases it is appropriate to deal with these on a case by case basis within sycl::compat (or other sycl headers): such has been the challenge of translators across the ages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
asm volatile("lop3.b32 %0, %1, %2, %3, %4;"
: "=r"(result)
: "r"(a), "r"(b), "r"(c), "r"(lut));
Okay, accept your advice and use ASM PTX instructions for the SYCL CUDA backend; I have addressed your comments in the updated commit; pls take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this imply that
- users may use syclomatic conversion potentially twice in cases where they want to support both cuda supported (optimized) and e.g. l0 or other backends (but without complex preprocessor directive directly in source code)? Then they have either two non-portable codes to maintain, or a code with lots of #ifdefs in for the cuda path: Doesn't this go completely against the philosophy of oneapi?
- Users (that want cuda performance) won't ever write sycl::compat code themselves (i.e. via reading documentation for functions they need), since the above point implies that they won't know what to write for their particular target unless they use the automatic translation?
The above situation also implies that sycl::compat will be able to generally maintain performance (non portably) for at most two backends (cuda and l0), since it dissuades further backend specific optimized implementation (in a portable manner) for any other backend e.g. HIP.
Isn't it better to provide the preprocessor #if #else abstraction inside of the sycl (compat) functions to enable simpler portable code?
Okay, accept your advice and use ASM PTX instructions for the SYCL CUDA backend; I have addressed your comments in the updated commit; pls take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this imply that
- users may use syclomatic conversion potentially twice in cases where they want to support both cuda supported (optimized) and e.g. l0 or other backends (but without complex preprocessor directive directly in source code)? Then they have either two non-portable codes to maintain, or a code with lots of #ifdefs in for the cuda path: Doesn't this go completely against the philosophy of oneapi?
- Users (that want cuda performance) won't ever write sycl::compat code themselves (i.e. via reading documentation for functions they need), since the above point implies that they won't know what to write for their particular target unless they use the automatic translation?
The above situation also implies that sycl::compat will be able to generally maintain performance (non portably) for at most two backends (cuda and l0), since it dissuades further backend specific optimized implementation (in a portable manner) for any other backend e.g. HIP.
Isn't it better to provide the preprocessor #if #else abstraction inside of the sycl (compat) functions to enable simpler portable code?
Okay, accept your advice and use ASM PTX instructions for the SYCL CUDA backend; I have addressed your comments in the updated commit; pls take a look.
sycl/include/syclcompat/util.hpp
Outdated
// Set the output bit in the result | ||
result |= (output_bit << i); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#endif // defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, accept your advice and use ASM PTX instructions for the SYCL CUDA backend; I have addressed your comments in the updated commit; pls take a look.
7cca4f5
to
04825c6
Compare
Added. |
…operations on three input values based on the specified 8-bit truth table Signed-off-by: chenwei.sun <[email protected]>
Signed-off-by: chenwei.sun <[email protected]>
Signed-off-by: chenwei.sun <[email protected]>
Signed-off-by: chenwei.sun <[email protected]>
#if defined(__SYCL_DEVICE_ONLY__) && defined(__NVPTX__) | ||
asm volatile("lop3.b32 %0, %1, %2, %3, %4;" | ||
: "=r"(result) | ||
: "r"(a), "r"(b), "r"(c), "n"(lut)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"n"(lut)
means (see https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html)
*The constraint "n" may be used for immediate integer operands with a known value. Example:
asm("add.u32 %0, %0, %1;" : "=r"(x) : "n"(42));
generates:
add.u32 r1, r1, 42;
*
As the feature is currently written, it allows that D doesn't have to be an immediate integer operand with a known value.
So as you currently have it, I think this will break if uint8_t lut
is not a compile time known value? Currently if you update your test to call syclcompat::ternary_logic_op(A, B, C, D);
with a runtime D value directly instead of using the switch statement in the test, then the test will probably break?
Did you mean to make this a templated function like
template <uint8_t lut>
inline uint32_t ternary_logic_op(uint32_t a, uint32_t b, uint32_t c,
uint8_t lut) {
similar to what you have here https://github.com/oneapi-src/SYCLomatic/pull/2592/files#diff-982ab0caadb86096f0fbd5ff5436717a83adf3feccfe149d3525f3725bff9af7R44
?
In which case you could update this function to a template as above and leave the ptx as it is. Alternatively you could replace "n" with "r", and then add new cases to your test to test runtime passing of syclcompat::ternary_logic_op(A, B, C, D);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the l0 path is the priority then the above recommended change can be considered a nit and I'll approve as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"n"(lut)
means (see https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html)*The constraint "n" may be used for immediate integer operands with a known value. Example:
asm("add.u32 %0, %0, %1;" : "=r"(x) : "n"(42));
generates:
add.u32 r1, r1, 42; *
As the feature is currently written, it allows that D doesn't have to be an immediate integer operand with a known value. So as you currently have it, I think this will break if
uint8_t lut
is not a compile time known value? Currently if you update your test to callsyclcompat::ternary_logic_op(A, B, C, D);
with a runtime D value directly instead of using the switch statement in the test, then the test will probably break?Did you mean to make this a templated function like
template <uint8_t lut> inline uint32_t ternary_logic_op(uint32_t a, uint32_t b, uint32_t c, uint8_t lut) {
similar to what you have here https://github.com/oneapi-src/SYCLomatic/pull/2592/files#diff-982ab0caadb86096f0fbd5ff5436717a83adf3feccfe149d3525f3725bff9af7R44 ?
In which case you could update this function to a template as above and leave the ptx as it is. Alternatively you could replace "n" with "r", and then add new cases to your test to test runtime passing of
syclcompat::ternary_logic_op(A, B, C, D);
Hi @JackAKirk
The case(https://github.com/tomflinda/SYCLomatic/blob/821800fb720a82403a4488d90ea8233cca45b918/clang/test/dpct/asm/lop3.cu#L44) is a lit test to verify SYCLomatic migration. For the helper function, we do not necessarily limit the last parameter of ternary_logic_op as a compile-time known value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"n"(lut)
means (see https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html)
*The constraint "n" may be used for immediate integer operands with a known value. Example:
asm("add.u32 %0, %0, %1;" : "=r"(x) : "n"(42));
generates:
add.u32 r1, r1, 42; *
As the feature is currently written, it allows that D doesn't have to be an immediate integer operand with a known value. So as you currently have it, I think this will break ifuint8_t lut
is not a compile time known value? Currently if you update your test to callsyclcompat::ternary_logic_op(A, B, C, D);
with a runtime D value directly instead of using the switch statement in the test, then the test will probably break?
Did you mean to make this a templated function liketemplate <uint8_t lut> inline uint32_t ternary_logic_op(uint32_t a, uint32_t b, uint32_t c, uint8_t lut) {
similar to what you have here https://github.com/oneapi-src/SYCLomatic/pull/2592/files#diff-982ab0caadb86096f0fbd5ff5436717a83adf3feccfe149d3525f3725bff9af7R44 ?
In which case you could update this function to a template as above and leave the ptx as it is. Alternatively you could replace "n" with "r", and then add new cases to your test to test runtime passing ofsyclcompat::ternary_logic_op(A, B, C, D);
Hi @JackAKirk The case(https://github.com/tomflinda/SYCLomatic/blob/821800fb720a82403a4488d90ea8233cca45b918/clang/test/dpct/asm/lop3.cu#L44) is a lit test to verify SYCLomatic migration. For the helper function, we do not necessarily limit the last parameter of ternary_logic_op as a compile-time known value.
In that case it will only be correct with the suggested changes described in #16509 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved with recommended changes suggested.
Signed-off-by: chenwei.sun <[email protected]>
@JackAKirk thanks. |
@intel/llvm-gatekeepers this is ready to merge |
Signed-off-by: chenwei.sun [email protected]