-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide generic and safe C++ interfaces for warp shuffle: Issue #2976 #3210
base: main
Are you sure you want to change the base?
Conversation
thanks for the contribution, @soumikiith. I have a couple of initial comments.
|
I updated #2976 to better formalize the features and checks of these functions |
One Question: While computing laneid, can I use modulo operator ? Or is the preferable way to fetch it directly from assembly using asm instructions? Note that my doubt is only in the context of shfl_up and shfl_down. Also, why does a mask value need to be passed (I know that the default value is assigned) in shfl_xor? Is not passing lanemask sufficient ? |
you can use C++ API for PTX, see https://nvidia.github.io/cccl/libcudacxx/ptx/instructions/special_registers.html#laneid
Referring to the official documentation, |
Hi, I have added the checks (I need to fix the assertion statements, though). Please check them and let me know if this is meeting your expected requirements. I will soon commit the casting of different data types using Please let me know of any additional requirements. |
Hi, Merry Christmas !! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please also add the related tests
_LIBCUDACXX_BEGIN_NAMESPACE_CUDA | ||
template <typename T> | ||
constexpr bool is_supported_type_v = false; | ||
template <> constexpr bool is_supported_type_v<int> = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Important. Please don't specialize for a fixed set of types. shuffle
needs to work with any trivially copyable (and construcible) data type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the documentation the warp level instructions only accept a set of arithmetic types https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we can get away with
template<class _Tp>
_CCCL_INLINE_VAR constexpr bool __can_warp_shuffle_v = (_CUDA_VSTD::is_arithmetic_v<_Tp> && sizeof(_Tp) >= sizeof(int)) || _CUDA_VSTD::__is_extended_floating_point_v<_Tp>
Not that
- This is a nonpublic helper so it needs to be
__ugly
that is also true for the template arguments - This is only valid if
_CCCL_HAS_NO_VARIABLE_TEMPLATES
is defined, otherwise you need to define a struct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is very common in CUDA to use warp shuffle to move types outside of the standard accepted types. Many libraries provide their own method for moving generic types. It makes sense to extend these functions for any trivially copyable type
|
||
//Input validation for shuffle operations | ||
void _CCCL_DEVICE validate_shuffle_inputs(int width, unsigned mask) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion. I would call this function validate_width_mask
//Input validation for shuffle operations | ||
void _CCCL_DEVICE validate_shuffle_inputs(int width, unsigned mask) | ||
{ | ||
_CCCL_ASSERT((width <= warpSize), "Width must not exceed warp size"); // width must not exceed warp size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Important: width
must be greater or equal than zero
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please drop the additional comments. The wording of the assert should suffice
#endif | ||
|
||
//Input validation for shuffle operations | ||
void _CCCL_DEVICE validate_shuffle_inputs(int width, unsigned mask) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Important: Please use fixed-size integers provided by cuda/std/cstdint
, e.g. ::cuda::std::uint32_t
{ | ||
_CCCL_ASSERT((width <= warpSize), "Width must not exceed warp size"); // width must not exceed warp size | ||
_CCCL_ASSERT((mask & __activemask()) == mask, "Mask must be a subset of the active mask"); // mask must be a subset of __activemask() | ||
_CCCL_ASSERT((width > 0 && (width & (width - 1)) == 0), "Width must be a power of two"); // width must be a power of two |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion. Please use ::cuda::std::has_single_bit
function instead
_CCCL_ASSERT(is_supported_type_v<T>, "T must be a supported type for warp shuffle operations"); // T must be a supported type for warp shuffle operations | ||
validate_shuffle_inputs(width, mask); // validate inputs (width and mask) | ||
_CCCL_ASSERT((srcLane >= 0 && srcLane < width), "srcLane must be in the range [0, width)"); // srcLane must be in the range [0, width) | ||
//scrLane mustbe part of mask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mustbe
-> must be
//implement the logic for shfl | ||
uint32_t buffer[sizeof(T) / sizeof(uint32_t)+1]; | ||
int numElements; | ||
to_32bitBuffer(var, buffer, numElements); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion. This logic can be greatly improved by using the copy semantic of cuda::std::array
returned by to_32bitBuffer
_CCCL_DEVICE void to_32bitBuffer(T& var, uint32_t* outArray, int& numElements) | ||
{ | ||
constexpr size_t typeSize = sizeof(T); | ||
constexpr int elements = (typeSize + sizeof(uint32_t) - 1) / sizeof(uint32_t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion. Use cuda::ceil_div
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also use cuda::std::bit_cast
uint32_t buffer[sizeof(T) / sizeof(uint32_t)+1]; | ||
int numElements; | ||
to_32bitBuffer(var, buffer, numElements); | ||
for(int i=0;i<numElements;i++) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Important. All loops with fixed number of iterations should be marked with #pragma unroll
@@ -0,0 +1,161 @@ | |||
|
|||
#ifndef _CUDA_FUNCTIONAL_SHUFFLE_SAFETY_H |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer if we could rename this to warp_shuffle.h or just warp.h, because that describes the intrinsics better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree. I like warp_shuffle.h
…::std::standard int instead of int. 3) Improved logic for to_32bitBuffer. 4) Used #pragma unroll before every loop. 5) Used cuda::std::array instead of normal array declarations and improved logic.
#include <cuda/std/type_traits> | ||
#include <cuda/std/bit> | ||
#include <cuda/std/memory> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please include the relevant subheaders only
#include <cuda/std/__cmath/nvfp16.h> | ||
#include <cuda/std/__cmath/nvbf16.h> | ||
|
||
#define _CCCL_HAS_CUDA_COMPILER 1 //fix for now -- to be deleted later |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this needs to go
#include <cuda/std/__cmath/nvfp16.h> | ||
#include <cuda/std/__cmath/nvbf16.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe those are the wrong includes, we probably only want to know whether something is an extended floating point type
_LIBCUDACXX_BEGIN_NAMESPACE_CUDA | ||
template <typename T> | ||
constexpr bool is_supported_type_v = false; | ||
template <> constexpr bool is_supported_type_v<int> = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we can get away with
template<class _Tp>
_CCCL_INLINE_VAR constexpr bool __can_warp_shuffle_v = (_CUDA_VSTD::is_arithmetic_v<_Tp> && sizeof(_Tp) >= sizeof(int)) || _CUDA_VSTD::__is_extended_floating_point_v<_Tp>
Not that
- This is a nonpublic helper so it needs to be
__ugly
that is also true for the template arguments - This is only valid if
_CCCL_HAS_NO_VARIABLE_TEMPLATES
is defined, otherwise you need to define a struct
//Input validation for shuffle operations | ||
void _CCCL_DEVICE validate_shuffle_inputs(int width, unsigned mask) | ||
{ | ||
_CCCL_ASSERT((width <= warpSize), "Width must not exceed warp size"); // width must not exceed warp size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please drop the additional comments. The wording of the assert should suffice
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is missing our license header
@@ -0,0 +1,161 @@ | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is missing the license header
template <typename T> | ||
constexpr bool is_supported_type = cuda::std::is_trivially_copyable<T>::value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to be named something like __can_warp_shuffle
. Also this also needs to
- exclude types smaller than int
- allow extended floating point types like
__half
and__nv_bfloat16
} | ||
|
||
template<typename T> | ||
_CCCL_DEVICE void to_32bitBuffer(T& var, cuda::std::int32_t numElements) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those need to be __to_32bitBuffer
and __from_32bitBuffer
Also all variables need to be uglyfied
Unnecessary Comments removed. Co-authored-by: Michael Schellenberger Costa <[email protected]>
Co-authored-by: Michael Schellenberger Costa <[email protected]>
Description
closes #2976
I have provided generic and safe C++ interface for warp shuffle (shuffle_sync only for now). The safety features include: (1) checking for allowable data types, (2) handling of variables that consists of 4 bytes (32 bits).
Soon, I will post the feature to handle 16 bit and 64 bit data types.
Provide generic and safe C++ interfaces for warp shuffle: Issue #2976
Checklist