Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide generic and safe C++ interfaces for warp shuffle: Issue #2976 #3210

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

soumikiith
Copy link

@soumikiith soumikiith commented Dec 20, 2024

Description

closes #2976

I have provided generic and safe C++ interface for warp shuffle (shuffle_sync only for now). The safety features include: (1) checking for allowable data types, (2) handling of variables that consists of 4 bytes (32 bits).
Soon, I will post the feature to handle 16 bit and 64 bit data types.

Provide generic and safe C++ interfaces for warp shuffle: Issue #2976

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@soumikiith soumikiith requested review from a team as code owners December 20, 2024 13:01
Copy link

copy-pr-bot bot commented Dec 20, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@fbusato
Copy link
Contributor

fbusato commented Dec 20, 2024

thanks for the contribution, @soumikiith. I have a couple of initial comments.

  • cmath provides a set of mathematical operations, while warp shuffles are about data movement. I would create another header cuda/shuffle.
  • you don't need to handle all data types one by one, or by size. My suggestion is to create an array of uint32_t and then use memcpy. Even better if you find a way to use bit_cast.

@fbusato
Copy link
Contributor

fbusato commented Dec 20, 2024

I updated #2976 to better formalize the features and checks of these functions

@soumikiith
Copy link
Author

soumikiith commented Dec 21, 2024

One Question:

While computing laneid, can I use modulo operator ? Or is the preferable way to fetch it directly from assembly using asm instructions?

Note that my doubt is only in the context of shfl_up and shfl_down.

Also, why does a mask value need to be passed (I know that the default value is assigned) in shfl_xor? Is not passing lanemask sufficient ?

@fbusato
Copy link
Contributor

fbusato commented Dec 23, 2024

While computing laneid, can I use modulo operator ? Or is the preferable way to fetch it directly from assembly using asm instructions?

you can use C++ API for PTX, see https://nvidia.github.io/cccl/libcudacxx/ptx/instructions/special_registers.html#laneid

Also, why does a mask value need to be passed (I know that the default value is assigned) in shfl_xor? Is not passing lanemask sufficient ?

Referring to the official documentation, laneMask and mask have different meaning. mask represents the active lanes, while laneMask is the value to apply to the XOR operator, i.e. laneid() ^ laneMask

@soumikiith
Copy link
Author

Hi, I have added the checks (I need to fix the assertion statements, though). Please check them and let me know if this is meeting your expected requirements. I will soon commit the casting of different data types using memcpy.

Please let me know of any additional requirements.

@soumikiith
Copy link
Author

Hi,
I have added the code to do the __shfl operations for various data types. Please let me know if anything is to be added or if anything is flawed. I will happily revise my code.

Merry Christmas !!

Copy link
Contributor

@fbusato fbusato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please also add the related tests

_LIBCUDACXX_BEGIN_NAMESPACE_CUDA
template <typename T>
constexpr bool is_supported_type_v = false;
template <> constexpr bool is_supported_type_v<int> = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important. Please don't specialize for a fixed set of types. shuffle needs to work with any trivially copyable (and construcible) data type

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the documentation the warp level instructions only accept a set of arithmetic types https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-functions

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we can get away with

template<class _Tp>
_CCCL_INLINE_VAR constexpr bool __can_warp_shuffle_v = (_CUDA_VSTD::is_arithmetic_v<_Tp> && sizeof(_Tp) >= sizeof(int)) || _CUDA_VSTD::__is_extended_floating_point_v<_Tp>

Not that

  • This is a nonpublic helper so it needs to be __ugly that is also true for the template arguments
  • This is only valid if _CCCL_HAS_NO_VARIABLE_TEMPLATES is defined, otherwise you need to define a struct

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is very common in CUDA to use warp shuffle to move types outside of the standard accepted types. Many libraries provide their own method for moving generic types. It makes sense to extend these functions for any trivially copyable type

libcudacxx/include/cuda/__shuffle/safe_shuffle.h Outdated Show resolved Hide resolved

//Input validation for shuffle operations
void _CCCL_DEVICE validate_shuffle_inputs(int width, unsigned mask)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion. I would call this function validate_width_mask

//Input validation for shuffle operations
void _CCCL_DEVICE validate_shuffle_inputs(int width, unsigned mask)
{
_CCCL_ASSERT((width <= warpSize), "Width must not exceed warp size"); // width must not exceed warp size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: width must be greater or equal than zero

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please drop the additional comments. The wording of the assert should suffice

#endif

//Input validation for shuffle operations
void _CCCL_DEVICE validate_shuffle_inputs(int width, unsigned mask)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: Please use fixed-size integers provided by cuda/std/cstdint, e.g. ::cuda::std::uint32_t

{
_CCCL_ASSERT((width <= warpSize), "Width must not exceed warp size"); // width must not exceed warp size
_CCCL_ASSERT((mask & __activemask()) == mask, "Mask must be a subset of the active mask"); // mask must be a subset of __activemask()
_CCCL_ASSERT((width > 0 && (width & (width - 1)) == 0), "Width must be a power of two"); // width must be a power of two
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion. Please use ::cuda::std::has_single_bit function instead

_CCCL_ASSERT(is_supported_type_v<T>, "T must be a supported type for warp shuffle operations"); // T must be a supported type for warp shuffle operations
validate_shuffle_inputs(width, mask); // validate inputs (width and mask)
_CCCL_ASSERT((srcLane >= 0 && srcLane < width), "srcLane must be in the range [0, width)"); // srcLane must be in the range [0, width)
//scrLane mustbe part of mask
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mustbe -> must be

//implement the logic for shfl
uint32_t buffer[sizeof(T) / sizeof(uint32_t)+1];
int numElements;
to_32bitBuffer(var, buffer, numElements);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion. This logic can be greatly improved by using the copy semantic of cuda::std::array returned by to_32bitBuffer

_CCCL_DEVICE void to_32bitBuffer(T& var, uint32_t* outArray, int& numElements)
{
constexpr size_t typeSize = sizeof(T);
constexpr int elements = (typeSize + sizeof(uint32_t) - 1) / sizeof(uint32_t);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion. Use cuda::ceil_div

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also use cuda::std::bit_cast

uint32_t buffer[sizeof(T) / sizeof(uint32_t)+1];
int numElements;
to_32bitBuffer(var, buffer, numElements);
for(int i=0;i<numElements;i++)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important. All loops with fixed number of iterations should be marked with #pragma unroll

@@ -0,0 +1,161 @@

#ifndef _CUDA_FUNCTIONAL_SHUFFLE_SAFETY_H
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if we could rename this to warp_shuffle.h or just warp.h, because that describes the intrinsics better

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree. I like warp_shuffle.h

…::std::standard int instead of int. 3) Improved logic for to_32bitBuffer. 4) Used #pragma unroll before every loop. 5) Used cuda::std::array instead of normal array declarations and improved logic.
Comment on lines 15 to 17
#include <cuda/std/type_traits>
#include <cuda/std/bit>
#include <cuda/std/memory>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include the relevant subheaders only

#include <cuda/std/__cmath/nvfp16.h>
#include <cuda/std/__cmath/nvbf16.h>

#define _CCCL_HAS_CUDA_COMPILER 1 //fix for now -- to be deleted later
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this needs to go

Comment on lines 20 to 21
#include <cuda/std/__cmath/nvfp16.h>
#include <cuda/std/__cmath/nvbf16.h>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe those are the wrong includes, we probably only want to know whether something is an extended floating point type

_LIBCUDACXX_BEGIN_NAMESPACE_CUDA
template <typename T>
constexpr bool is_supported_type_v = false;
template <> constexpr bool is_supported_type_v<int> = true;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we can get away with

template<class _Tp>
_CCCL_INLINE_VAR constexpr bool __can_warp_shuffle_v = (_CUDA_VSTD::is_arithmetic_v<_Tp> && sizeof(_Tp) >= sizeof(int)) || _CUDA_VSTD::__is_extended_floating_point_v<_Tp>

Not that

  • This is a nonpublic helper so it needs to be __ugly that is also true for the template arguments
  • This is only valid if _CCCL_HAS_NO_VARIABLE_TEMPLATES is defined, otherwise you need to define a struct

//Input validation for shuffle operations
void _CCCL_DEVICE validate_shuffle_inputs(int width, unsigned mask)
{
_CCCL_ASSERT((width <= warpSize), "Width must not exceed warp size"); // width must not exceed warp size
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please drop the additional comments. The wording of the assert should suffice

Comment on lines +1 to +3



Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is missing our license header

@@ -0,0 +1,161 @@

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is missing the license header

Comment on lines +29 to +30
template <typename T>
constexpr bool is_supported_type = cuda::std::is_trivially_copyable<T>::value;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be named something like __can_warp_shuffle. Also this also needs to

  • exclude types smaller than int
  • allow extended floating point types like __half and __nv_bfloat16

libcudacxx/include/cuda/__shuffle/warp_shuffle.h Outdated Show resolved Hide resolved
}

template<typename T>
_CCCL_DEVICE void to_32bitBuffer(T& var, cuda::std::int32_t numElements)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those need to be __to_32bitBuffer and __from_32bitBuffer

Also all variables need to be uglyfied

soumikiith and others added 3 commits January 6, 2025 14:10
Unnecessary Comments removed.

Co-authored-by: Michael Schellenberger Costa <[email protected]>
Co-authored-by: Michael Schellenberger Costa <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[FEA]: Provide generic and safe C++ interfaces for warp shuffle
3 participants