Provide generic and safe C++ interfaces for warp shuffle: Issue #2976 #3210

soumikiith · 2024-12-20T13:01:30Z

Description

I have provided a generic and safe C++ interface for warp shuffle (shuffle_sync only for now). The safety features include: (1) checking for allowable data types, (2) handling of variables that consist of 4 bytes (32 bits).
Soon, I will post the feature to handle 16 bit and 64 bit data types.

Provide generic and safe C++ interfaces for warp shuffle: Issue #2976

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2024-12-20T13:01:35Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

fbusato · 2024-12-20T17:35:46Z

thanks for the contribution, @soumikiith. I have a couple of initial comments.

cmath provides a set of mathematical operations, while warp shuffles are about data movement. I would create another header cuda/shuffle.
you don't need to handle all data types one by one, or by size. My suggestion is to create an array of uint32_t and then use memcpy. Even better if you find a way to use bit_cast.
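The memcpy suggestion above can be sketched on the host in plain C++. This is only an illustration, not the code from this PR: `word_count`, `to_words`, `from_words`, and `Particle` are hypothetical names, and the sketch assumes the type is trivially copyable and default constructible.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Number of 32-bit words needed to hold a T, rounded up.
template <class T>
constexpr std::size_t word_count = (sizeof(T) + sizeof(std::uint32_t) - 1) / sizeof(std::uint32_t);

// Hypothetical helper: copy the object representation of `value` into 32-bit words.
template <class T>
std::array<std::uint32_t, word_count<T>> to_words(const T& value)
{
  static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
  std::array<std::uint32_t, word_count<T>> words{}; // zero-fill the tail of the last word
  std::memcpy(words.data(), &value, sizeof(T));
  return words; // returned by value, so callers can rely on array copy semantics
}

// Hypothetical helper: rebuild a T from the words produced by to_words.
template <class T>
T from_words(const std::array<std::uint32_t, word_count<T>>& words)
{
  static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
  static_assert(std::is_default_constructible<T>::value, "T must be default constructible");
  T value;
  std::memcpy(&value, words.data(), sizeof(T));
  return value;
}

// Example payload: 12 bytes on common platforms, so it needs 3 words.
struct Particle
{
  float x;
  float y;
  std::int32_t id;
};
```

In device code each word would then go through one 32-bit shuffle, so a single code path covers 16-bit, 32-bit, 64-bit, and struct types.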

fbusato · 2024-12-20T18:47:37Z

I updated #2976 to better formalize the features and checks of these functions

soumikiith · 2024-12-21T06:08:45Z

One Question:

While computing laneid, can I use the modulo operator? Or is it preferable to fetch it directly using asm instructions?

Note that my question only concerns shfl_up and shfl_down.

Also, why does a mask value need to be passed to shfl_xor (I know that a default value is assigned)? Isn't passing laneMask sufficient?

fbusato · 2024-12-23T17:25:20Z

> While computing laneid, can I use the modulo operator? Or is it preferable to fetch it directly using asm instructions?

You can use the C++ API for PTX, see https://nvidia.github.io/cccl/libcudacxx/ptx/instructions/special_registers.html#laneid

> Also, why does a mask value need to be passed to shfl_xor (I know that a default value is assigned)? Isn't passing laneMask sufficient?

Referring to the official documentation, laneMask and mask have different meanings. mask represents the active lanes, while laneMask is the value applied with the XOR operator to select the source lane, i.e. laneid() ^ laneMask
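To make the distinction concrete, here is a small host-side model of an XOR shuffle in plain C++. It is a sketch under simplifying assumptions (a full 32-lane warp, and lanes outside the member mask keep their own value), and `simulate_shfl_xor`, `butterfly_sum_lane0`, `Warp`, and `kWarpSize` are hypothetical names, not CUDA or libcu++ APIs.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWarpSize = 32;
using Warp = std::array<int, kWarpSize>;

// Host-side model: each participating lane reads the value held by lane (lane ^ lane_mask).
// `member_mask` plays the role of `mask`: one bit per participating lane.
// `lane_mask` plays the role of `laneMask`: the XOR distance to the source lane.
Warp simulate_shfl_xor(const Warp& values, int lane_mask, std::uint32_t member_mask = 0xFFFFFFFFu)
{
  assert(lane_mask >= 0 && lane_mask < kWarpSize);
  Warp result = values; // in this model, non-participating lanes keep their own value
  for (int lane = 0; lane < kWarpSize; ++lane)
  {
    if (((member_mask >> lane) & 1u) == 0)
    {
      continue;
    }
    const int src = lane ^ lane_mask;
    assert(((member_mask >> src) & 1u) != 0 && "source lane must also participate");
    result[lane] = values[src];
  }
  return result;
}

// Butterfly sum: after log2(32) = 5 exchange steps every lane holds the total.
int butterfly_sum_lane0(Warp values)
{
  for (int lane_mask = kWarpSize / 2; lane_mask > 0; lane_mask /= 2)
  {
    const Warp other = simulate_shfl_xor(values, lane_mask);
    for (int lane = 0; lane < kWarpSize; ++lane)
    {
      values[lane] += other[lane];
    }
  }
  return values[0];
}
```

With `lane_mask = 1` neighbouring lanes swap values, while the member mask only decides who takes part in the exchange.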

soumikiith · 2024-12-24T07:37:14Z

Hi, I have added the checks (I need to fix the assertion statements, though). Please check them and let me know if this is meeting your expected requirements. I will soon commit the casting of different data types using memcpy.

Please let me know of any additional requirements.

soumikiith · 2024-12-25T05:42:47Z

Hi,
I have added the code to do the __shfl operations for various data types. Please let me know if anything is to be added or if anything is flawed. I will happily revise my code.

Merry Christmas !!

fbusato

please also add the related tests

fbusato · 2024-12-30T16:54:17Z

libcudacxx/include/cuda/__shuffle/safe_shuffle.h

+_LIBCUDACXX_BEGIN_NAMESPACE_CUDA
+    template <typename T>
+    constexpr bool is_supported_type_v = false;
+    template <> constexpr bool is_supported_type_v<int> = true;


Important. Please don't specialize for a fixed set of types. shuffle needs to work with any trivially copyable (and constructible) data type

According to the documentation the warp level instructions only accept a set of arithmetic types https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-functions

I believe we can get away with

template<class _Tp> _CCCL_INLINE_VAR constexpr bool __can_warp_shuffle_v = (_CUDA_VSTD::is_arithmetic_v<_Tp> && sizeof(_Tp) >= sizeof(int)) || _CUDA_VSTD::__is_extended_floating_point_v<_Tp>;

Note that

This is a nonpublic helper, so it needs to be __ugly; that is also true for the template arguments

This is only valid if variable templates are available, i.e. _CCCL_HAS_NO_VARIABLE_TEMPLATES is not defined; otherwise you need to define a struct

It is very common in CUDA to use warp shuffle to move types outside of the standard accepted types. Many libraries provide their own method for moving generic types. It makes sense to extend these functions for any trivially copyable type
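A struct-based trait for the "any trivially copyable type" direction could look roughly like the following host-side sketch. It uses the standard library rather than `_CUDA_VSTD`, readable names instead of the reserved `__ugly` ones, and it assumes that "constructible" means default constructible; `can_warp_shuffle`, `Vec3`, and `NonTrivial` are hypothetical.

```cpp
#include <cstdint>
#include <type_traits>

// Struct form of the trait: usable even where variable templates are unavailable.
// Assumption: a shufflable type is trivially copyable and default constructible.
template <class T>
struct can_warp_shuffle
    : std::integral_constant<bool,
                             std::is_trivially_copyable<T>::value && std::is_default_constructible<T>::value>
{};

// Plain aggregate: trivially copyable, so it qualifies.
struct Vec3
{
  float x;
  float y;
  float z;
};

// A user-provided copy constructor makes the type non-trivially copyable, so it is rejected.
struct NonTrivial
{
  NonTrivial() = default;
  NonTrivial(const NonTrivial& other)
      : value(other.value)
  {}
  int value = 0;
};

static_assert(can_warp_shuffle<int>::value, "arithmetic types qualify");
static_assert(can_warp_shuffle<Vec3>::value, "plain aggregates qualify");
static_assert(!can_warp_shuffle<NonTrivial>::value, "non-trivially copyable types are rejected");
```

Compared with the arithmetic-only trait quoted above, this one places no lower bound on `sizeof(T)`, so small types would have to be widened to a 32-bit word by the implementation.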

fbusato · 2024-12-30T16:57:10Z

libcudacxx/include/cuda/__shuffle/safe_shuffle.h

+
+    //Input validation for shuffle operations
+    void _CCCL_DEVICE validate_shuffle_inputs(int width, unsigned mask)
+    {


Suggestion. I would call this function validate_width_mask

fbusato · 2024-12-30T16:57:45Z

libcudacxx/include/cuda/__shuffle/safe_shuffle.h

+    //Input validation for shuffle operations
+    void _CCCL_DEVICE validate_shuffle_inputs(int width, unsigned mask)
+    {
+        _CCCL_ASSERT((width <= warpSize), "Width must not exceed warp size"); // width must not exceed warp size


Important: width must be greater than or equal to zero

Please drop the additional comments. The wording of the assert should suffice

fbusato · 2024-12-30T16:59:35Z

libcudacxx/include/cuda/__shuffle/safe_shuffle.h

+    #endif
+
+    //Input validation for shuffle operations
+    void _CCCL_DEVICE validate_shuffle_inputs(int width, unsigned mask)


Important: Please use fixed-size integers provided by cuda/std/cstdint, e.g. ::cuda::std::uint32_t

fbusato · 2024-12-30T17:00:48Z

libcudacxx/include/cuda/__shuffle/safe_shuffle.h

+    {
+        _CCCL_ASSERT((width <= warpSize), "Width must not exceed warp size"); // width must not exceed warp size
+        _CCCL_ASSERT((mask & __activemask()) == mask, "Mask must be a subset of the active mask"); // mask must be a subset of __activemask()
+        _CCCL_ASSERT((width > 0 && (width & (width - 1)) == 0), "Width must be a power of two"); // width must be a power of two


Suggestion. Please use ::cuda::std::has_single_bit function instead
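The width and mask checks discussed in this thread can be modelled on the host as below. This is a sketch, not the PR's code: `is_power_of_two` is a hand-written stand-in for the single-bit test that `has_single_bit` performs, the active mask is passed in as a parameter because `__activemask()` only exists on the device, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kWarpSize = 32;

// Stand-in for has_single_bit: true when exactly one bit is set.
constexpr bool is_power_of_two(std::uint32_t x)
{
  return x != 0 && (x & (x - 1)) == 0;
}

// Width must be a power of two in the range [1, warp size].
constexpr bool is_valid_width(int width)
{
  return width > 0 && width <= kWarpSize && is_power_of_two(static_cast<std::uint32_t>(width));
}

// Every lane named in `mask` must be currently active.
constexpr bool is_valid_mask(std::uint32_t mask, std::uint32_t active_mask)
{
  return (mask & active_mask) == mask;
}

// Runtime check in the spirit of the suggested validate_width_mask.
inline void validate_width_mask(int width, std::uint32_t mask, std::uint32_t active_mask)
{
  assert(is_valid_width(width) && "width must be a power of two in [1, 32]");
  assert(is_valid_mask(mask, active_mask) && "mask must be a subset of the active mask");
  (void) width;
  (void) mask;
  (void) active_mask; // silence unused-parameter warnings when asserts are compiled out
}
```

Checking the sign before the cast matters: a negative width converted to an unsigned type could otherwise pass the single-bit test.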

fbusato · 2024-12-30T17:02:23Z

libcudacxx/include/cuda/__shuffle/safe_shuffle.h

+        _CCCL_ASSERT(is_supported_type_v<T>, "T must be a supported type for warp shuffle operations"); // T must be a supported type for warp shuffle operations
+        validate_shuffle_inputs(width, mask); // validate inputs (width and mask)
+        _CCCL_ASSERT((srcLane >= 0 && srcLane < width), "srcLane must be in the range [0, width)"); // srcLane must be in the range [0, width)
+        //scrLane mustbe part of mask


mustbe -> must be

fbusato · 2024-12-30T17:13:12Z

libcudacxx/include/cuda/__shuffle/safe_shuffle.h

+        //implement the logic for shfl
+        uint32_t buffer[sizeof(T) / sizeof(uint32_t)+1];
+        int numElements;
+        to_32bitBuffer(var, buffer, numElements);


Suggestion. This logic can be greatly improved by using the copy semantics of cuda::std::array returned by to_32bitBuffer

fbusato · 2024-12-30T17:13:28Z

libcudacxx/include/cuda/__shuffle/safe_shuffle.h

+    _CCCL_DEVICE void to_32bitBuffer(T& var, uint32_t* outArray, int& numElements)
+    {
+        constexpr size_t typeSize = sizeof(T);
+        constexpr int elements = (typeSize + sizeof(uint32_t) - 1) / sizeof(uint32_t);


Suggestion. Use cuda::ceil_div

also use cuda::std::bit_cast
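As a host-side illustration of the rounding involved, the sketch below uses a hand-written `ceil_div_sketch` as a stand-in for `cuda::ceil_div` and compares it with the `sizeof(T) / sizeof(uint32_t) + 1` buffer size used earlier in the diff. All names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for cuda::ceil_div on unsigned operands: smallest q with q * b >= a.
constexpr std::size_t ceil_div_sketch(std::size_t a, std::size_t b)
{
  return (a + b - 1) / b;
}

// Number of 32-bit words needed to hold a T.
template <class T>
constexpr std::size_t words_for()
{
  return ceil_div_sketch(sizeof(T), sizeof(std::uint32_t));
}

// Buffer size from the diff under review: sizeof(T) / 4 + 1.
constexpr std::size_t original_buffer_words(std::size_t size)
{
  return size / sizeof(std::uint32_t) + 1;
}

static_assert(words_for<std::uint8_t>() == 1, "1 byte still occupies one word");
static_assert(words_for<std::uint32_t>() == 1, "4 bytes fit exactly");
static_assert(words_for<std::uint64_t>() == 2, "8 bytes need two words");

// When sizeof(T) is a multiple of 4, the original formula reserves one word too many.
static_assert(original_buffer_words(8) == 3, "one more than the 2 words actually needed");
static_assert(original_buffer_words(6) == ceil_div_sketch(6, 4), "the two agree for other sizes");
```

The extra word is harmless for correctness but would cost an additional shuffle per call if the loop bound were derived from the buffer size.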

fbusato · 2024-12-30T17:14:26Z

libcudacxx/include/cuda/__shuffle/safe_shuffle.h

+        uint32_t buffer[sizeof(T) / sizeof(uint32_t)+1];
+        int numElements;
+        to_32bitBuffer(var, buffer, numElements);
+        for(int i=0;i<numElements;i++)


Important. All loops with a fixed number of iterations should be marked with #pragma unroll

miscco · 2025-01-02T12:18:28Z

libcudacxx/include/cuda/__shuffle/safe_shuffle.h

@@ -0,0 +1,161 @@
+
+#ifndef _CUDA_FUNCTIONAL_SHUFFLE_SAFETY_H


I would prefer if we could rename this to warp_shuffle.h or just warp.h, because that describes the intrinsics better

agree. I like warp_shuffle.h

miscco · 2025-01-02T12:18:39Z

libcudacxx/include/cuda/__shuffle/safe_shuffle.h

+#include <cuda/std/type_traits>
+#include <cuda/std/bit>
+#include <cuda/std/memory>


Please include the relevant subheaders only

miscco · 2025-01-02T12:19:35Z

libcudacxx/include/cuda/__shuffle/safe_shuffle.h

+#include <cuda/std/__cmath/nvfp16.h>
+#include <cuda/std/__cmath/nvbf16.h>
+
+#define _CCCL_HAS_CUDA_COMPILER 1 //fix for now -- to be deleted later


Yeah this needs to go

miscco · 2025-01-02T12:20:05Z

libcudacxx/include/cuda/__shuffle/safe_shuffle.h

+#include <cuda/std/__cmath/nvfp16.h>
+#include <cuda/std/__cmath/nvbf16.h>


I believe those are the wrong includes; we probably only want to know whether something is an extended floating point type

miscco · 2025-01-02T12:23:16Z

libcudacxx/include/cuda/__shuffle/safe_shuffle.h

+_LIBCUDACXX_BEGIN_NAMESPACE_CUDA
+    template <typename T>
+    constexpr bool is_supported_type_v = false;
+    template <> constexpr bool is_supported_type_v<int> = true;


I believe we can get away with

template<class _Tp> _CCCL_INLINE_VAR constexpr bool __can_warp_shuffle_v = (_CUDA_VSTD::is_arithmetic_v<_Tp> && sizeof(_Tp) >= sizeof(int)) || _CUDA_VSTD::__is_extended_floating_point_v<_Tp>;

Note that

This is a nonpublic helper, so it needs to be __ugly; that is also true for the template arguments

This is only valid if variable templates are available, i.e. _CCCL_HAS_NO_VARIABLE_TEMPLATES is not defined; otherwise you need to define a struct

miscco · 2025-01-02T12:25:21Z

libcudacxx/include/cuda/__shuffle/safe_shuffle.h

+    //Input validation for shuffle operations
+    void _CCCL_DEVICE validate_shuffle_inputs(int width, unsigned mask)
+    {
+        _CCCL_ASSERT((width <= warpSize), "Width must not exceed warp size"); // width must not exceed warp size


Please drop the additional comments. The wording of the assert should suffice

miscco · 2025-01-02T12:26:59Z

libcudacxx/include/cuda/shuffle

+
+
+


This is missing our license header

miscco · 2025-01-02T12:27:11Z

libcudacxx/include/cuda/__shuffle/safe_shuffle.h

@@ -0,0 +1,161 @@
+


This is missing the license header

miscco · 2025-01-06T08:19:10Z

libcudacxx/include/cuda/__shuffle/warp_shuffle.h

+    template <typename T>
+    constexpr bool is_supported_type = cuda::std::is_trivially_copyable<T>::value;


Needs to be named something like __can_warp_shuffle. Also, this needs to exclude types smaller than int and allow extended floating point types like __half and __nv_bfloat16

miscco · 2025-01-06T08:20:12Z

libcudacxx/include/cuda/__shuffle/warp_shuffle.h

+    }
+
+    template<typename T>
+    _CCCL_DEVICE void to_32bitBuffer(T& var, cuda::std::int32_t numElements)


Those need to be __to_32bitBuffer and __from_32bitBuffer

Also, all variables need to be uglified

Provide generic and safe C++ interfaces for warp shuffle: Issue NVIDI…

22344f4

…A#2976

soumikiith requested review from a team as code owners December 20, 2024 13:01

soumikiith requested review from wmaxey and alliepiper December 20, 2024 13:01

soumikiith force-pushed the main branch from 23b1242 to 22344f4 Compare December 24, 2024 07:17

soumikiith and others added 3 commits December 24, 2024 12:48

Fix: Moved the contents to a new location for better locality, and ad…

a998c6c

…ded extra supports for checks.

Merge branch 'main' into main

94a55dc

Merge remote-tracking branch 'origin/main'

764d4c4

Added shfl operations using memcpy.

599f49a

fbusato requested changes Dec 30, 2024

miscco reviewed Jan 2, 2025

Fix: 1) changed name : safe_shuffle.h -> warp_shuffle.h. 2) Used cuda…

f5d7adc

…::std::standard int instead of int. 3) Improved logic for to_32bitBuffer. 4) Used #pragma unroll before every loop. 5) Used cuda::std::array instead of normal array declarations and improved logic.

miscco reviewed Jan 6, 2025

soumikiith and others added 3 commits January 6, 2025 14:10

Update libcudacxx/include/cuda/__shuffle/warp_shuffle.h

30ce7c4

Unnecessary Comments removed. Co-authored-by: Michael Schellenberger Costa <[email protected]>

Merge branch 'NVIDIA:main' into main

dc92de5

Update libcudacxx/include/cuda/shuffle

386028a

Co-authored-by: Michael Schellenberger Costa <[email protected]>
