-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PTX shfl_sync
#3241
base: main
Are you sure you want to change the base?
PTX shfl_sync
#3241
Conversation
🟩 CI finished in 1h 49m: Pass: 100%/170 | Total: 3d 02h | Avg: 26m 12s | Max: 1h 08m | Hits: 76%/22526
|
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
CUB | |
Thrust | |
CUDA Experimental | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 170)
# | Runner |
---|---|
125 | linux-amd64-cpu16 |
19 | linux-amd64-gpu-v100-latest-1 |
15 | windows-amd64-cpu16 |
10 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave this a quick review. I would love to have @ahendriksen's opinion, since it touches his work on the PTX exposure. Also, he has a way better PTX understanding than me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have sent some comments in private as well. The data
parameter should be a template parameter to allow shuffling any 32-bit value.
Co-authored-by: Bernhard Manfred Gruber <[email protected]>
🟩 CI finished in 1h 37m: Pass: 100%/170 | Total: 2d 17h | Avg: 23m 17s | Max: 1h 05m | Hits: 82%/22529
|
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
CUB | |
Thrust | |
CUDA Experimental | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 170)
# | Runner |
---|---|
125 | linux-amd64-cpu16 |
19 | linux-amd64-gpu-v100-latest-1 |
15 | windows-amd64-cpu16 |
10 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
@ahendriksen @miscco I modified the return type and added the predicate as an output parameter in the last commit |
🟨 CI finished in 2h 59m: Pass: 98%/164 | Total: 3d 03h | Avg: 27m 26s | Max: 1h 13m | Hits: 434%/15316
|
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
CUB | |
Thrust | |
CUDA Experimental | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 164)
# | Runner |
---|---|
122 | linux-amd64-cpu16 |
19 | linux-amd64-gpu-v100-latest-1 |
12 | windows-amd64-cpu16 |
10 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
🟨 CI finished in 1h 37m: Pass: 99%/164 | Total: 1d 01h | Avg: 9m 19s | Max: 1h 11m | Hits: 536%/17656
|
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
CUB | |
Thrust | |
CUDA Experimental | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 164)
# | Runner |
---|---|
122 | linux-amd64-cpu16 |
19 | linux-amd64-gpu-v100-latest-1 |
12 | windows-amd64-cpu16 |
10 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
🟩 CI finished in 13h 56m: Pass: 100%/164 | Total: 1d 00h | Avg: 9m 00s | Max: 37m 07s | Hits: 536%/17656
|
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
CUB | |
Thrust | |
CUDA Experimental | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 164)
# | Runner |
---|---|
122 | linux-amd64-cpu16 |
19 | linux-amd64-gpu-v100-latest-1 |
12 | windows-amd64-cpu16 |
10 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to make this easier to absorb into the generated cuda::ptx code:
- Please create a different function for each shuffle mode, i.e.,
shfl_sync_up, shfl_sync_down, ...
. - Use an output parameter (
bool& pred
) for the predicate - Create an overload for each function that does not take the output argument
I agree with all of these changes, but I'm concerned that with the automatic generation of shuffle instructions, we'll lose all the checks that are quite useful in practice. |
🟨 CI finished in 2h 42m: Pass: 97%/164 | Total: 2d 16h | Avg: 23m 27s | Max: 1h 13m | Hits: 452%/17656
|
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
CUB | |
Thrust | |
CUDA Experimental | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 164)
# | Runner |
---|---|
122 | linux-amd64-cpu16 |
19 | linux-amd64-gpu-v100-latest-1 |
12 | windows-amd64-cpu16 |
10 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
🟩 CI finished in 3h 06m: Pass: 100%/164 | Total: 2d 17h | Avg: 23m 56s | Max: 1h 06m | Hits: 458%/17656
|
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
CUB | |
Thrust | |
CUDA Experimental | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 164)
# | Runner |
---|---|
122 | linux-amd64-cpu16 |
19 | linux-amd64-gpu-v100-latest-1 |
12 | windows-amd64-cpu16 |
10 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check for consistency with the PTX docs.
I see that there are some higher-level helper functions for shfl_sync as well. I am not opposed to including those. But we need a clean separation. @bernhardmgruber : what do you think would be good strategy here?
🟩 CI finished in 2h 13m: Pass: 100%/164 | Total: 2d 15h | Avg: 23m 15s | Max: 1h 07m | Hits: 452%/17656
|
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
CUB | |
Thrust | |
CUDA Experimental | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 164)
# | Runner |
---|---|
122 | linux-amd64-cpu16 |
19 | linux-amd64-gpu-v100-latest-1 |
12 | windows-amd64-cpu16 |
10 | linux-arm64-cpu16 |
1 | linux-amd64-gpu-h100-latest-1-testing |
Related to #2976
Description
Provide C++ implementation of PTX
shfl_sync
.In addition to CUDA intrinsics, the function provide the following features: