
[NVIDIA GPU] [XLA_GPU_MS_COLLECTIVE] Support round-robin runtime stream assignment #22450

Open
wants to merge 12 commits into main

Conversation

terryysun (Contributor)

With #19026, LHS can overlap appropriate async collectives without deadlock. This PR adds support at runtime where we leverage the overlapping schedule produced by LHS and perform a round-robin stream assignment for collectives.
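At a high level, round-robin assignment cycles collective-start ops over a small fixed pool of streams and keeps each done op on the stream of its start. A minimal standalone sketch of that idea, with illustrative names only (these are not the classes this PR adds to XLA):

    // Illustrative sketch only; types and names are placeholders, not this PR's code.
    #include <cstdint>
    #include <unordered_map>

    class RoundRobinCollectiveStreamAssigner {
     public:
      RoundRobinCollectiveStreamAssigner(int64_t num_collective_streams,
                                         int64_t first_stream_id)
          : num_streams_(num_collective_streams),
            first_stream_id_(first_stream_id) {}

      // Picks the next stream in the pool for an async collective start and
      // remembers the choice for the matching done.
      int64_t AssignStart(int64_t start_op_id) {
        int64_t stream = first_stream_id_ + next_;
        next_ = (next_ + 1) % num_streams_;
        assigned_[start_op_id] = stream;
        return stream;
      }

      // The done op must run on the same stream as its start.
      int64_t StreamForDone(int64_t start_op_id) const {
        return assigned_.at(start_op_id);
      }

     private:
      int64_t num_streams_;
      int64_t first_stream_id_;
      int64_t next_ = 0;
      std::unordered_map<int64_t, int64_t> assigned_;
    };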

@terryysun terryysun added the kokoro:force-run Forces CI to rerun label Feb 7, 2025
@kokoro-team kokoro-team removed the kokoro:force-run Forces CI to rerun label Feb 7, 2025
@terryysun terryysun requested a review from frgossen February 7, 2025 03:55
@terryysun terryysun changed the title from "[NVIDIA GPU] [Async Collective Multi-streaming] Support round-robin runtime stream assignment" to "[NVIDIA GPU] [XLA_GPU_MS_COLLECTIVE] Support round-robin runtime stream assignment" Feb 13, 2025
frgossen (Member) left a comment:

Thank you for working on this. Multiple streams for collectives will be great to have!

This is a big PR and I think at least the stream assignment, the runtime integration, and some util functions could be three separate PRs.

One thing I am wondering is if we could do this in the latency-hiding scheduler. The scheduler models resources and we could have one per collective stream. That way the scheduler would perform the stream assignment and compose well with it. I'm not saying that would be better but it might be worth thinking about since you seem to run into issues with the LHS (implied by some comments).
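For reference, "one resource per collective stream" would look roughly like this: the scheduler only lets a collective-start begin when a stream resource is free, and the index of the acquired resource doubles as the stream assignment. This is a generic sketch, not the actual latency-hiding scheduler or AsyncTracker API:

    // Generic sketch of per-stream resources; not XLA's scheduler API.
    #include <cstdint>
    #include <vector>

    class CollectiveStreamResources {
     public:
      explicit CollectiveStreamResources(int64_t num_streams)
          : in_use_(num_streams, false) {}

      // Returns a free stream index, or -1 if all streams are busy, in which
      // case the scheduler would delay the collective-start.
      int64_t TryAcquire() {
        for (int64_t i = 0; i < static_cast<int64_t>(in_use_.size()); ++i) {
          if (!in_use_[i]) {
            in_use_[i] = true;
            return i;
          }
        }
        return -1;
      }

      // Called when the matching collective-done is scheduled.
      void Release(int64_t index) { in_use_[index] = false; }

     private:
      std::vector<bool> in_use_;
    };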

 inline constexpr int64_t kAsyncStreamTotal =
-    static_cast<int64_t>(AsyncStreamKind::kMemCpyP2P) + 1;
+    std::max(static_cast<int64_t>(AsyncStreamKind::kMemCpyP2P) + 1, (int64_t)7);
Member:

Can you extract this 7 as a constant with a meaningful name?
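One possible shape, with hypothetical names derived from the "4 (max compute stream) + 2 (max collective stream) + 1" comment quoted below; the actual naming is up to the author, and this assumes the surrounding header's AsyncStreamKind and an <algorithm> include:

    // Hypothetical names; values taken from the code comment in this PR.
    inline constexpr int64_t kMaxComputeStreams = 4;
    inline constexpr int64_t kMaxCollectiveStreams = 2;
    // 4 compute streams + 2 collective streams + 1 default stream.
    inline constexpr int64_t kMinAsyncStreamTotal =
        kMaxComputeStreams + kMaxCollectiveStreams + 1;

    inline constexpr int64_t kAsyncStreamTotal =
        std::max(static_cast<int64_t>(AsyncStreamKind::kMemCpyP2P) + 1,
                 kMinAsyncStreamTotal);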

@@ -36,6 +36,21 @@ limitations under the License.
#include "xla/side_effect_util.h"

namespace xla::gpu {
namespace {
static bool is_async_collective(const HloInstruction* instruction) {
Member:

There may be something like this in collective_ops_utils.cc. If not, that could be a good place to add it.

@@ -36,6 +36,21 @@ limitations under the License.
#include "xla/side_effect_util.h"

namespace xla::gpu {
namespace {
Member:

No need for an anonymous namespace for static functions.

@@ -40,8 +40,10 @@ enum class AsyncStreamKind : int64_t {
kMemCpyP2P = 3, // Stream for MemCpyP2P
};

// Taking the maximum of max stream kind + 1 and 4 (max compute stream) + 2 (max
// collective stream) + 1;
Member:

This is not clear to me. Is kAsyncStreamTotal a static upper bound? How is that enforced when the number of collective streams can be set via the flag xla_gpu_experimental_parallel_collective_overlap_limit? Am I misunderstanding this?
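If kAsyncStreamTotal is meant as a static upper bound, one way to enforce it against the flag is an explicit check where the stream assignment is built. This is only a sketch: it assumes a Status-returning context, assumes the DebugOptions accessor follows the flag name, and kNumReservedCollectiveStreams is a hypothetical constant, not code from this PR.

    // Sketch only: reject flag values that exceed the statically reserved pool.
    const int64_t overlap_limit =
        debug_options.xla_gpu_experimental_parallel_collective_overlap_limit();
    TF_RET_CHECK(overlap_limit <= kNumReservedCollectiveStreams)
        << "xla_gpu_experimental_parallel_collective_overlap_limit="
        << overlap_limit << " exceeds the " << kNumReservedCollectiveStreams
        << " collective streams reserved in kAsyncStreamTotal.";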

std::make_unique<NcclThunkType>(thunk_info, instr, buffers);

const ExecutionStreamAssignment& stream_assignment =
ir_emitter_context.execution_stream_assignment();
Member:

Why is this passed separately? Can we add this to the thunks when emitting them? I think we do something like that for send and recv already.

static bool is_async_collective(const HloInstruction* instruction) {
if (instruction->IsAsynchronous()) {
auto opcode = instruction->async_wrapped_opcode();
return opcode == HloOpcode::kAllGather || opcode == HloOpcode::kAllReduce ||
Member:

I don't think this covers all collectives. If you move this to collective_ops_utils, we can test it too
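A fuller version, as it might look in collective_ops_utils; which opcodes should actually be multi-streamed is a design decision for the PR, and the exact opcode set available depends on the XLA version:

    // Sketch of a more complete check; whether each of these should be
    // multi-streamed is a separate question.
    bool IsAsyncCollective(const HloInstruction* instruction) {
      if (!instruction->IsAsynchronous()) return false;
      switch (instruction->async_wrapped_opcode()) {
        case HloOpcode::kAllGather:
        case HloOpcode::kAllReduce:
        case HloOpcode::kAllToAll:
        case HloOpcode::kCollectiveBroadcast:
        case HloOpcode::kCollectivePermute:
        case HloOpcode::kReduceScatter:
          return true;
        default:
          return false;
      }
    }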

@@ -45,7 +60,11 @@ ExecutionStreamAssignment::ExecutionStreamAssignment(
// on the entrypoint computation will be assigned `ExecutionStreamId(0)`, and
// each invocation of `async-start` will result in the target computation
// being assigned a new `ExecutionStreamId`.
-  ExecutionStreamId next_stream_id = ExecutionStreamId(1);
+  ExecutionStreamId next_compute_stream_id = ExecutionStreamId(1);
+  ExecutionStreamId next_collective_stream_id =
Member:

Why is the type of the next collective stream id ExecutionStreamId?

@@ -405,7 +405,19 @@ GpuThunkAotCompilationResult::LoadExecutable(
compiler->BufferSizeBytesFunction(),
/*can_share_buffer=*/nullptr));

-  ExecutionStreamAssignment execution_stream_assignment(hlo_module.get());
+  ExecutionStreamAssignment execution_stream_assignment(
Member:

This looks like duplicate code.

@@ -126,19 +126,43 @@ bool IsAsyncPair(const HloInstruction& from, const HloInstruction& target) {
return IsGpuAsyncStart(from) && IsGpuAsyncDone(target);
}

// Util function for getting replica groups from different data structures.
static std::vector<std::vector<int64_t>> get_replica_groups(
Member:

I think there is something like this in the collective ops utils. Can you move it there and test it separately?
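For reference, the kind of helper that could live in collective_ops_utils, unifying the two representations (ReplicaGroup lists for most collectives, source-target pairs for collective-permute). The name and the DynCast-based dispatch are assumptions, not this PR's code:

    // Sketch only: normalize replica groups across collective ops.
    std::vector<std::vector<int64_t>> GetReplicaGroupIdLists(
        const HloInstruction* instruction) {
      std::vector<std::vector<int64_t>> groups;
      if (auto* permute =
              DynCast<HloCollectivePermuteInstruction>(instruction)) {
        // Collective-permute stores (source, target) pairs rather than groups.
        for (const auto& pair : permute->source_target_pairs()) {
          groups.push_back({pair.first, pair.second});
        }
      } else if (auto* collective =
                     DynCast<HloCollectiveInstruction>(instruction)) {
        for (const ReplicaGroup& group : collective->replica_groups()) {
          groups.emplace_back(group.replica_ids().begin(),
                              group.replica_ids().end());
        }
      }
      return groups;
    }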

std::vector<int64_t> ids({pair.first, pair.second});
return ids;
});
} else {
Member:

Does this work for all collectives except collective-permute start? Can we test that?
