
Ray-native collective communication library #35311

Open
valiantljk opened this issue May 13, 2023 · 10 comments
@valiantljk
Contributor

Description

Ray collective supports both gloo and nccl as backends, and currently handles torch.Tensor, numpy.ndarray, and cupy.ndarray.

There are cases where users may not have the dependencies required for nccl (e.g., CUDA), and for gloo, the current pygloo does not seem to be well maintained: https://github.com/ray-project/pygloo.

The idea of Ray-native collective communication is to implement the collective primitives via Ray actors and tasks.
Users can use the library directly after `pip install ray`, and generic Python objects are expected to be supported by these APIs.

The distributed object store can be leveraged, and node-local scheduling may also be explored, to achieve high-bandwidth, low-latency collective operations.
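
A minimal sketch of what such a primitive could look like using only public Ray APIs; the `AllReduceGroup` actor and its method names are illustrative, not an existing Ray interface:

```python
import asyncio

import ray

ray.init()


@ray.remote
class AllReduceGroup:
    """Illustrative rendezvous actor for a Ray-native allreduce.

    Each member contributes one value; once all world_size contributions
    have arrived, every caller receives the reduced result. Values travel
    through Ray's distributed object store, so any picklable Python object
    (numbers, lists, numpy arrays, ...) works, not just tensors.
    """

    def __init__(self, world_size):
        self.world_size = world_size
        self.values = []
        self.result = None
        self.done = None  # created lazily inside the actor's event loop

    async def allreduce(self, value):
        if self.done is None:
            self.done = asyncio.Event()
        self.values.append(value)
        if len(self.values) == self.world_size:
            # A real library would let callers choose the op (sum/min/max/custom).
            self.result = sum(self.values)
            self.done.set()
        await self.done.wait()
        return self.result


@ray.remote
def worker(rank, group):
    # Each worker contributes its local value and gets the global sum back.
    return ray.get(group.allreduce.remote(rank + 1))


group = AllReduceGroup.remote(world_size=4)
print(ray.get([worker.remote(r, group) for r in range(4)]))  # [10, 10, 10, 10]
```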

Adding this Ray-native implementation will significantly speed up the adoption of Ray collective in the long term. Users should have the flexibility to use either the Ray-native implementation or the existing Ray collective (with the gloo or nccl backend).

The implementation should be a superset of the current Ray collective, ideally exposed as just another backend alongside the existing ones, i.e., gloo, nccl, ray.
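
Usage could then stay identical to today's `ray.util.collective` calls, with only the backend string changing. In the sketch below, `"gloo"` and `"nccl"` are the backends that exist today (the gloo path needs a working pygloo install), while `"ray"` stands for the proposed Ray-native backend and is not yet available:

```python
import numpy as np
import ray
import ray.util.collective as col

ray.init()


@ray.remote
class Worker:
    def __init__(self):
        self.buf = np.ones((4,), dtype=np.float32)

    def setup(self, world_size, rank, backend):
        # backend="gloo" or "nccl" works today; backend="ray" would be the
        # Ray-native backend proposed in this issue (hypothetical).
        col.init_collective_group(world_size, rank, backend=backend, group_name="default")

    def compute(self):
        col.allreduce(self.buf, group_name="default")
        return self.buf


workers = [Worker.remote() for _ in range(2)]
ray.get([w.setup.remote(2, i, "gloo") for i, w in enumerate(workers)])
print(ray.get([w.compute.remote() for w in workers]))  # each buffer becomes [2., 2., 2., 2.]
```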

Use case

In @ray-project/deltacat, we have a use case where a single actor needs to handle a large volume of data from thousands or millions of tasks. We found that this can be improved substantially by using distributed actors and an all-reduce operation. A Ray-native collective library, whether tree-based or ring-based, would significantly simplify our implementation.
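
For concreteness, a minimal tree-reduce over task outputs with plain Ray tasks might look like the sketch below (illustrative code, not deltacat's); a Ray-native collective library would wrap this pattern behind a single reduce/allreduce call:

```python
import numpy as np
import ray

ray.init()


@ray.remote
def produce(i):
    # Stand-in for one of the thousands of upstream tasks.
    return np.full(4, i, dtype=np.int64)


@ray.remote
def reduce_pair(a, b):
    # One node of the reduction tree; it can be scheduled anywhere in the
    # cluster and pulls its two inputs via the distributed object store.
    return a + b


refs = [produce.remote(i) for i in range(1000)]

# Pairwise tree reduction: O(log n) depth, so no single actor has to
# ingest all 1000 results directly.
while len(refs) > 1:
    pairs = [refs[i:i + 2] for i in range(0, len(refs), 2)]
    refs = [reduce_pair.remote(*p) if len(p) == 2 else p[0] for p in pairs]

print(ray.get(refs[0]))  # elementwise sum of all produced arrays
```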

@valiantljk valiantljk added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 13, 2023
@jovany-wang
Contributor

Hi @valiantljk, we've developed a Ray-native collective implementation, and we plan to contribute it to the community soon.

@larrylian @jiaodong @gjoliver CC

@jovany-wang
Contributor

But in the short term, we still need to maintain the gloo-based lib. Are you hitting any issues in gloo mode?

@jovany-wang jovany-wang added ray-collective-lib and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 13, 2023
@jovany-wang jovany-wang self-assigned this May 13, 2023
@larrylian
Contributor

Hi @valiantljk, regarding your large-scale actor task invocation scenario, we have recently been developing a batch remote API to improve the performance of calling actor tasks in batches. We plan to build a Ray-native collective communication framework on top of it.
This batch remote API should also be helpful for deltacat.
ray-project/enhancements#31
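
For context, a sketch of the call pattern the batch remote API targets; the `batch_remote` line at the end is hypothetical and shown only to indicate the intent, see the REP above for the actual design:

```python
import ray

ray.init()


@ray.remote
class Shard:
    def __init__(self):
        self.total = 0

    def apply(self, delta):
        # Stand-in for per-actor work.
        self.total += delta
        return self.total


# Thousands of shards in the real scenario; kept small here.
shards = [Shard.remote() for _ in range(8)]

# Current idiom: one task submission per actor, all issued from a Python loop.
refs = [s.apply.remote(1) for s in shards]
print(ray.get(refs))  # [1, 1, 1, 1, 1, 1, 1, 1]

# Hypothetical batch form (see ray-project/enhancements#31 for the real design):
# refs = ray.experimental.batch_remote(shards).apply.remote(1)
```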

@valiantljk
Contributor Author

> Hi @valiantljk, we've developed a Ray-native collective implementation, and we plan to contribute it to the community soon.
>
> @larrylian @jiaodong @gjoliver CC

@jovany-wang sounds great. I’d like to learn more about it. Is it the same as what @larrylian mentioned, where the Ray-native collective will be built on the batch submission API?

@valiantljk
Contributor Author

> But in the short term, we still need to maintain the gloo-based lib. Are you hitting any issues in gloo mode?

It doesn’t work; pygloo needs to be updated.

@jovany-wang
Contributor

jovany-wang commented May 15, 2023

@valiantljk

  1. The REP from @larrylian is for the underlying implementation in Ray core for the broadcast/gather primitives; other primitives may not need it (see the sketch below).
  2. At the high level in the ray collective lib, in the short term we'll support 3 modes, NCCL, GLOO, and RAY_NATIVE, which should be easy to switch between. We may deprecate the gloo mode in the future.
  3. Please let me know the issue in pygloo, and we'll fix it ASAP.
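
As a point of reference for item 1, a user-level broadcast is already expressible with just the object store: the payload is put once and every receiver resolves the same `ObjectRef` (illustrative sketch below); the core-level work is presumably about making such primitives efficient at scale:

```python
import numpy as np
import ray

ray.init()


@ray.remote
class Receiver:
    def recv(self, arr):
        # Ray resolves the ObjectRef argument before this runs, pulling the
        # array from the object store to this receiver's node (at most one
        # copy per node).
        return float(arr.sum())


payload = np.ones((1_000_000,), dtype=np.float32)
ref = ray.put(payload)  # stored once in the distributed object store

receivers = [Receiver.remote() for _ in range(4)]
# "Broadcast": every receiver is handed the same ObjectRef.
print(ray.get([r.recv.remote(ref) for r in receivers]))  # [1000000.0, 1000000.0, ...]
```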

@valiantljk
Contributor Author

Any update on this Ray-native collective support?

@larrylian
Contributor

@valiantljk
I'm pushing this feature forward. However, it will take a long time to fully support it.
For now, the plan is to design and implement several simple operators for Ray's native collective communication once Ray's batch remote API is implemented.

@terraflops1048576
Contributor

@jovany-wang I see that you're a maintainer of the pygloo repository. I filed an issue describing my difficulty getting pygloo installed for Ray 2.24.0 on Python 3.11.

Could you please take a look? Without it, I can't use the Ray collective communication functions.


stale bot commented Jan 22, 2025

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jan 22, 2025