
Ray-native collective communication library #35311

Open
valiantljk opened this issue May 13, 2023 · 10 comments
@valiantljk
Contributor

Description

Ray collective supports both gloo and nccl as backends, and currently handles torch.Tensor, numpy.ndarray, and cupy.ndarray.

There are cases where users may not have the dependencies required for nccl (e.g., CUDA), and for gloo, the current pygloo does not seem to be well maintained: https://github.com/ray-project/pygloo.

The idea of Ray-native collective communication is to implement the collective primitives via Ray actors and tasks.
Users can use the library directly after `pip install ray`, and generic Python objects are expected to be supported by these APIs.

The distributed object store can be leveraged, and node-local scheduling may also be explored, to achieve high-bandwidth, low-latency collective operations.
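
A minimal sketch of what such a primitive could look like using only public Ray APIs; the `AllReduceGroup` actor and its method names are illustrative, not an existing Ray interface:

```python
import asyncio

import ray

ray.init()


@ray.remote
class AllReduceGroup:
    """Illustrative rendezvous actor for a Ray-native allreduce.

    Each member contributes one value; once all world_size contributions
    have arrived, every caller receives the reduced result. Values travel
    through Ray's distributed object store, so any picklable Python object
    (numbers, lists, numpy arrays, ...) works, not just tensors.
    """

    def __init__(self, world_size):
        self.world_size = world_size
        self.values = []
        self.result = None
        self.done = None  # created lazily inside the actor's event loop

    async def allreduce(self, value):
        if self.done is None:
            self.done = asyncio.Event()
        self.values.append(value)
        if len(self.values) == self.world_size:
            # A real library would let callers choose the op (sum/min/max/custom).
            self.result = sum(self.values)
            self.done.set()
        await self.done.wait()
        return self.result


@ray.remote
def worker(rank, group):
    # Each worker contributes its local value and gets the global sum back.
    return ray.get(group.allreduce.remote(rank + 1))


group = AllReduceGroup.remote(world_size=4)
print(ray.get([worker.remote(r, group) for r in range(4)]))  # [10, 10, 10, 10]
```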

Adding this Ray-native implementation will significantly speed up the adoption of Ray collective in the long term. Users should have the flexibility to use either the Ray-native implementation or the existing Ray collective (with the gloo or nccl backend).

The implementation should be a superset of the current Ray collective, ideally exposed as just another backend alongside the existing ones, i.e., gloo, nccl, ray.
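
Usage could then stay identical to today's `ray.util.collective` calls, with only the backend string changing. In the sketch below, `"gloo"` and `"nccl"` are the backends that exist today (the gloo path needs a working pygloo install), while `"ray"` stands for the proposed Ray-native backend and is not yet available:

```python
import numpy as np
import ray
import ray.util.collective as col

ray.init()


@ray.remote
class Worker:
    def __init__(self):
        self.buf = np.ones((4,), dtype=np.float32)

    def setup(self, world_size, rank, backend):
        # backend="gloo" or "nccl" works today; backend="ray" would be the
        # Ray-native backend proposed in this issue (hypothetical).
        col.init_collective_group(world_size, rank, backend=backend, group_name="default")

    def compute(self):
        col.allreduce(self.buf, group_name="default")
        return self.buf


workers = [Worker.remote() for _ in range(2)]
ray.get([w.setup.remote(2, i, "gloo") for i, w in enumerate(workers)])
print(ray.get([w.compute.remote() for w in workers]))  # each buffer becomes [2., 2., 2., 2.]
```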

Use case

In @ray-project/deltacat, we have a use case where a single actor needs to handle a large volume of data from thousands or millions of tasks. We found that this can be improved substantially by using distributed actors and an all-reduce operation. A Ray-native collective library, whether tree-based or ring-based, would significantly simplify our implementation.
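
For concreteness, a minimal tree-reduce over task outputs with plain Ray tasks might look like the sketch below (illustrative code, not deltacat's); a Ray-native collective library would wrap this pattern behind a single reduce/allreduce call:

```python
import numpy as np
import ray

ray.init()


@ray.remote
def produce(i):
    # Stand-in for one of the thousands of upstream tasks.
    return np.full(4, i, dtype=np.int64)


@ray.remote
def reduce_pair(a, b):
    # One node of the reduction tree; it can be scheduled anywhere in the
    # cluster and pulls its two inputs via the distributed object store.
    return a + b


refs = [produce.remote(i) for i in range(1000)]

# Pairwise tree reduction: O(log n) depth, so no single actor has to
# ingest all 1000 results directly.
while len(refs) > 1:
    pairs = [refs[i:i + 2] for i in range(0, len(refs), 2)]
    refs = [reduce_pair.remote(*p) if len(p) == 2 else p[0] for p in pairs]

print(ray.get(refs[0]))  # elementwise sum of all produced arrays
```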

@valiantljk valiantljk added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 13, 2023
@jovany-wang
Contributor

Hi @valiantljk, we've developed a Ray-native collective implementation, and we plan to contribute it to the community soon.

@larrylian @jiaodong @gjoliver CC

@jovany-wang
Contributor

But in the short term, we still need to maintain the gloo-based lib. Are you hitting any issues in gloo mode?

@jovany-wang jovany-wang added ray-collective-lib and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 13, 2023
@jovany-wang jovany-wang self-assigned this May 13, 2023
@larrylian
Contributor

Hi @valiantljk, regarding your large-scale actor task invocation scenario, we have recently been developing a batch remote API to improve the performance of calling actor tasks in batches. We plan to build a Ray-native collective communication framework on top of it.
This batch remote API should also be helpful for deltacat.
ray-project/enhancements#31
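
For context, a sketch of the call pattern the batch remote API targets; the `batch_remote` line at the end is hypothetical and shown only to indicate the intent, see the REP above for the actual design:

```python
import ray

ray.init()


@ray.remote
class Shard:
    def __init__(self):
        self.total = 0

    def apply(self, delta):
        # Stand-in for per-actor work.
        self.total += delta
        return self.total


# Thousands of shards in the real scenario; kept small here.
shards = [Shard.remote() for _ in range(8)]

# Current idiom: one task submission per actor, all issued from a Python loop.
refs = [s.apply.remote(1) for s in shards]
print(ray.get(refs))  # [1, 1, 1, 1, 1, 1, 1, 1]

# Hypothetical batch form (see ray-project/enhancements#31 for the real design):
# refs = ray.experimental.batch_remote(shards).apply.remote(1)
```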

@valiantljk
Contributor Author

> Hi @valiantljk, we've developed a Ray-native collective implementation, and we plan to contribute it to the community soon.
>
> @larrylian @jiaodong @gjoliver CC

@jovany-wang sounds great. I’d like to learn more about it. Is it the same as what @larrylian mentioned, where the Ray-native collective will be built on the batch submission API?

@valiantljk
Contributor Author

> But in the short term, we still need to maintain the gloo-based lib. Are you hitting any issues in gloo mode?

It doesn’t work; pygloo needs to be updated.

@jovany-wang
Contributor

jovany-wang commented May 15, 2023

@valiantljk

  1. The REP from @larrylian is for the underlying implementation in Ray core for the broadcast/gather primitives; other primitives may not need it (see the sketch below).
  2. At the high level in the ray collective lib, in the short term we'll support 3 modes, NCCL, GLOO, and RAY_NATIVE, which should be easy to switch between. We may deprecate the gloo mode in the future.
  3. Please let me know the issue in pygloo, and we'll fix it ASAP.
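
As a point of reference for item 1, a user-level broadcast is already expressible with just the object store: the payload is put once and every receiver resolves the same `ObjectRef` (illustrative sketch below); the core-level work is presumably about making such primitives efficient at scale:

```python
import numpy as np
import ray

ray.init()


@ray.remote
class Receiver:
    def recv(self, arr):
        # Ray resolves the ObjectRef argument before this runs, pulling the
        # array from the object store to this receiver's node (at most one
        # copy per node).
        return float(arr.sum())


payload = np.ones((1_000_000,), dtype=np.float32)
ref = ray.put(payload)  # stored once in the distributed object store

receivers = [Receiver.remote() for _ in range(4)]
# "Broadcast": every receiver is handed the same ObjectRef.
print(ray.get([r.recv.remote(ref) for r in receivers]))  # [1000000.0, 1000000.0, ...]
```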

@valiantljk
Contributor Author

Any update on this Ray-native collective support?

@larrylian
Contributor

@valiantljk
I'm pushing this feature forward. However, it will take a long time to fully support it.
For now, the plan is to design and implement several simple operators for Ray's native collective communication once Ray's batch remote API is implemented.

@terraflops1048576
Contributor

@jovany-wang I see that you're a maintainer of the pygloo repository. I filed an issue describing my difficulty getting pygloo installed for Ray 2.24.0 on Python 3.11.

Could you please take a look? Without it, I can't use the Ray collective communication functions.


stale bot commented Jan 22, 2025

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jan 22, 2025