Ray-native collective communication library #35311
Comments
Hi @valiantljk We've developed the Ray-native collective implementation, and we plan to contribute it to the community soon.
In the short term, we still need to maintain the gloo-based library. Are you hitting any issues in gloo mode?
Hi @valiantljk Regarding your large-scale actor-task invocation scenario: we have recently been developing a batch remote API to improve the performance of calling actor tasks in batches. We plan to build a Ray-native collective communication framework on top of it.
@jovany-wang Sounds great. I'd like to learn more about it. Is it the same as what @larrylian mentioned, i.e., will the Ray-native collective be based on the batch submission API?
It doesn't work; pygloo needs to be updated.
Any update on this native collective support?
@valiantljk |
@jovany-wang I see that you're a maintainer of that project. Could you please take a look? As it stands, I can't use the Ray collective communication functions.
Hi, I'm a bot from the Ray team :) To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months. If there is no further activity in the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public Slack channel.
Description
Ray collective supports both gloo and nccl as backends, and currently handles torch.Tensor, numpy.ndarray, and cupy.ndarray.
There are cases where users may not have the dependencies nccl requires, e.g., CUDA. As for gloo, the current binding, pygloo, does not appear to be well maintained: https://github.com/ray-project/pygloo.
The idea of Ray-native collective communication is to implement the collective primitives via Ray actor and task.
Users could use these libraries directly after `pip install ray`, and generic Python objects are expected to be supported by these APIs. The distributed object store can be leveraged, and node-local scheduling may also be explored, to achieve high-bandwidth, low-latency collective operations.
Adding this Ray-native implementation should significantly speed up the adoption of Ray collective in the long term. Users would then have the flexibility to choose either the Ray-native path or the existing Ray collective (with the gloo or nccl backend).
The implementation should be a superset of the current Ray collective, ideally exposed as another backend alongside the existing ones, i.e., gloo, nccl, and ray.
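To make "implement the collective primitives via Ray actor and task" concrete, here is a dependency-free sketch of a tree-based reduction over generic Python objects. This is not from the proposal itself: `tree_reduce` and its `fanout` parameter are hypothetical names, and plain function calls stand in for Ray remote tasks.

```python
import operator
from functools import reduce

def tree_reduce(values, op, fanout=2):
    """Level-by-level tree reduction (illustrative sketch).

    Each inner `reduce` over a chunk is independent of the others, so in
    a Ray-native backend each chunk reduction could be a remote task,
    giving O(log n) sequential depth instead of O(n). Here, ordinary
    function calls stand in for those tasks (assumption for illustration).
    """
    while len(values) > 1:
        # Reduce each group of `fanout` values; groups at the same tree
        # level have no data dependencies on each other.
        values = [reduce(op, values[i:i + fanout])
                  for i in range(0, len(values), fanout)]
    return values[0]

print(tree_reduce(list(range(1, 9)), operator.add))  # 36
```

Because the reduction operator is an arbitrary Python callable over arbitrary Python objects, this shape would naturally cover the "generic Python objects" goal above, unlike tensor-only backends.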
Use case
In @ray-project/deltacat, we have a use case where a single actor needs to handle a large volume of data from thousands or millions of tasks. We found this can be improved substantially by using distributed actors with an all-reduce operation. A Ray-native collective library, whether implemented tree-based or ring-based, would significantly simplify our implementation.
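For reference, the ring-based variant mentioned above can be sketched without any Ray dependency. The function name and structure below are hypothetical; each "send" is simulated by copying a chunk between plain Python lists, where a real Ray-native backend would presumably use actor method calls or object-store transfers instead.

```python
def ring_allreduce(ranks):
    """Ring all-reduce over plain Python lists (one list per simulated rank).

    Phase 1 (reduce-scatter): partial sums of each chunk travel once
    around the ring. Phase 2 (all-gather): the completed chunks travel
    around once more. Every rank ends up holding the elementwise sum.
    """
    n = len(ranks)
    size = len(ranks[0])
    assert size % n == 0, "for simplicity, assume size divisible by n"
    chunk = size // n
    bufs = [list(r) for r in ranks]

    def seg(c):
        # Slice covering chunk index c of a rank's buffer.
        return slice(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: at step t, rank r sends chunk (r - t) % n to rank
    # (r + 1) % n, which accumulates it. Snapshot all outgoing chunks
    # first to mimic the simultaneous sends of a real ring.
    for t in range(n - 1):
        out = [(r, (r - t) % n, list(bufs[r][seg((r - t) % n)]))
               for r in range(n)]
        for r, c, data in out:
            s = seg(c)
            for i, v in enumerate(data):
                bufs[(r + 1) % n][s.start + i] += v

    # All-gather: at step t, rank r forwards its completed chunk
    # (r + 1 - t) % n; receivers overwrite rather than accumulate.
    for t in range(n - 1):
        out = [((r + 1) % n, (r + 1 - t) % n, list(bufs[r][seg((r + 1 - t) % n)]))
               for r in range(n)]
        for dst, c, data in out:
            bufs[dst][seg(c)] = data

    return bufs

print(ring_allreduce([[1, 2, 3, 4], [10, 20, 30, 40]]))
# [[11, 22, 33, 44], [11, 22, 33, 44]]
```

The bandwidth argument for the ring shape is that each rank sends roughly `2 * (n - 1) / n` of its data in total, independent of the number of ranks, which is why it suits the large-volume scenario described here.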