-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core][autoscaler] Autoscaler doesn't scale up correctly when the KubeRay RayCluster is not in the goal state #48909
Conversation
Signed-off-by: kaihsun <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, this makes muc sense. We could merge this than #47967 then?
if worker in node_set: | ||
worker_groups_with_pending_deletes.add(node_type) | ||
break | ||
|
||
worker_groups_with_finished_deletes = ( | ||
worker_groups_with_deletes - worker_groups_with_pending_deletes | ||
) | ||
return worker_groups_with_pending_deletes, worker_groups_with_finished_deletes | ||
return ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: function signature as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename the function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed the function: 24ba62a
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, sorry, i meant the return types. it's currently saying it's returning a tuple of 2, while it's actually 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch! updated 2c9c3d0
"op": "replace", | ||
"path": "/spec/workerGroupSpecs/0/replicas", | ||
"value": desired_replicas + 1, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: is it possible to also add tests that to delete workers are handled correctly when calculating the goal state by the change for regression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added 24ba62a
Oops, I didn't realize that there is a similar PR for #46473. I think merging this PR is sufficient. This PR is more ready for merging from both testing and KubeRay perspectives. In addition, Autoscaler V2 is currently on the critical path for the KubeRay release. I prefer to go with this PR and I will communicate with the contributor (perhaps offering them other issues if they are interested in contributing). |
Signed-off-by: kaihsun <[email protected]>
Signed-off-by: kaihsun <[email protected]>
Signed-off-by: kaihsun <[email protected]>
Signed-off-by: kaihsun <[email protected]>
Signed-off-by: kaihsun <[email protected]>
…eRay RayCluster is not in the goal state (ray-project#48909) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? ### Issue * Create a Autoscaler V2 RayCluster CR. * head Pod: `num-cpus: 0` * worker Pod: Each worker Pod has 1 CPU, and the `maxReplicas` of the worker group is 10. * Run the following script in the head Pod: https://gist.github.com/kevin85421/6f09368ba48572e28f53654dca854b57 * There are 10 scale requests to add a new node. However, only some of them will be created (e.g., 5). ### Reason In the reproduction script above, the `cloud_instance_updater` will send a request to scale up one worker Pod 10 times because the `maxReplicas` of the worker group is set to 10. However, the construction of the scale_request depends on the Pods in the Kubernetes cluster. For example, * cluster state: RayCluster Replicas: 2, Ray Pods: 1 * 1st scale request: launch 1 node --> goal state: RayCluster Replicas: 2 (Ray Pods + 1) * 2nd scale request: launch 1 node --> goal state: RayCluster Replicas: 2 (Ray Pods + 1) --> **this should be 3!** The above example is expected to create 3 Pods. However, it will ultimately create only 2 Pods. ### Solution Use RayCluster CR instead of Ray Pods to build scale requests. ## Related issue number Closes ray-project#46473 ## Checks 10 worker Pods are created successfully. <img width="1373" alt="Screenshot 2024-11-24 at 2 11 39 AM" src="https://github.com/user-attachments/assets/c42c6cdd-3bf0-4aa9-a928-630c12ff5569"> - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: kaihsun <[email protected]> Signed-off-by: Connor Sanders <[email protected]>
…eRay RayCluster is not in the goal state (ray-project#48909) <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? ### Issue * Create a Autoscaler V2 RayCluster CR. * head Pod: `num-cpus: 0` * worker Pod: Each worker Pod has 1 CPU, and the `maxReplicas` of the worker group is 10. * Run the following script in the head Pod: https://gist.github.com/kevin85421/6f09368ba48572e28f53654dca854b57 * There are 10 scale requests to add a new node. However, only some of them will be created (e.g., 5). ### Reason In the reproduction script above, the `cloud_instance_updater` will send a request to scale up one worker Pod 10 times because the `maxReplicas` of the worker group is set to 10. However, the construction of the scale_request depends on the Pods in the Kubernetes cluster. For example, * cluster state: RayCluster Replicas: 2, Ray Pods: 1 * 1st scale request: launch 1 node --> goal state: RayCluster Replicas: 2 (Ray Pods + 1) * 2nd scale request: launch 1 node --> goal state: RayCluster Replicas: 2 (Ray Pods + 1) --> **this should be 3!** The above example is expected to create 3 Pods. However, it will ultimately create only 2 Pods. ### Solution Use RayCluster CR instead of Ray Pods to build scale requests. ## Related issue number Closes ray-project#46473 ## Checks 10 worker Pods are created successfully. <img width="1373" alt="Screenshot 2024-11-24 at 2 11 39 AM" src="https://github.com/user-attachments/assets/c42c6cdd-3bf0-4aa9-a928-630c12ff5569"> - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: kaihsun <[email protected]> Signed-off-by: hjiang <[email protected]>
Why are these changes needed?
Issue
num-cpus: 0
maxReplicas
of the worker group is 10.Reason
In the reproduction script above, the
cloud_instance_updater
will send a request to scale up one worker Pod 10 times because themaxReplicas
of the worker group is set to 10.However, the construction of the scale_request depends on the Pods in the Kubernetes cluster. For example,
The above example is expected to create 3 Pods. However, it will ultimately create only 2 Pods.
Solution
Use RayCluster CR instead of Ray Pods to build scale requests.
Related issue number
Closes #46473
Checks
10 worker Pods are created successfully.
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.