
[core][autoscaler] Autoscaler doesn't scale up correctly when the KubeRay RayCluster is not in the goal state #48909

Merged
6 commits merged on Nov 26, 2024

Conversation

kevin85421 (Member) commented Nov 24, 2024

Why are these changes needed?

Issue

  • Create an Autoscaler V2 RayCluster CR.
    • head Pod: num-cpus: 0
    • worker Pods: each worker Pod has 1 CPU, and the maxReplicas of the worker group is 10.
  • Run the following script in the head Pod (a minimal sketch of such a workload appears below): https://gist.github.com/kevin85421/6f09368ba48572e28f53654dca854b57
  • The script triggers 10 scale requests, each adding one node. However, only some of the requested Pods are actually created (e.g., only 5).
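The gist above is the authoritative reproduction; the following is only a minimal sketch of the kind of workload that would produce 10 separate scale-up requests under these settings (the task body and names are hypothetical):

    import time
    import ray

    ray.init()

    @ray.remote(num_cpus=1)
    def hold_cpu():
        # Hold the CPU long enough that each task needs its own 1-CPU worker Pod.
        time.sleep(600)
        return "done"

    # With a 0-CPU head Pod, 1-CPU workers, and maxReplicas=10, these 10 pending
    # tasks should cause Autoscaler V2 to request 10 worker Pods.
    refs = [hold_cpu.remote() for _ in range(10)]
    print(ray.get(refs))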

Reason

In the reproduction script above, the cloud_instance_updater will send a request to scale up one worker Pod 10 times because the maxReplicas of the worker group is set to 10.

However, the scale_request is constructed from the Pods currently present in the Kubernetes cluster rather than from the RayCluster CR. For example:

  • cluster state: RayCluster Replicas: 2, Ray Pods: 1
    • 1st scale request: launch 1 node --> goal state: RayCluster Replicas: 2 (Ray Pods + 1)
    • 2nd scale request: launch 1 node --> goal state: RayCluster Replicas: 2 (Ray Pods + 1) --> this should be 3!

The above example should eventually have 3 Pods, but only 2 are ever created: the second request recomputes the goal from the live Pod count (1 + 1 = 2) and ignores that the RayCluster CR already requests 2 replicas. The sketch below makes the off-by-one concrete.
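A minimal sketch of the miscalculation, with hypothetical variable names (the real Autoscaler V2 code paths differ):

    # State from the example above.
    ray_pods = ["worker-a"]   # only 1 worker Pod exists so far
    cr_replicas = 2           # the RayCluster CR already asks for 2 workers

    # Buggy: each scale request rebuilds the goal state from the live Pod count.
    goal_after_first_request = len(ray_pods) + 1    # 2 -- happens to match the CR
    goal_after_second_request = len(ray_pods) + 1   # still 2, but it should be 3

    # Fixed: accumulate on top of the replicas already declared in the CR.
    goal_after_second_request_fixed = cr_replicas + 1   # 3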

Solution

Use the RayCluster CR instead of the Ray Pods to build scale requests, so that consecutive requests accumulate on the replicas already declared in the CR. An illustrative sketch follows.
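For illustration only, here is a minimal sketch of reading the worker group's replicas from the RayCluster CR and patching it up by the number of nodes to launch. The use of the official kubernetes Python client, the ray.io/v1 API version, and the helper name scale_up_worker_group are assumptions for the sketch, not the provider's actual implementation:

    from kubernetes import client, config

    config.load_incluster_config()  # assumes this runs inside the cluster
    api = client.CustomObjectsApi()
    GROUP, VERSION, PLURAL = "ray.io", "v1", "rayclusters"

    def scale_up_worker_group(namespace, cluster_name, group_index, num_to_launch):
        """Raise the worker group's replicas in the RayCluster CR by num_to_launch."""
        cr = api.get_namespaced_custom_object(
            GROUP, VERSION, namespace, PLURAL, cluster_name
        )
        # Base the new goal on the CR's declared replicas, not on the live Pod count.
        current = cr["spec"]["workerGroupSpecs"][group_index].get("replicas", 0)
        patch = [{
            "op": "replace",
            "path": f"/spec/workerGroupSpecs/{group_index}/replicas",
            "value": current + num_to_launch,
        }]
        # Recent versions of the Python client send a list body as a JSON patch.
        api.patch_namespaced_custom_object(
            GROUP, VERSION, namespace, PLURAL, cluster_name, patch
        )

Because current is read from spec.workerGroupSpecs[...].replicas, a second request issued before the new Pod appears still computes 3 rather than overwriting the goal back to 2.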

Related issue number

Closes #46473

Checks

10 worker Pods are created successfully.

[Screenshot 2024-11-24 at 2:11:39 AM: all 10 worker Pods created]
  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: kaihsun <[email protected]>
@kevin85421 kevin85421 marked this pull request as ready for review November 24, 2024 10:37
@kevin85421 kevin85421 requested review from hongchaodeng and a team as code owners November 24, 2024 10:37
rickyyx (Contributor) left a comment

Ah, this makes much sense. We could merge this instead of #47967 then?

        if worker in node_set:
            worker_groups_with_pending_deletes.add(node_type)
            break

    worker_groups_with_finished_deletes = (
        worker_groups_with_deletes - worker_groups_with_pending_deletes
    )
    return worker_groups_with_pending_deletes, worker_groups_with_finished_deletes
    return (
Contributor

nit: function signature as well.

Member Author

rename the function?

Member Author

renamed the function: 24ba62a

Contributor

Oops, sorry, I meant the return types. The signature currently says it returns a tuple of 2, while it actually returns 3.

Member Author

good catch! updated 2c9c3d0

"op": "replace",
"path": "/spec/workerGroupSpecs/0/replicas",
"value": desired_replicas + 1,
}
Contributor

nit: is it possible to also add a regression test checking that to-be-deleted workers are handled correctly when calculating the goal state with this change?

Member Author

added 24ba62a

kevin85421 (Member, Author) commented Nov 24, 2024

We could merge this instead of #47967 then?

Oops, I didn't realize there was a similar PR for #46473. I think merging this PR is sufficient: it is closer to being ready to merge from both the testing and KubeRay perspectives, and Autoscaler V2 is currently on the critical path for the KubeRay release. I prefer to go with this PR, and I will communicate with the other contributor (perhaps offering them other issues if they are interested in contributing).

Signed-off-by: kaihsun <[email protected]>
Signed-off-by: kaihsun <[email protected]>
@kevin85421 kevin85421 added the 'go' label (add ONLY when ready to merge, run all tests) on Nov 25, 2024
@rickyyx rickyyx enabled auto-merge (squash) November 25, 2024 20:32
@kevin85421 kevin85421 assigned kevin85421 and unassigned rickyyx Nov 25, 2024
Signed-off-by: kaihsun <[email protected]>
@github-actions github-actions bot disabled auto-merge November 26, 2024 05:18
Signed-off-by: kaihsun <[email protected]>
Signed-off-by: kaihsun <[email protected]>
@rickyyx rickyyx merged commit ed3d48c into ray-project:master Nov 26, 2024
5 checks passed
jecsand838 pushed a commit to jecsand838/ray that referenced this pull request Dec 4, 2024
dentiny pushed a commit to dentiny/ray that referenced this pull request Dec 7, 2024
Labels
go add ONLY when ready to merge, run all tests
Development

Successfully merging this pull request may close these issues.

[Ray autoscaler v2] Can't scaler up when using autoscaler v2
2 participants