
[core][autoscaler] Autoscaler doesn't scale up correctly when the KubeRay RayCluster is not in the goal state #48909

Merged
6 commits merged on Nov 26, 2024

Conversation

kevin85421 (Member) commented Nov 24, 2024

Why are these changes needed?

Issue

  • Create an Autoscaler V2 RayCluster CR.
    • head Pod: num-cpus: 0
    • worker Pods: each worker Pod has 1 CPU, and the maxReplicas of the worker group is 10.
  • Run the following script in the head Pod (a minimal sketch of such a workload appears below): https://gist.github.com/kevin85421/6f09368ba48572e28f53654dca854b57
  • The script triggers 10 scale requests, each adding one node. However, only some of the requested Pods are actually created (e.g., only 5).
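The gist above is the authoritative reproduction; the following is only a minimal sketch of the kind of workload that would produce 10 separate scale-up requests under these settings (the task body and names are hypothetical):

    import time
    import ray

    ray.init()

    @ray.remote(num_cpus=1)
    def hold_cpu():
        # Hold the CPU long enough that each task needs its own 1-CPU worker Pod.
        time.sleep(600)
        return "done"

    # With a 0-CPU head Pod, 1-CPU workers, and maxReplicas=10, these 10 pending
    # tasks should cause Autoscaler V2 to request 10 worker Pods.
    refs = [hold_cpu.remote() for _ in range(10)]
    print(ray.get(refs))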

Reason

In the reproduction script above, the cloud_instance_updater will send a request to scale up one worker Pod 10 times because the maxReplicas of the worker group is set to 10.

However, the scale_request is constructed from the Pods currently present in the Kubernetes cluster rather than from the RayCluster CR. For example:

  • cluster state: RayCluster Replicas: 2, Ray Pods: 1
    • 1st scale request: launch 1 node --> goal state: RayCluster Replicas: 2 (Ray Pods + 1)
    • 2nd scale request: launch 1 node --> goal state: RayCluster Replicas: 2 (Ray Pods + 1) --> this should be 3!

The above example should eventually have 3 Pods, but only 2 are ever created: the second request recomputes the goal from the live Pod count (1 + 1 = 2) and ignores that the RayCluster CR already requests 2 replicas. The sketch below makes the off-by-one concrete.
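A minimal sketch of the miscalculation, with hypothetical variable names (the real Autoscaler V2 code paths differ):

    # State from the example above.
    ray_pods = ["worker-a"]   # only 1 worker Pod exists so far
    cr_replicas = 2           # the RayCluster CR already asks for 2 workers

    # Buggy: each scale request rebuilds the goal state from the live Pod count.
    goal_after_first_request = len(ray_pods) + 1    # 2 -- happens to match the CR
    goal_after_second_request = len(ray_pods) + 1   # still 2, but it should be 3

    # Fixed: accumulate on top of the replicas already declared in the CR.
    goal_after_second_request_fixed = cr_replicas + 1   # 3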

Solution

Use the RayCluster CR instead of the Ray Pods to build scale requests, so that consecutive requests accumulate on the replicas already declared in the CR. An illustrative sketch follows.
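For illustration only, here is a minimal sketch of reading the worker group's replicas from the RayCluster CR and patching it up by the number of nodes to launch. The use of the official kubernetes Python client, the ray.io/v1 API version, and the helper name scale_up_worker_group are assumptions for the sketch, not the provider's actual implementation:

    from kubernetes import client, config

    config.load_incluster_config()  # assumes this runs inside the cluster
    api = client.CustomObjectsApi()
    GROUP, VERSION, PLURAL = "ray.io", "v1", "rayclusters"

    def scale_up_worker_group(namespace, cluster_name, group_index, num_to_launch):
        """Raise the worker group's replicas in the RayCluster CR by num_to_launch."""
        cr = api.get_namespaced_custom_object(
            GROUP, VERSION, namespace, PLURAL, cluster_name
        )
        # Base the new goal on the CR's declared replicas, not on the live Pod count.
        current = cr["spec"]["workerGroupSpecs"][group_index].get("replicas", 0)
        patch = [{
            "op": "replace",
            "path": f"/spec/workerGroupSpecs/{group_index}/replicas",
            "value": current + num_to_launch,
        }]
        # Recent versions of the Python client send a list body as a JSON patch.
        api.patch_namespaced_custom_object(
            GROUP, VERSION, namespace, PLURAL, cluster_name, patch
        )

Because current is read from spec.workerGroupSpecs[...].replicas, a second request issued before the new Pod appears still computes 3 rather than overwriting the goal back to 2.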

Related issue number

Closes #46473

Checks

10 worker Pods are created successfully.

[Screenshot 2024-11-24 at 2:11:39 AM: all 10 worker Pods created]
  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: kaihsun <[email protected]>
@kevin85421 kevin85421 marked this pull request as ready for review November 24, 2024 10:37
@kevin85421 kevin85421 requested review from hongchaodeng and a team as code owners November 24, 2024 10:37
rickyyx (Contributor) left a comment

Ah, this makes much sense. We could merge this instead of #47967 then?

        if worker in node_set:
            worker_groups_with_pending_deletes.add(node_type)
            break

    worker_groups_with_finished_deletes = (
        worker_groups_with_deletes - worker_groups_with_pending_deletes
    )
    return worker_groups_with_pending_deletes, worker_groups_with_finished_deletes
    return (
Contributor

nit: function signature as well.

Member Author

rename the function?

Member Author

renamed the function: 24ba62a

Contributor

Oops, sorry, I meant the return types. The signature currently says it returns a tuple of 2, while it actually returns 3.

Member Author

good catch! updated 2c9c3d0

"op": "replace",
"path": "/spec/workerGroupSpecs/0/replicas",
"value": desired_replicas + 1,
}
Contributor

nit: is it possible to also add a regression test checking that to-be-deleted workers are handled correctly when calculating the goal state with this change?

Member Author

added 24ba62a

kevin85421 (Member, Author) commented Nov 24, 2024

We could merge this instead of #47967 then?

Oops, I didn't realize there was a similar PR for #46473. I think merging this PR is sufficient: it is closer to being ready to merge from both the testing and KubeRay perspectives, and Autoscaler V2 is currently on the critical path for the KubeRay release. I prefer to go with this PR, and I will communicate with the other contributor (perhaps offering them other issues if they are interested in contributing).

Signed-off-by: kaihsun <[email protected]>
Signed-off-by: kaihsun <[email protected]>
@kevin85421 kevin85421 added the 'go' label (add ONLY when ready to merge, run all tests) on Nov 25, 2024
@rickyyx rickyyx enabled auto-merge (squash) November 25, 2024 20:32
@kevin85421 kevin85421 assigned kevin85421 and unassigned rickyyx Nov 25, 2024
Signed-off-by: kaihsun <[email protected]>
@github-actions github-actions bot disabled auto-merge November 26, 2024 05:18
Signed-off-by: kaihsun <[email protected]>
Signed-off-by: kaihsun <[email protected]>
@rickyyx rickyyx merged commit ed3d48c into ray-project:master Nov 26, 2024
5 checks passed
jecsand838 pushed a commit to jecsand838/ray that referenced this pull request Dec 4, 2024
dentiny pushed a commit to dentiny/ray that referenced this pull request Dec 7, 2024
Labels
go add ONLY when ready to merge, run all tests
Development

Successfully merging this pull request may close these issues.

[Ray autoscaler v2] Can't scaler up when using autoscaler v2
2 participants