Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vpa-recommender: Add support for configuring global max allowed resources #7560

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

ialidzhikov
Copy link
Contributor

@ialidzhikov ialidzhikov commented Dec 4, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

See #7147

Which issue(s) this PR fixes:

Fixes #7147

Special notes for your reviewer:

N/A

Does this PR introduce a user-facing change?

vpa-recommender does now support two new flags - `--container-recommendation-max-allowed-cpu` and `--container-recommendation-max-allowed-memory`. The flags make possible to configure the global max allowed resources the vpa-recommender can recommend for a container. The flags aim to address a known limitation of VPA (vpa-recommender, in particular) that it is not aware of the cluster's capacity and can make pod unschedulable after applying a recommendation.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 4, 2024
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 4, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ialidzhikov
Once this PR has been reviewed and has the lgtm label, please assign kwiesmueller for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 4, 2024
@ialidzhikov ialidzhikov force-pushed the enh/global-max-allowed-flags-post-processor branch 2 times, most recently from 9967539 to 0a79197 Compare December 4, 2024 14:02
@adrianmoisey
Copy link
Member

Sorry, but I've just gotten a PR merged (#7548) that will merge conflict with yours.
It's only documentation related.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 4, 2024
@ialidzhikov ialidzhikov force-pushed the enh/global-max-allowed-flags-post-processor branch from 0a79197 to af3a158 Compare December 4, 2024 14:29
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 4, 2024
ialidzhikov added a commit to ialidzhikov/gardener that referenced this pull request Dec 4, 2024
Comment on lines 224 to 229
var globalMaxAllowed apiv1.ResourceList
if !maxAllowedCPU.Quantity.IsZero() {
setGlobalMaxAllowed(&globalMaxAllowed, apiv1.ResourceCPU, maxAllowedCPU.Quantity)
}
if !maxAllowedMemory.Quantity.IsZero() {
setGlobalMaxAllowed(&globalMaxAllowed, apiv1.ResourceMemory, maxAllowedMemory.Quantity)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be some logic to ensure that the global max is greater than the global min?
It may be a bit strange since one is per container and the other per pod

Copy link
Contributor Author

@ialidzhikov ialidzhikov Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a perfect world, I agree that validation should exists.
However, I am not sure how to add such validation when the global min allowed flags are on Pod-level and global max allowed flags are on container-level.

In #7147 (comment) I suggested to deprecate the global Pod-level min allowed flags because:

  • they are on Pod-level, and not on container-level. The recommendation is on container-level. Right now, the recommender splits the Pod-level min allowed values to the number of containers in a Pod.
  • they are implemented as ResourceEstimator. This causes the VPA .status.uncappedTarget to be capped as well which is a contradiction of the definition of this field. This also causes the global min-allowed flags to overwrite the VPA minAllowed field (if specified), there is also no merge between the global Pod-level min allowed flags and the VPA's minAllowed field.

If you agree, I can open a dedicated issue for deprecated the global Pod-level min allowed flags and introduce new container-level min allowed equivalents. And validation can be added between the global container-level min and max allowed flags. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I am not sure how to add such validation when the global min allowed flags are on Pod-level and global max allowed flags are on container-level.

Yeah, I agree. I can't figure out a way to make the validation work.

In #7147 (comment) I suggested to deprecate the global Pod-level min allowed flags because:

I don't think deprecation is necessary yet. Pod level resources may be in the VPA's future, so those flags may be used for Pod level resources.

@@ -108,3 +109,16 @@ These options cannot be used together and are mutually exclusive.
It is possible to set the failurePolicy of the webhook to `Fail` by passing `--webhook-failure-policy-fail=true` to the VPA admission controller.
Please use this option with caution as it may be possible to break Pod creation if there is a failure with the VPA.
Using it in conjunction with `--ignored-vpa-object-namespaces=kube-system` or `--vpa-object-namespace` to reduce risk.

### Specifying global maximum allowed resources to prevent pods from being unschedulable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about moving this section move to the "features.md" page?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many sections in examples.md which actually describe features of VPA (Starting multiple recommenders, Custom memory bump-up after OOMKill, Using CPU management with static policy, Controlling eviction behavior based on scaling direction and resource, etc.). I don't think the newly introduced section is different from the existing section.
IMO, examples and features are overlapping conceptually. If you describe a feature, you usually also add example(s) of how the feature can be used. examples.md and features.md should be merged IMO. This is out-of-scope of the existing PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, examples and features are overlapping conceptually. If you describe a feature, you usually also add example(s) of how the feature can be used. examples.md and features.md should be merged IMO. This is out-of-scope of the existing PR.

I agree with you here. The features page is new, and I want to start moving the examples across to the features page, with more description about that feature.

I thought it would be nice for the "global max" feature to be added to features from the start, since it's a better fit there, and will require work to move at a later stage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(but it's up to you though, the documentation needs a lot of work in general)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to keep it consistent with the existing doc. In another PR all the examples doc can be reworked as features sections.

@adrianmoisey
Copy link
Member

Unsure if you want to do this, but an e2e test for the feature would be great!
If it's not something you want do, I'm happy to try figure it out in a different PR

@ialidzhikov ialidzhikov force-pushed the enh/global-max-allowed-flags-post-processor branch from 12f42ee to 04e3340 Compare December 13, 2024 12:33
@ialidzhikov
Copy link
Contributor Author

ialidzhikov commented Dec 13, 2024

Unsure if you want to do this, but an e2e test for the feature would be great!

Do you have a suggestion for the e2e test? It seems that we deploy vpa components to a kind cluster and then create/update/delete VPA objects and finally assert their state. It is relatively easy to test a field in the VPA spec. But I don't find example of e2e tests for such global flags/options.
The only option I see is to deploy VPA initially with the global max allowed flags. I am not sure if this is desired or not. It would also affect other recommender tests. Or do you suggest to patch the vpa-recommender deployment as part of the e2e test?

@ialidzhikov
Copy link
Contributor Author

ERROR: (gcloud.compute.networks.create) Could not fetch resource:
 - <!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 502 (Server Error)!!1</title>
  <style>
    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
  </style>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>502.</b> <ins>That’s an error.</ins>
  <p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.  <ins>That’s all we know.</ins>

/test pull-kubernetes-e2e-autoscaling-vpa-full

@adrianmoisey
Copy link
Member

Unsure if you want to do this, but an e2e test for the feature would be great!

Do you have a suggestion for the e2e test? It seems that we deploy vpa components to a kind cluster and then create/update/delete VPA objects and finally assert their state. It is relatively easy to test a field in the VPA spec. But I don't find example of e2e tests for such global flags/options. The only option I see is to deploy VPA initially with the global max allowed flags. I am not sure if this is desired or not. It would also affect other recommender tests. Or do you suggest to patch the vpa-recommender deployment as part of the e2e test?

It seems that at the moment we don't have a similar test to base yours off, so I think one would need to be written to deploy the recommender with a global max, and ensure that it doesn't go above that max.

It's going to be annoying to write such a test, I think, but I think the test will help us in the long run.

What do you think?

@adrianmoisey
Copy link
Member

I'm going to give this a
/lgtm

I like the idea of an e2e test, but will let an approver decide if it's needed or not

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 15, 2024
Copy link
Contributor

@raywainman raywainman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this!! Sorry for the delay!

vertical-pod-autoscaler/pkg/recommender/main.go Outdated Show resolved Hide resolved
vertical-pod-autoscaler/pkg/utils/vpa/capping.go Outdated Show resolved Hide resolved
} else {
// Set resources from the global maxAllowed if the VPA maxAllowed is missing them.
for resourceName, quantity := range globalMaxAllowed {
if _, ok := maxAllowed[resourceName]; !ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a comment here that we only override this if the user did not explicitly set a maximum in their container policy in the VPA.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already

// Set resources from the global maxAllowed if the VPA maxAllowed is missing them.
which is more or less stating the same thing.
Let me know if you have suggestions for improvements.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 18, 2024
@ialidzhikov
Copy link
Contributor Author

Hi @raywainman ,

I addressed your review feedback in caa47f9.

@adrianmoisey
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/vertical-pod-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

vpa-recommender: Make max allowed recommendation configurable
4 participants