VPA: prune stale container aggregates, split recommendations over true number of containers #6745

jkyros · 2024-04-22T15:39:57Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

Previously we weren't cleaning up "stale" aggregates when container names changed (because of renames, removals) and that was resulting in:

- VPAs showing recommendations for containers which no longer exist
- Resources being split across containers which no longer exist (resulting in some containers ending up with resource limits too small for them to effectively live)

There was also a corner case where, during a rollout after a container was renamed/removed from a deployment, we were counting the number of unique container names and not the actual number of containers in each pod, so we were splitting resources that shouldn't have been split.

This PR is an attempt to clean up those stale aggregates without incurring too much overhead, and make sure that the resources get spread across the correct number of containers during a rollout.
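
To make the rollout corner case concrete, here is a minimal sketch contrasting the two ways of counting. It is not the PR's code: the `pod` type and both helper functions are hypothetical names made up for illustration.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for the recommender's pod state:
// just the container names found in one pod.
type pod struct {
	containers []string
}

// uniqueContainerNames counts every distinct container name seen across
// all pods. While old and new pods coexist during a rollout, this
// over-counts renamed or removed containers.
func uniqueContainerNames(pods []pod) int {
	seen := make(map[string]struct{})
	for _, p := range pods {
		for _, name := range p.containers {
			seen[name] = struct{}{}
		}
	}
	return len(seen)
}

// maxContainersPerPod returns the largest container count of any single pod.
func maxContainersPerPod(pods []pod) int {
	highest := 0
	for _, p := range pods {
		if n := len(p.containers); n > highest {
			highest = n
		}
	}
	return highest
}

func main() {
	// Mid-rollout after renaming "sidecar" to "proxy": one old pod, one new pod.
	pods := []pod{
		{containers: []string{"app", "sidecar"}},
		{containers: []string{"app", "proxy"}},
	}
	fmt.Println("unique names:", uniqueContainerNames(pods))
	fmt.Println("containers per pod:", maxContainersPerPod(pods))
}
```

Splitting by the first number would divide the pod's resources three ways even though no pod ever runs more than two containers.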

Which issue(s) this PR fixes:

Fixes #6744

Special notes for your reviewer:

There are probably a lot of different ways we can do the pruning of stale aggregates for missing containers:

- I went with explicitly marking and sweeping them because it saved us an additional loop through all the pods and containers
- We could just as easily have a PruneAggregates() that runs after LoadPods() that goes through everything and removes them (or do this work as part of LoadPods() but that seems...deceptive?)
- We could probably also tweak the existing garbageCollectAggregateCollectionStates and run it immediately after LoadPods() every time but that might be expensive.

I'm not super-attached to any particular approach, I'd just like to fix this, so I can retool it if necessary.

If I am being ignorant, and there are corner cases I'm missing, absolutely let me know.
It probably needs some tests/cleanup, and I'll change the names of things to...whatever you want them to be. 😄

Does this PR introduce a user-facing change?

Added pruning of container aggregates and changed container math so resources will no longer be split across the wrong number of containers

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2024-04-22T15:40:07Z

Hi @jkyros. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kwiesmueller · 2024-04-29T17:53:57Z

vertical-pod-autoscaler/pkg/recommender/input/cluster_feeder.go

+	// TODO(jkyros): This only removes the container state from the VPA's aggregate states, there
+	// is still a reference to them in feeder.clusterState.aggregateStateMap, and those get
+	// garbage collected eventually by the rate limited aggregate garbage collector later.
+	// Maybe we should clean those up here too since we know which ones are stale?


Is it a lot of extra work to do that? Do you see any risks doing it here?

No, I don't think it's a lot of extra work, it should be reasonably cheap to clean them up here since it's just deletions from the other maps if the keys exist, I just didn't know all the history.

It seemed possible at least that we were intentionally waiting to clean up the aggregates so that, if there was an unexpected hiccup, we didn't just immediately blow away all that aggregate history we worked so hard to get. (Like maybe someone oopses, deletes their deployment, then puts it back? Right now we don't have to start over: the pods come back in, find their container aggregates, and resume. But if I clean them up here, we have to start over...)

kwiesmueller · 2024-04-29T17:58:31Z

vertical-pod-autoscaler/pkg/recommender/model/cluster.go

+// the correct number and not just the number of aggregates that have *ever* been present. (We don't want minimum resources
+// to erroneously shrink, either)
+func (cluster *ClusterState) setVPAContainersPerPod(pod *PodState) {
+	for _, vpa := range cluster.Vpas {


I'm wondering if there is already a place where this logic could go so we don't have to loop over all VPAs for every pod again here.
In large clusters with a VPA to Pod ratio that's closer to 1 this could be a little wasteful.

Hmm, yeah, I struggled with finding a less expensive way without making too much of a mess. Unless I'm missing something (and I might be) we don't seem to have a VPA <--> Pod map -- probably because we didn't need one until now? At the very least I think I should gate this to only run if the number of containers in the pod is > 1.

Like, I think our options are:

1. update the VPA as the pods roll through (which requires me to find the VPA for each pod like I did here), or

2. count the containers as we load the VPAs (but we load the VPAs before we load the pods, so we'd have to go through the pods again, so that doesn't help us), or

3. have the VPA actually track the pods it's managing, something like this: jkyros@6ddc208 (could also just be an array of PodIDs and we could look up the state so we could save the memory cost of the PodState pointer, but you know what I mean)

I put it where I did (option 1) because at least LoadPods() was already looping through all the pods so we could freeload off the "outer" pod loop and I figured we didn't want to spend the memory on option 3. If we'd entertain option 3 and are okay with the memory usage, I can totally do that?
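
For reference, a rough sketch of what the "array of PodIDs" flavor of option 3 could look like. This is not the code in jkyros@6ddc208: every type, field, and method name below is hypothetical and heavily simplified.

```go
package main

import "fmt"

// podID is a hypothetical, simplified pod identifier.
type podID struct {
	namespace, name string
}

// cluster is a hypothetical stand-in for the cluster state: it only
// records how many containers each known pod has.
type cluster struct {
	containerCount map[podID]int
}

// vpa tracks the IDs of the pods it manages instead of pointers to
// their state, trading a map lookup for lower memory use.
type vpa struct {
	pods map[podID]struct{}
}

// containersPerPod looks up each tracked pod and returns the highest
// container count, without scanning every VPA for every pod.
func (v *vpa) containersPerPod(c *cluster) int {
	highest := 0
	for id := range v.pods {
		if n := c.containerCount[id]; n > highest {
			highest = n
		}
	}
	return highest
}

func main() {
	oldPod := podID{"default", "hamster-old"}
	newPod := podID{"default", "hamster-new"}
	c := &cluster{containerCount: map[podID]int{oldPod: 2, newPod: 1}}
	v := &vpa{pods: map[podID]struct{}{oldPod: {}, newPod: {}}}
	fmt.Println("containers per pod:", v.containersPerPod(c))
}
```

The cost is one extra set of IDs per VPA, which has to be kept in sync as pods come and go.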

k8s-triage-robot · 2024-07-31T20:11:33Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-08-30T20:12:28Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle rotten
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

sreber84 · 2024-09-20T09:24:07Z

/remove-lifecycle rotten

adrianmoisey · 2024-12-03T11:41:10Z

/ok-to-test
/assign

I want to see if I can help get this merged

maxcao13 · 2024-12-03T21:33:48Z

Hi everyone, John has taken a hiatus and has left me with this PR. After catching up, I guess we are still waiting for those conversations to resolve which way we want to go with those design decisions. The two commits I just put up are just an improvement on the existing implementation (assuming we go with that; we don't have to), and some e2e tests to prove this works.

omerap12 · 2024-12-16T18:02:36Z

I think I figured out another edge-case that needs to be handled.

Sometimes in my day-job, when we're firefighting issues, we may scale a deployment down to zero to alleviate pressure on other parts of the system. I think this use-case is possibly common, tools such as https://keda.sh/ are built to allow users to scale workloads down to zero, to save costs.

During these times the recommendation would be removed. I think the grace period needs to apply to all resource types.

This should be able to work on all types as long as you set the field (unless I am misunderstanding you). Do you mean that we should default the gracePeriod to be very high (never expire) instead of 0? Maybe we also allow a global default gracePeriod set with a command line flag?

Yeah, I guess it makes sense to have a long default or keep the feature disabled and allow it to be opt-in.

I agree

maxcao13 · 2024-12-17T21:54:14Z

Okay, I've added a --pruning-grace-period-duration flag which defaults to 100 years meaning that aggregates will not expire which is the previous implementation before this PR (hopefully someone is not keeping a vpa-recommender container up for 100 years).

I removed any special handling of CronJob since the default pruning functionality should now be opt-in and not a breaking change.

EDIT: To make it clear for reviewers, this pruning of aggregate collection states/recommendations only applies to the state that is kept by each VPA. The state map that exists for the clusterState as a whole is unaffected by this change because that state gets GC'd on an hourly basis

autoscaler/vertical-pod-autoscaler/pkg/recommender/model/cluster.go, line 364 in 7df4f84:

func (cluster *ClusterState) garbageCollectAggregateCollectionStates(ctx context.Context, now time.Time, controllerFetcher controllerfetcher.ControllerFetcher) {

So if the top-most-controller still exists, the GC will never prune the clusterState aggregates, and the per-VPA aggregates will not be pruned either as long as the pruning-grace-period-duration for the container is long enough. However, if the top-most-controller no longer exists, the pod is in a non-active state, and an hour passes, the GC will prune both maps, ignoring the grace period. This bug fix is so that renames/removals are caught early: it doesn't require the top-most-controller to be gone, and we don't have to wait an entire hour, as long as the user specifies the new per-container pruningGracePeriod field.

If the top-most-controller no longer exists, the per-VPA aggregates will eventually get pruned anyway, as mentioned in an earlier review comment.

adrianmoisey · 2024-12-20T06:18:00Z

vertical-pod-autoscaler/pkg/recommender/model/aggregate_container_state.go

@@ -46,6 +47,10 @@ import (
 	"k8s.io/autoscaler/vertical-pod-autoscaler/pkg/recommender/util"
 )

+var (
+	globalPruningGracePeriodDuration = flag.Duration("pruning-grace-period-duration", 24*365*100*time.Hour, `The grace period for deleting stale aggregates and recommendations. By default, set to 100 years, effectively disabling it.`)


I'm unsure what others think, but I'm not super excited by having the default be 100 years. My preference would be that setting the global to zero would disable the pruning feature (except for VPAs with pruningGracePeriod set)

Yeah, I agree it's not ideal. I was trying to figure out how I could enable/disable it with a single flag, and without specially handling a zero value for a duration. I guess I could switch to flag.String and specially handle a 0 as off. Or if there's some cleaner way to do it I'm happy for feedback.

For what it's worth, I found another flag that uses <= 0 as a way to disable it:

autoscaler/vertical-pod-autoscaler/pkg/updater/main.go, lines 62 to 64 in ce01f02:

evictionRateLimit = flag.Float64("eviction-rate-limit", -1,
	`Number of pods that can be evicted per seconds. A rate limit set to 0 or -1 will disable
	the rate limiter.`)

I changed to specially handle a flag.String in 72ae5d7

Previously we were dividing the resources per pod by the number of container aggregates, but in a situation where we're doing a rollout and the container names are changing (either a rename, or a removal) we're splitting resources across the wrong number of containers, resulting in smaller values than we should actually have. This collects a count of containers in the model when the pods are loaded, and uses the "high water mark value", so in the event we are doing something like adding a container during a rollout, we favor the pod that has the additional container. There are probably better ways to do this plumbing, but this was my initial attempt, and it does fix the issue.

Previously we were only cleaning checkpoints after something happened to the VPA or the targetRef, and so when a container got renamed the checkpoint would stick around forever. Since we're trying to clean up the aggregates immediately now, we need to force the checkpoint garbage collection to clean up any checkpoints that don't have matching aggregates. If the checkpoints did get loaded back in after a restart, PruneContainers() would take the aggregates back out, but we probably shouldn't leave the checkpoints out there. Signed-off-by: Max Cao <[email protected]>

Previously we were letting the rate limited garbage collector clean up the aggregate states, and that works really well in most cases, but when the list of containers in a pod changes, either due to the removal or rename of a container, the aggregates for the old containers stick around forever and cause problems. To get around this, this marks all existing aggregates/initial aggregates in the list for each VPA as "not under a VPA" every time before we LoadPods(), and then LoadPods() will re-mark the aggregates as "under a VPA" for all the ones that are still there, which lets us easily prune the stale container aggregates that are still marked as "not under a VPA" but are still wrongly in the VPA's list. This does leave the ultimate garbage collection to the rate limited garbage collector, which should be fine, we just needed the stale entries to get removed from the per-VPA lists so they didn't affect VPA behavior.
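
A stripped-down sketch of the mark-and-sweep pass this commit message describes. The types and method names (`markAll`, `observe`, `sweep`) are hypothetical and keyed by container name only for brevity; the real aggregates and their keys carry more information.

```go
package main

import (
	"fmt"
	"sort"
)

// aggregate is a hypothetical, simplified container aggregate.
type aggregate struct {
	underVPA bool
}

// vpa holds its aggregates keyed by container name (a simplification).
type vpa struct {
	aggregates map[string]*aggregate
}

// markAll flags every aggregate as "not under a VPA" before pods are loaded.
func (v *vpa) markAll() {
	for _, a := range v.aggregates {
		a.underVPA = false
	}
}

// observe is called for each container found while loading pods; it
// re-marks a known aggregate or creates a fresh one.
func (v *vpa) observe(container string) {
	if a, ok := v.aggregates[container]; ok {
		a.underVPA = true
		return
	}
	v.aggregates[container] = &aggregate{underVPA: true}
}

// sweep removes aggregates that were never re-marked and returns their
// names in sorted order.
func (v *vpa) sweep() []string {
	var pruned []string
	for name, a := range v.aggregates {
		if !a.underVPA {
			delete(v.aggregates, name) // deleting during range is safe in Go
			pruned = append(pruned, name)
		}
	}
	sort.Strings(pruned)
	return pruned
}

func main() {
	v := &vpa{aggregates: map[string]*aggregate{
		"app":     {underVPA: true},
		"sidecar": {underVPA: true},
	}}

	v.markAll()
	// The pods now only contain "app" and a renamed "proxy" container.
	v.observe("app")
	v.observe("proxy")

	fmt.Println("pruned:", v.sweep())
	fmt.Println("remaining:", len(v.aggregates))
}
```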

Signed-off-by: Max Cao <[email protected]>

adrianmoisey · 2025-01-03T13:32:13Z

Generally this seems OK to me.
I just want to give it a test drive locally before giving it a lgtm

I'd also like other approvers to weigh in here too

adrianmoisey · 2025-01-03T18:54:50Z

vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go

+	// By default, recommendations for non-existent containers are never pruned until its top-most controller is deleted,
+	// after which the recommendations are subject to the VPA's recommendation garbage collector.
+	// +optional
+	PruningGracePeriod *metav1.Duration `json:"pruningGracePeriod,omitempty" protobuf:"bytes,4,opt,name=pruningGracePeriod"`


I'm unsure how others feel about this, but this change adds PruningGracePeriod to verticalpodautoscaler.spec.resourcePolicy.containerPolicies

Where the description of containerPolicies is:

Per-container resource policies.
ContainerResourcePolicy controls how autoscaler computes the recommended
resources for a specific container.

Technically speaking, PruningGracePeriod isn't related to how the VPA generates recommendations. It feels wrong to put it there, but I don't know of a better location to put it.

An idea I have, which I'm not super excited by, is to use an annotation on the VPA object to drive this.

I feel the flags are a bit messy. Some are in the VPA object, some are global in the recommender, and some are in annotations. I think most flags related to how the VPA generates recommendations should be in the VPA object, while others should go in annotations. So yes, I agree.

Sure, makes sense to me. Thanks!

Changed in 70826a0

omerap12 · 2025-01-04T07:04:49Z

vertical-pod-autoscaler/pkg/recommender/model/aggregate_container_state.go

+	}
+	duration, err := time.ParseDuration(*globalPruningGracePeriodDuration)
+	if err != nil {
+		panic(fmt.Sprintf("Failed to parse --pruning-grace-period-duration: %v", err))


Instead of using panic(), we should return an error to be handled by the caller of this function. Could we modify this to return an error using klog.ErrorS() followed by os.Exit(255)?
What do you think?

Fixed in b2ae65a, does it make sense? I used klog.Fatalf instead of error and exit.

Yes, that's fine, but could you use klog.ErrorS and os.Exit? It's the correct approach for structured logging.

Hopefully cd63322 is correct? 🤞

adrianmoisey · 2025-01-05T13:17:38Z

There's potentially something wrong here. I tested it locally.
I made a deploy with 2 containers, and set the annotation vpaPruningGracePeriod: 70s on the VPA (maybe this value is too short? I'm unsure).

Everything was fine, I was getting recommendations as expected.

I then deleted the second container, and eventually saw this in the logs:

I0105 13:04:21.715400       1 cluster_feeder.go:448] "Deleting Aggregate for VPA: container no longer present" namespace="default" vpaName="hamster-vpa" containerName="hamster"
I0105 13:04:21.715442       1 cluster_feeder.go:448] "Deleting Aggregate for VPA: container no longer present" namespace="default" vpaName="hamster-vpa" containerName="hamster2"

hamster2 was the only container I had removed, but for some reason it removed both.

A while later I re-added hamster2, and eventually the VPA did this:

I0105 13:10:21.726150       1 cluster_feeder.go:448] "Deleting Aggregate for VPA: container no longer present" namespace="default" vpaName="hamster-vpa" containerName="hamster"

My guess is that it's related to the MarkAggregates() run, but I'm not too sure.

maxcao13 · 2025-01-06T21:32:07Z

There's potentially something wrong here. I tested it locally. I made a deploy with 2 containers, and set the annotation vpaPruningGracePeriod: 70s on the VPA (maybe this value is too short? I'm unsure).

Everything was fine, I was getting recommendations as expected.

I then deleted the second container, and eventually saw this in the logs:
...
My guess is that it's related to the MarkAggregates() run, but I'm not too sure.

What's happening (I think) is that removing the container makes the deployment deploy a new pod, and the VPA doesn't associate both of the old container aggregates with the new pod, so it never marks isUnderVPA=true for the next LoadPods loop.

From what I can tell, this is probably okay because AggregateStateKey includes info about the pod's labelSet to make them unique (is there a situation where this breaks?). We don't want the old pod's aggregate to contribute to the recommendation anymore (assuming that's the reason someone sets the pruning grace period).

…e for non-breaking opt-in change Signed-off-by: Max Cao <[email protected]>

k8s-ci-robot · 2025-01-06T22:56:57Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jkyros
Once this PR has been reviewed and has the lgtm label, please assign kgolab for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

vertical-pod-autoscaler/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

adrianmoisey · 2025-01-07T19:53:28Z

There's potentially something wrong here. I tested it locally. I made a deploy with 2 containers, and set the annotation vpaPruningGracePeriod: 70s on the VPA (maybe this value is too short? I'm unsure).
Everything was fine, I was getting recommendations as expected.
I then deleted the second container, and eventually saw this in the logs:
...
My guess is that it's related to the MarkAggregates() run, but I'm not too sure.

What's happening (I think) is that removing the container makes the deployment deploy a new pod, and the VPA doesn't associate both of the old container aggregates with the new pod, so it never marks isUnderVPA=true for the next LoadPods loop.

From what I can tell, this is probably okay because AggregateStateKey includes info about the pod's labelSet to make them unique (is there a situation where this breaks?). We don't want the old pod's aggregate to contribute to the recommendation anymore (assuming that's the reason someone sets the pruning grace period).

That theory sounds plausible. I worry that it's confusing to the user though, since it sounds like their recommendation gets removed.

What I'm mostly worried about is this part:

A while later I re-added hamster2, and eventually the VPA did this:

I0105 13:10:21.726150 1 cluster_feeder.go:448] "Deleting Aggregate for VPA: container no longer present" namespace="default" vpaName="hamster-vpa" containerName="hamster"
My guess is that it's related to the MarkAggregates() run, but I'm not too sure.

When this happened, the recommendations were deleted, and the status of my VPA looked like this:

status:
  conditions:
  - lastTransitionTime: "2025-01-07T19:50:10Z"
    status: "False"
    type: RecommendationProvided
  recommendation: {}

…to be linked to new containers We need to set VPAContainersPerPod for a VPA correctly so we can split resources correctly on its first run through the recommendation loop. So I opted to explicitly set it after updating the pod's containers. This also allows old aggregateContainerStates that were previously created from a removed Pod's container, to be reused by a new Pod's container that shares the same VPA and targetRef. This allows recommendations to be updated correctly when aggregates are pruned or created. Signed-off-by: Max Cao <[email protected]>

Signed-off-by: Max Cao <[email protected]>

…ong time for non-breaking opt-in change Signed-off-by: Max Cao <[email protected]>

maxcao13 · 2025-01-08T02:40:28Z

That theory sounds plausible. I worry that it's confusing to the user though, since it sounds like their recommendation gets removed.

What I'm mostly worried about is this part:

A while later I re-added hamster2, and eventually the VPA did this:

I0105 13:10:21.726150 1 cluster_feeder.go:448] "Deleting Aggregate for VPA: container no longer present" namespace="default" vpaName="hamster-vpa" containerName="hamster"
My guess is that it's related to the MarkAggregates() run, but I'm not too sure.

When this happened, the recommendations were deleted, and the status of my VPA looked like this:
status:
  conditions:
  - lastTransitionTime: "2025-01-07T19:50:10Z"
    status: "False"
    type: RecommendationProvided
  recommendation: {}

I wasn't able to make the bug appear that removes the recommendations completely, but I noticed that pruning stale aggregates wouldn't update the number of container recommendations to the new number of pod containers, which is probably the same bug.

I fixed this in 11bcee3, hopefully. I needed to explicitly link old aggregates that were used by a previous pod/container to the new pod/container. This is because the cluster.findOrCreateAggregateContainerState(containerID) function will not create a new aggregate: it finds the old one and never calls UseAggregationIfMatching, so the aggregation is never linked to the VPA object again once the old aggregates are pruned. With the explicit link in place, the GetContainerNameToAggregateStateMap function is able to aggregate all the containerAggregateStates (super confusing naming) by container name, based on how many non-stale aggregates actually still exist, for the recommender post processor.

As long as the container name and namespace are the same, all of these aggregates will contribute to a recommendation for that container (hence the name aggregates I guess 😛). PruningGracePeriod just lets you decide if "old enough" containers can still contribute or not.
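
To illustrate the point about several aggregates feeding one container's recommendation, here is a simplified sketch. The real AggregateStateKey carries more than this; the `aggregateKey` struct, its string-valued `labelSet`, and the `contributors` helper are hypothetical.

```go
package main

import "fmt"

// aggregateKey is a hypothetical, flattened version of an aggregate
// state key: the pod's label set is part of the key, so pods from
// different rollouts produce different keys for the same container name.
type aggregateKey struct {
	namespace     string
	containerName string
	labelSet      string
}

// contributors counts how many distinct aggregates share the given
// namespace and container name, regardless of label set.
func contributors(keys []aggregateKey, namespace, containerName string) int {
	count := 0
	for _, k := range keys {
		if k.namespace == namespace && k.containerName == containerName {
			count++
		}
	}
	return count
}

func main() {
	keys := []aggregateKey{
		{"default", "hamster", "app=hamster,pod-template-hash=aaa"},
		{"default", "hamster", "app=hamster,pod-template-hash=bbb"},
		{"default", "hamster2", "app=hamster,pod-template-hash=aaa"},
	}
	fmt.Println("hamster:", contributors(keys, "default", "hamster"))
	fmt.Println("hamster2:", contributors(keys, "default", "hamster2"))
}
```

Here the old pod's aggregate for hamster2 is the only one left for that name, so once it is pruned nothing contributes to a hamster2 recommendation anymore.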

adrianmoisey · 2025-01-08T12:09:46Z

vertical-pod-autoscaler/pkg/recommender/logic/recommender.go

 	var recommendation = make(RecommendedPodResources)
 	if len(containerNameToAggregateStateMap) == 0 {
 		return recommendation
 	}

-	fraction := 1.0 / float64(len(containerNameToAggregateStateMap))
+	fraction := 1.0 / float64(containersPerPod)
+	klog.V(4).Infof("Spreading recommendation across %d containers (fraction %f)", containersPerPod, fraction)


Could you switch to structured logging here... something like:

Suggested change

-	klog.V(4).Infof("Spreading recommendation across %d containers (fraction %f)", containersPerPod, fraction)
+	klog.V(4).InfoS("Spreading recommendation across containers", "containerCount", containersPerPod, "fraction", fraction)

adrianmoisey · 2025-01-08T12:27:03Z

I wasn't able to make the bug appear that removes the recommendations completely, but I noticed that pruning stale aggregates wouldn't update the number of container recommendations to the new number of pod containers, which is probably the same bug.

Yeah, it seems to be fixed now.

Something I noticed, was that the old recommendation still exists.

When I have 2 containers per Pod:

status:
  conditions:
  - lastTransitionTime: "2025-01-08T11:56:36Z"
    status: "True"
    type: RecommendationProvided
  recommendation:
    containerRecommendations:
    - containerName: hamster
      lowerBound:
        cpu: 527m
        memory: 131072k
      target:
        cpu: 627m
        memory: 131072k
      uncappedTarget:
        cpu: 627m
        memory: 131072k
      upperBound:
        cpu: "1"
        memory: 500Mi
    - containerName: hamster2
      lowerBound:
        cpu: 537m
        memory: 131072k
      target:
        cpu: 627m
        memory: 131072k
      uncappedTarget:
        cpu: 627m
        memory: 131072k
      upperBound:
        cpu: "1"
        memory: 500Mi

(notice that the memory is half of the default value of --pod-recommendation-min-memory-mb, as expected).

Then when I remove a container from the Deployment, I get this:

status:
  conditions:
  - lastTransitionTime: "2025-01-08T11:56:36Z"
    status: "True"
    type: RecommendationProvided
  recommendation:
    containerRecommendations:
    - containerName: hamster
      lowerBound:
        cpu: 558m
        memory: 262144k
      target:
        cpu: 627m
        memory: 262144k
      uncappedTarget:
        cpu: 627m
        memory: 262144k
      upperBound:
        cpu: "1"
        memory: 500Mi
    - containerName: hamster2
      lowerBound:
        cpu: 210m
        memory: 262144k
      target:
        cpu: 627m
        memory: 262144k
      uncappedTarget:
        cpu: 627m
        memory: 262144k
      upperBound:
        cpu: "1"
        memory: 500Mi

Notice that the second (now removed) container persists, and the memory (for both containers) is now correct.
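
The memory values in the two status dumps line up with a simple split of the pod minimum. A small worked sketch, assuming the default of --pod-recommendation-min-memory-mb is 250 and is interpreted as MiB (which is what the numbers above imply); `minMemoryPerContainer` is a hypothetical helper, not a function in the recommender.

```go
package main

import "fmt"

// minMemoryPerContainer splits the pod-level minimum memory evenly
// across the containers in a pod and returns the share in bytes.
func minMemoryPerContainer(podMinMiB int64, containersPerPod int) int64 {
	if containersPerPod <= 0 {
		return 0
	}
	fraction := 1.0 / float64(containersPerPod)
	return int64(float64(podMinMiB*1024*1024) * fraction)
}

func main() {
	const podMinMiB = 250 // assumed default for --pod-recommendation-min-memory-mb

	for _, n := range []int{2, 1} {
		bytes := minMemoryPerContainer(podMinMiB, n)
		// Print in the same "k" (thousands of bytes) unit the VPA status uses.
		fmt.Printf("%d container(s): %dk\n", n, bytes/1000)
	}
}
```

With two containers the share is 131072k, and with one it is 262144k, matching the memory targets shown before and after the container was removed.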

I had assumed that as part of this PR, the removed recommendation should be removed from the status field.

maxcao13 · 2025-01-08T19:52:29Z

I assume for your VPA object, there is no pruningGracePeriod set or it's very long? I get the same behaviour when I do that.

If so, yeah, I guess that's a side effect of the aggregates not getting pruned: there will still be stale recommendations in the VPA (remember CronJob pods!). To make it less confusing to the user, maybe there could be an extra field to mark it as visibly stale? Maybe it would also make more sense that the recommendation for hamster2 is not updated at all if it goes away?
e.g.
vpaPruningGracePeriod not set; container "hamster2" was removed:

    recommendation:
      containerRecommendations:
      - containerName: hamster
        lowerBound:
          cpu: 582m
          memory: 262144k
        target:
          cpu: 587m
          memory: 262144k
        uncappedTarget:
          cpu: 587m
          memory: 262144k
        upperBound:
          cpu: "1"
          memory: 262144k
      - containerName: hamster2
        lowerBound:
          cpu: 403m
          memory: same as before
        target:
          cpu: 587m
          memory: same as before
        uncappedTarget:
          cpu: 587m
          memory: same as before
        upperBound:
          cpu: "1"
          memory: 500Mi
        stale: true

EDIT:

Alternatively, I just thought of this, and if we don't want stale recommendations to appear regardless of grace period, but still want cronjob recommendations to appear, then maybe we can do:

If at least one pod under a targetRef for a VPA object exists, but some container for that pod does not exist anymore, don't keep the recommendation for that non-existent container but keep the aggregate in-memory in case it comes back.
If there are no pods under a targetRef for a VPA object (but the targetRef/top-level-controller still exists), do not prune, and keep all the recommendations in the VPA object and all the aggregates.
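
Spelled out as code, the two rules above might look like the sketch below. This is only an illustration of the proposal, not something implemented in the PR; the `decide` function and its parameters are hypothetical.

```go
package main

import "fmt"

// decide applies the two proposed rules for one container of a VPA
// whose targetRef/top-level controller still exists.
//   - No pods under the targetRef: keep everything (scale-to-zero, CronJobs).
//   - Pods exist but none has this container: drop the recommendation,
//     keep the in-memory aggregate in case the container comes back.
func decide(podsUnderTarget int, containerPresent bool) (keepRecommendation, keepAggregate bool) {
	if podsUnderTarget == 0 {
		return true, true
	}
	if containerPresent {
		return true, true
	}
	return false, true
}

func main() {
	rec, agg := decide(0, false)
	fmt.Println("scaled to zero:", rec, agg)

	rec, agg = decide(3, true)
	fmt.Println("container present:", rec, agg)

	rec, agg = decide(3, false)
	fmt.Println("container removed:", rec, agg)
}
```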

Thinking about it, I think this alternative solution would remove the need for the pruningGracePeriod altogether, and maybe it's not useful to keep as a feature.

I think I figured out another edge-case that needs to be handled.

Sometimes in my day-job, when we're firefighting issues, we may scale a deployment down to zero to alleviate pressure on other parts of the system. I think this use-case is possibly common, tools such as https://keda.sh/ are built to allow users to scale workloads down to zero, to save costs.

The issue here would still be covered, since if there are no pods for a deployment we don't remove anything, and the CronJob issue would be solved as well. Are there any bugs I'm missing in this solution?

adrianmoisey · 2025-01-09T04:39:55Z

Oh hold on, I was too impatient and wasn't waiting for the grace period to actually remove the recommendation. My bad!

Maybe it would also make more sense that the recommendation for hamster2 is not updated at all if it goes away?

I think this may make sense. Updating the recommendation for a container that's about to be removed seems to show a sign of life; maybe it needs to be ignored instead.

The issue here would still be covered, since if there are no pods for a deployment we don't remove anything, and the CronJob issue would be solved as well. Are there any bugs I'm missing in this solution?

Nope, I think the solution is fine at the moment (besides the point above, which is a mild confusion, and not a big deal), I was just not waiting the grace period that I had set.

I'll give this a few more tests though, just to see if anything else crops up

k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. labels Apr 22, 2024

k8s-ci-robot requested review from krzysied and voelzmo April 22, 2024 15:40

k8s-ci-robot added area/vertical-pod-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 22, 2024

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 22, 2024

jkyros marked this pull request as ready for review April 22, 2024 17:56

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 22, 2024

k8s-ci-robot requested a review from kgolab April 22, 2024 17:57

kwiesmueller reviewed Apr 29, 2024

voelzmo mentioned this pull request Jun 3, 2024

VPA updater constantly fails to match the container that doesn't even exists #6215

Closed

k8s-ci-robot added the lifecycle/stale label Jul 31, 2024

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Aug 30, 2024

k8s-ci-robot removed the lifecycle/rotten label Sep 20, 2024

voelzmo mentioned this pull request Dec 3, 2024

Pass the whole VPA into cappingRecommendationProcessor.Apply() #7527

Merged

k8s-ci-robot assigned adrianmoisey Dec 3, 2024

k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Dec 3, 2024

maxcao13 force-pushed the vpa-aggregates-fix-mark-sweep branch 2 times, most recently from c3a1f0e to c84075b Compare December 3, 2024 22:29

maxcao13 force-pushed the vpa-aggregates-fix-mark-sweep branch from 098c138 to 808c62b Compare December 17, 2024 21:52

adrianmoisey reviewed Dec 20, 2024


jkyros and others added 7 commits January 2, 2025 08:17

VPA: Slightly improve runtime on splitting aggregates

679eafe

Signed-off-by: Max Cao <[email protected]>

VPA: Add e2e test for spliting recommendations when removing/renaming…

4c522a0

… containers Signed-off-by: Max Cao <[email protected]>

Introduce pruningGracePeriod which allows a grace period before the v…

e6a8d24

…pa prunes recommendations from non-existente containers Signed-off-by: Max Cao <[email protected]>

Add globalPruningGracePeriod flag which defaults to a long time for n…

72ae5d7

…on-breaking opt-in change Signed-off-by: Max Cao <[email protected]>

maxcao13 force-pushed the vpa-aggregates-fix-mark-sweep branch from 808c62b to 72ae5d7 Compare January 2, 2025 17:41

adrianmoisey reviewed Jan 3, 2025


maxcao13 added 2 commits January 3, 2025 16:15

VPA: refactor pruning grace period as vpa apiObject annotation instea…

70826a0

…d of containerPolicy Signed-off-by: Max Cao <[email protected]>

fixup! VPA: refactor pruning grace period as vpa apiObject annotation…

b5e69fe

… instead of containerPolicy Signed-off-by: Max Cao <[email protected]>

omerap12 reviewed Jan 4, 2025


fixup! Add globalPruningGracePeriod flag which defaults to a long tim…

b2ae65a

…e for non-breaking opt-in change Signed-off-by: Max Cao <[email protected]>

maxcao13 added 3 commits January 7, 2025 17:55

fixup! VPA: immediately prune stale vpa aggregates

e9e5550

Signed-off-by: Max Cao <[email protected]>

fixup! fixup! Add globalPruningGracePeriod flag which defaults to a l…

cd63322

…ong time for non-breaking opt-in change Signed-off-by: Max Cao <[email protected]>

adrianmoisey reviewed Jan 8, 2025


	evictionRateLimit = flag.Float64("eviction-rate-limit", -1,
	`Number of pods that can be evicted per seconds. A rate limit set to 0 or -1 will disable
	the rate limiter.`)

	klog.V(4).Infof("Spreading recommendation across %d containers (fraction %f)", containersPerPod, fraction)
	klog.V(4).InfoS("Spreading recommendation across containers", "containerCount", containersPerPod, "fraction", fraction)
