Retry for Machine reconciliation happening quicker than cache update leading to the object has been modified errors #767

Closed
himanshu-kun opened this issue Jan 16, 2023 · 5 comments · Fixed by #877
Assignees
Labels
area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related area/robustness Robustness, reliability, resilience related kind/bug Bug needs/planning Needs (more) planning with other MCM maintainers priority/2 Priority (lower number equals higher priority) status/closed Issue is closed (either delivered or triaged)

Comments

@himanshu-kun
Contributor

himanshu-kun commented Jan 16, 2023

How to categorize this issue?

/area robustness
/kind bug
/priority 2

What happened:
We have seen cases where updating the machine object fails with "the object has been modified; please apply your changes to the latest version and try again" errors.
Example:

I0113 11:04:02.491801       1 machine.go:509] Machine labels/annotations UPDATE for "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2"

I0113 11:04:02.790670       1 core.go:203] Machine get request has been processed successfully for "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2"
I0113 11:04:02.822071       1 machine.go:537] Machine/status UPDATE for "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2" during creation
I0113 11:04:03.120387       1 core.go:203] Machine get request has been processed successfully for "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2"
W0113 11:04:03.147829       1 machine.go:535] Machine/status UPDATE failed for "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2". Retrying, error: Operation cannot be fulfilled on machines.machine.sapcloud.io "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2": the object has been modified; please apply your changes to the latest version and try again
W0113 11:04:03.147829       1 machine.go:535] Machine/status UPDATE failed for "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2". Retrying, error: Operation cannot be fulfilled on machines.machine.sapcloud.io "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2": the object has been modified; please apply your changes to the latest version and try again
I0113 11:04:25.815790       1 machine_util.go:628] Conditions of Machine "shoot--it--tmlf6-sy3-worker-1-z1-78f67-xfvch" with providerID "azure:///northeurope/shoot--it--tmlf6-sy3-worker-1-z1-78f67-xfvch" and backing node "shoot--it--tmlf6-sy3-worker-1-z1-78f67-xfvch" are changing

This can cause our ShortRetry or MediumRetry to kick in for the machine object, so the next reconcile may happen only after seconds or even minutes (here, machine conditions started updating about 20 seconds later). As a result, machine conditions may not update promptly, or the machine object may be slow to reach Running.

This quick push to the queue happens because we currently also enqueue machine objects on status updates. In small clusters this causes the problems described above, but in big clusters it is helpful: with many machines in the queue, a machine object's turn could come quite late, so a quick push to the queue helps reduce that wait.

What you expected to happen:
The next machine reconcile should not be delayed by "the object has been modified" errors.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:
@gardener-robot gardener-robot added area/robustness Robustness, reliability, resilience related priority/2 Priority (lower number equals higher priority) labels Jan 16, 2023
@himanshu-kun
Contributor Author

A solution is to treat "the object has been modified" as a special error and re-push the object after around 2 to 5 seconds when it is seen. By then, the cache would be updated as well.
cc @rishabh-11
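A minimal sketch of that proposal, using only the standard library. The sentinel error, the function name retryAfter, and both durations are assumptions made for illustration; in a real controller the check would more likely be apierrors.IsConflict(err) from k8s.io/apimachinery/pkg/api/errors, and the returned delay would be handed to the workqueue's AddAfter.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errConflict is a stand-in for the API server's conflict error
// ("the object has been modified; ..."). Hypothetical sentinel for this sketch.
var errConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// errProvider is an arbitrary non-conflict error used for the demo.
var errProvider = errors.New("provider request timed out")

const (
	// conflictRetry is an assumed value inside the proposed 2 to 5 second window.
	conflictRetry = 3 * time.Second
	// defaultRetry is a placeholder for the controller's usual retry period,
	// not the actual ShortRetry or MediumRetry value.
	defaultRetry = 20 * time.Second
)

// retryAfter returns how long to wait before re-queuing the machine.
// A conflict gets the short delay so the cache has time to catch up;
// any other error falls back to the regular retry period.
func retryAfter(err error) time.Duration {
	switch {
	case err == nil:
		return 0 // nothing failed, no error-driven requeue
	case errors.Is(err, errConflict):
		return conflictRetry
	default:
		return defaultRetry
	}
}

func main() {
	wrapped := fmt.Errorf("Machine/status UPDATE failed: %w", errConflict)
	fmt.Println("conflict:", retryAfter(wrapped))
	fmt.Println("other:   ", retryAfter(errProvider))
}
```

Using errors.Is keeps the check working when the conflict is wrapped with extra context, as it is in the log lines above.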

@himanshu-kun
Contributor Author

A PR that, in controller-runtime, ignores status-change events when the status is semantically equal:
apache/camel-k#3285

Could be worth looking into when working on #724
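The idea of dropping update events whose status has not really changed can be sketched as follows. This is only an illustration: the machine and machineStatus types and the shouldEnqueue function are hypothetical, and reflect.DeepEqual stands in for a semantic comparison such as equality.Semantic.DeepEqual from k8s.io/apimachinery/pkg/api/equality.

```go
package main

import (
	"fmt"
	"reflect"
)

// machineStatus and machine are simplified, hypothetical stand-ins
// for the real Machine API types.
type machineStatus struct {
	Phase      string
	Conditions []string
}

type machine struct {
	Name            string
	ResourceVersion string
	Labels          map[string]string
	Status          machineStatus
}

// shouldEnqueue reports whether an update event deserves a reconcile.
// Events that only bump ResourceVersion, leaving labels and status
// equal, are ignored so they do not trigger a needless reconcile.
func shouldEnqueue(oldObj, newObj machine) bool {
	labelsChanged := !reflect.DeepEqual(oldObj.Labels, newObj.Labels)
	statusChanged := !reflect.DeepEqual(oldObj.Status, newObj.Status)
	return labelsChanged || statusChanged
}

func main() {
	old := machine{Name: "machine-a", ResourceVersion: "1", Status: machineStatus{Phase: "Pending"}}

	// Same content, newer resource version: no reconcile needed.
	resync := old
	resync.ResourceVersion = "2"

	// Phase actually changed: reconcile.
	running := old
	running.ResourceVersion = "3"
	running.Status.Phase = "Running"

	fmt.Println("resync enqueued: ", shouldEnqueue(old, resync))
	fmt.Println("running enqueued:", shouldEnqueue(old, running))
}
```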

@unmarshall
Contributor

An alternative could be to use SSA (server-side apply). Also check reconstructive-controllers.

@himanshu-kun
Contributor Author

Google Groups discussion on this kind of issue: https://groups.google.com/g/kubebuilder/c/tULj-TRM9ts?pli=1

@himanshu-kun
Contributor Author

Solution decided post grooming

We saw that we face this problem primarily because of a stale cache. The earlier proposal was to let the cache sync by retrying the machine object after around 2 to 5 seconds:

A solution is to treat "the object has been modified" as a special error and re-push the object after around 2 to 5 seconds when it is seen. By then, the cache would be updated as well.

But then we decided to use the WaitForCacheSync function instead. Since the problem is currently seen only in the machine controller, we'll deal with it there by adding WaitForCacheSync right at the beginning of the reconcileClusterMachine func.
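A rough sketch of the shape this could take, written against the standard library only. The waitForCacheSync function below is a simplified stand-in that mimics the signature of cache.WaitForCacheSync from k8s.io/client-go/tools/cache; the reconcileClusterMachine signature, the polling interval, and the error value are assumptions for illustration and not the actual MCM code.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// informerSynced mirrors the shape of client-go's cache.InformerSynced.
type informerSynced func() bool

// waitForCacheSync is a simplified stand-in for cache.WaitForCacheSync:
// it polls until every sync func reports true, or returns false once
// stopCh is closed.
func waitForCacheSync(stopCh <-chan struct{}, syncs ...informerSynced) bool {
	ticker := time.NewTicker(10 * time.Millisecond) // assumed poll interval
	defer ticker.Stop()
	for {
		allSynced := true
		for _, synced := range syncs {
			if !synced() {
				allSynced = false
				break
			}
		}
		if allSynced {
			return true
		}
		select {
		case <-stopCh:
			return false
		case <-ticker.C:
		}
	}
}

var errCacheNotSynced = errors.New("stopped before machine cache synced")

// reconcileClusterMachine is a hypothetical skeleton: it waits for the
// cache first, and only then would run the real reconciliation logic.
func reconcileClusterMachine(name string, stopCh <-chan struct{}, machineSynced informerSynced) error {
	if !waitForCacheSync(stopCh, machineSynced) {
		return errCacheNotSynced
	}
	// ... the actual reconciliation of the machine would go here ...
	return nil
}

func main() {
	var synced atomic.Bool
	go func() {
		time.Sleep(30 * time.Millisecond) // simulate the cache catching up
		synced.Store(true)
	}()

	stopCh := make(chan struct{})
	err := reconcileClusterMachine("machine-a", stopCh, synced.Load)
	fmt.Println("reconcile error:", err)
}
```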

@himanshu-kun himanshu-kun added area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related needs/planning Needs (more) planning with other MCM maintainers labels Feb 17, 2023
@rishabh-11 rishabh-11 assigned himanshu-kun and unassigned piyuagr Nov 21, 2023
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Dec 15, 2023