Retry for Machine reconciliation happening quicker than cache update leading to `the object has been modified` errors #767

himanshu-kun · 2023-01-16T07:05:03Z

How to categorize this issue?

/area robustness
/kind bug
/priority 2

What happened:
We have seen cases where updating the machine object fails with `the object has been modified; please apply your changes to the latest version and try again` errors.
Example:

I0113 11:04:02.491801       1 machine.go:509] Machine labels/annotations UPDATE for "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2"

I0113 11:04:02.790670       1 core.go:203] Machine get request has been processed successfully for "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2"
I0113 11:04:02.822071       1 machine.go:537] Machine/status UPDATE for "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2" during creation
I0113 11:04:03.120387       1 core.go:203] Machine get request has been processed successfully for "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2"
W0113 11:04:03.147829       1 machine.go:535] Machine/status UPDATE failed for "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2". Retrying, error: Operation cannot be fulfilled on machines.machine.sapcloud.io "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2": the object has been modified; please apply your changes to the latest version and try again
W0113 11:04:03.147829       1 machine.go:535] Machine/status UPDATE failed for "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2". Retrying, error: Operation cannot be fulfilled on machines.machine.sapcloud.io "shoot--it--tmlf6-sy3-worker-1-z2-7f9bb-zplw2": the object has been modified; please apply your changes to the latest version and try again
I0113 11:04:25.815790       1 machine_util.go:628] Conditions of Machine "shoot--it--tmlf6-sy3-worker-1-z1-78f67-xfvch" with providerID "azure:///northeurope/shoot--it--tmlf6-sy3-worker-1-z1-78f67-xfvch" and backing node "shoot--it--tmlf6-sy3-worker-1-z1-78f67-xfvch" are changing

This can cause ShortRetry or MediumRetry to kick in for the machine object, so the next reconcile happens only after seconds or even minutes. (Here it took around 20 seconds before the machine conditions started updating.) As a result, machine conditions may not update quickly, or the machine object may not reach Running quickly.

This quick push to the queue happens because we currently also enqueue machine objects on status updates. In small clusters this causes problems like the one described above, but in big clusters it is helpful: with many machines in the queue, a machine object's turn could come quite late, so a quick push to the queue helps reduce that wait.

What you expected to happen:
The next machine reconcile should not be delayed because of `the object has been modified` errors.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

Kubernetes version (use kubectl version):
Cloud provider or hardware configuration:
Others:


himanshu-kun · 2023-01-16T07:06:22Z

A solution is to treat `the object has been modified` as a special error and re-push the object after around 2 to 5 seconds when it is seen. By then, the cache would have been updated as well.
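A minimal sketch of that idea, using only the standard library. The sentinel `errConflict`, the function `requeueAfter`, and both durations are hypothetical names and values chosen for illustration; in a real controller the conflict would more likely be detected with `apierrors.IsConflict` from `k8s.io/apimachinery/pkg/api/errors`.

```go
package main

import (
	"errors"
	"time"
)

// errConflict is a hypothetical stand-in for the API server's conflict
// response ("the object has been modified"). It is not an MCM identifier.
var errConflict = errors.New("the object has been modified")

// Illustrative retry periods. The values are assumptions for this sketch,
// not the constants used by MCM.
const (
	conflictRetry = 3 * time.Second  // inside the proposed 2 to 5 second window
	defaultRetry  = 15 * time.Second // stand-in for the regular short retry
)

// requeueAfter returns how long to wait before re-adding the machine key to
// the work queue. A zero duration means no requeue is needed.
func requeueAfter(err error) time.Duration {
	switch {
	case err == nil:
		return 0
	case errors.Is(err, errConflict):
		// Conflicts are expected to resolve once the cache catches up,
		// so retry sooner than for other failures.
		return conflictRetry
	default:
		return defaultRetry
	}
}
```

The caller would pass the error returned by the status update to `requeueAfter` and feed the result to the queue's delayed-add call.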
cc @rishabh-11

himanshu-kun · 2023-01-16T07:56:45Z

A PR (using controller-runtime) that ignores status update events when the status is semantically equal:
apache/camel-k#3285

Could be worth looking into when working on #724
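The kind of filter that PR describes could look roughly like the sketch below. The `machine` and `machineStatus` types and the `shouldEnqueue` function are simplified, hypothetical stand-ins, and `reflect.DeepEqual` is used only to keep the example self-contained; with the real API types, `equality.Semantic.DeepEqual` from `k8s.io/apimachinery/pkg/api/equality` would probably be the better fit.

```go
package main

import "reflect"

// machineStatus and machine are simplified, hypothetical stand-ins for the
// real Machine API types.
type machineStatus struct {
	Phase      string
	Conditions []string
}

type machine struct {
	Generation int64
	Labels     map[string]string
	Status     machineStatus
}

// shouldEnqueue reports whether an update event carries a change that is
// worth reconciling. Events where nothing of interest changed are dropped.
func shouldEnqueue(oldObj, newObj machine) bool {
	// A generation bump means the spec changed.
	if oldObj.Generation != newObj.Generation {
		return true
	}
	// Label changes are still reconciled.
	if !reflect.DeepEqual(oldObj.Labels, newObj.Labels) {
		return true
	}
	// Status updates are enqueued only if the status really differs.
	return !reflect.DeepEqual(oldObj.Status, newObj.Status)
}
```

With this filter, an update event whose old and new objects differ only in fields outside generation, labels, and status would not be pushed to the queue.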

unmarshall · 2023-01-16T08:12:15Z

An alternative could be to use SSA (server side apply). Also check reconstructive-controllers.

himanshu-kun · 2023-01-17T08:44:01Z

Google Groups discussion on this kind of issue: https://groups.google.com/g/kubebuilder/c/tULj-TRM9ts?pli=1

himanshu-kun · 2023-02-17T08:33:28Z

Solution decided post grooming

We saw that we face this problem primarily because of a stale cache. The earlier proposal was to let the cache sync by retrying the machine object after around 2 to 5 seconds:

> A solution is to treat `the object has been modified` as a special error and re-push the object after around 2 to 5 seconds when it is seen. By then, the cache would have been updated as well.

However, we then decided to use the `WaitForCacheSync` function instead. Since the problem is currently seen only in the machine controller, we'll deal with it there by adding `WaitForCacheSync` right at the beginning of the `reconcileClusterMachine` func.
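To make the shape of that change concrete, here is a rough standard-library sketch. `waitForSync` is a simplified stand-in for client-go's `cache.WaitForCacheSync(stopCh, cacheSyncs...)`, and `reconcileMachine`, `errNotSynced`, and the 10 ms poll interval are hypothetical choices for this example, not the actual MCM code.

```go
package main

import (
	"errors"
	"time"
)

// errNotSynced is a hypothetical error returned when waiting was aborted.
var errNotSynced = errors.New("cache not synced")

// waitForSync polls the given sync checks until all of them report true or
// the stop channel is closed. It is a simplified stand-in for
// cache.WaitForCacheSync from k8s.io/client-go/tools/cache.
func waitForSync(stop <-chan struct{}, interval time.Duration, synced ...func() bool) bool {
	for {
		all := true
		for _, s := range synced {
			if !s() {
				all = false
				break
			}
		}
		if all {
			return true
		}
		select {
		case <-stop:
			return false
		case <-time.After(interval):
			// Poll again on the next iteration.
		}
	}
}

// reconcileMachine sketches a reconcile function that waits for the cache
// before doing any work, so that it does not act on a stale object.
func reconcileMachine(stop <-chan struct{}, synced func() bool, reconcile func() error) error {
	if !waitForSync(stop, 10*time.Millisecond, synced) {
		return errNotSynced
	}
	return reconcile()
}
```

In the real controller, the `synced` argument would be the informer's sync check and `reconcile` would be the existing body of the reconcile function.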

himanshu-kun added the kind/bug Bug label Jan 16, 2023

gardener-robot added area/robustness Robustness, reliability, resilience related priority/2 Priority (lower number equals higher priority) labels Jan 16, 2023

himanshu-kun added area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related needs/planning Needs (more) planning with other MCM maintainers labels Feb 17, 2023

This was referenced Sep 12, 2023

Fix controller.machineStatusUpdate to retry on conflict #838

Closed

Support for IaaS machine tags for all machines in a worker pool #750

Open

himanshu-kun assigned piyuagr Oct 17, 2023

rishabh-11 assigned himanshu-kun and unassigned piyuagr Nov 21, 2023

himanshu-kun mentioned this issue Dec 1, 2023

Reduce noisy reconciles + enhance logs #877

Merged

2 tasks

aaronfern closed this as completed in #877 Dec 15, 2023

gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Dec 15, 2023
