fix: force strict mode for patch for safe concurrent writes #3912
base: main
Conversation
Skipping CI for Draft Pull Request.
✅ Deploy Preview for kubernetes-sigs-kueue ready!
/test all
/cc @mbobrovskyi
/lgtm
Thanks!
LGTM label has been added. Git tree hash: 8952d4f81acb656dc70e2e9ee998c643ab0a4228
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: mbobrovskyi, mykysha
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
cc: @troychiu
AFAIK the non-strict patch is the intended behavior.
Could you add test cases demonstrating the situation where the non-strict patch removes non-owned finalizers (kueue.x-k8s.io/managed) from the Pod?
Additionally, what is the reason for enforcing the strict patch everywhere?
I'm not sure about that, actually; dropping finalizers successfully added by other controllers is not ideal.
Race conditions like this might be tricky to test. How would you imagine the test: an e2e test that repeats the operation multiple times, or rather a unit test?
RemoveFinalizer was the only use of non-strict mode, so it makes sense to drop the flag if no other place wants to use it. However, I think there is some risk in using strict patches: a possible performance impact due to the necessary retries. I don't expect it to be a problem, because we already manage scheduling gates this way and it wasn't an issue there. Still, for safety I would suggest adding a feature gate.
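To make the discussed race concrete, here is a minimal, hypothetical Go sketch using the controller-runtime client (function names are illustrative and do not reproduce Kueue's actual clientutil helper). A JSON merge patch computed from a stale copy rewrites the finalizers list wholesale, so it can drop a finalizer another controller added in the meantime; adding client.MergeFromWithOptimisticLock{} ties the patch to the object's resourceVersion, so the concurrent write surfaces as a conflict instead of being silently overwritten.

```go
// Illustrative only: "non-strict" vs "strict" patch for removing a finalizer.
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// nonStrictRemoveFinalizer computes a JSON merge patch against a possibly
// stale base. The patch replaces metadata.finalizers as a whole, so a
// finalizer added concurrently by another controller can be dropped.
func nonStrictRemoveFinalizer(ctx context.Context, c client.Client, pod *corev1.Pod, fin string) error {
	base := pod.DeepCopy()
	controllerutil.RemoveFinalizer(pod, fin)
	return c.Patch(ctx, pod, client.MergeFrom(base))
}

// strictRemoveFinalizer includes the resourceVersion in the patch
// (optimistic locking): if the Pod changed since it was read, the API server
// rejects the patch with a conflict and the caller re-reads and retries.
func strictRemoveFinalizer(ctx context.Context, c client.Client, pod *corev1.Pod, fin string) error {
	base := pod.DeepCopy()
	controllerutil.RemoveFinalizer(pod, fin)
	return c.Patch(ctx, pod, client.MergeFromWithOptions(base, client.MergeFromWithOptimisticLock{}))
}
```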
/uncc
Yes, performance was my primary concern, due to requeuing objects to the reconcilers. @mimowo WDYT?
Sounds reasonable. I'm just wondering how much insight we get from artificial testing, since the scale requirement isn't clear to me. Some users will run this at a large scale, and I'm not sure how representative small-scale testing would be; I suppose we would need at least 10000 pods. WDYT @tenzen-y @mykysha? Alternatively (or along with that), we could enable it by default and wait three releases before graduation to get feedback from users.
Both approaches sound good to me. I believe the user-feedback approach would be more informative; however, doing some performance measurement alongside definitely wouldn't hurt either.
I think we will not need a KEP process for that, but let me check if @tenzen-y is ok with the plan.
That makes sense. We should verify the performance impact in a large environment so the issue is represented clearly.
I'm ok without a KEP. But at the least, could we summarize the graduation criteria and background in a dedicated issue?
Right, I think it might be tricky to automate and would consume our build time, so I'm leaning towards just a manual experiment, either in a real k8s cluster or kwok. I think a scale of 10000 pods would be enough; they don't need to be running at the same time. Please also confirm how many conflict errors we get in that case.
SGTM. We can add a TODO(#&lt;issue number&gt;) comment. @mykysha, please open the issue and update the PR accordingly.
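For illustration, a hedged sketch of how the feature gate suggested above could be declared, following the component-base featuregate pattern Kueue already uses in pkg/features. The gate name, default, and stage below are placeholders chosen for this example, not something defined by this PR or the issue.

```go
// Hypothetical feature gate declaration; name, default, and stage are
// placeholders for illustration only.
package features

import (
	"k8s.io/apimachinery/pkg/util/runtime"
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/component-base/featuregate"
)

const (
	// StrictPatchForFinalizers would guard the strict (optimistic-lock) patch,
	// so it could be turned off if conflict retries prove costly at scale.
	StrictPatchForFinalizers featuregate.Feature = "StrictPatchForFinalizers"
)

func init() {
	runtime.Must(utilfeature.DefaultMutableFeatureGate.Add(defaultFeatureGates))
}

var defaultFeatureGates = map[featuregate.Feature]featuregate.FeatureSpec{
	// Enabled by default, with graduation deferred a few releases to gather
	// user feedback, as discussed above.
	StrictPatchForFinalizers: {Default: true, PreRelease: featuregate.Beta},
}
```

Call sites would then check the gate, e.g. utilfeature.DefaultFeatureGate.Enabled(StrictPatchForFinalizers), before choosing the strict patch path.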
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Enforce strict mode in Patch operations to remove possible race conditions.
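As a rough illustration of what enforcing strict mode could look like end to end (an assumption-laden sketch, not Kueue's clientutil code), every patch carries the resourceVersion and conflicts are retried after a fresh read. The retry loop is also the source of the performance concern raised in the review above: under heavy contention, each conflict costs an extra GET and PATCH.

```go
// Hypothetical sketch: strict patch with retry-on-conflict. Names are
// illustrative and do not reproduce Kueue's clientutil package.
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// removeFinalizerStrict re-reads the Pod on every attempt so the strict patch
// is always computed against the latest resourceVersion; conflicts caused by
// concurrent writers are retried with backoff instead of clobbering them.
func removeFinalizerStrict(ctx context.Context, c client.Client, key types.NamespacedName, fin string) error {
	return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
		pod := &corev1.Pod{}
		if err := c.Get(ctx, key, pod); err != nil {
			return err
		}
		base := pod.DeepCopy()
		if !controllerutil.RemoveFinalizer(pod, fin) {
			return nil // finalizer already absent, nothing to patch
		}
		return c.Patch(ctx, pod, client.MergeFromWithOptions(base, client.MergeFromWithOptimisticLock{}))
	})
}
```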
Which issue(s) this PR fixes:
Fixes #3899
Special notes for your reviewer:
Does this PR introduce a user-facing change?