KEP-2862: Graduate to BETA. #5016

2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-node/2862.yaml
Expand Up @@ -4,3 +4,5 @@
kep-number: 2862
alpha:
approver: "@jpbetz"
beta:
approver: "@jpbetz"
79 changes: 77 additions & 2 deletions keps/sig-node/2862-fine-grained-kubelet-authz/README.md
Expand Up @@ -784,13 +784,20 @@ rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->

We have designed a fallback mechanism that prevents failed rollouts or rollbacks
from impacting an already running workload's ability to interact with the kubelet API.

Please see the [Design Details](#design-details) section for more information.

###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->

An increase in failed requests from workloads to the kubelet API.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

<!--
Expand All @@ -799,11 +806,28 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

We have tested the following upgrade scenarios manually:

| Scenario | Result |
| -------- | ------ |
| Upgrade both kubelet and kube-apiserver so that the feature gate is enabled in both. | Workloads and kube-apiserver are able to reach kubelet. |
| Upgrade only kubelet to enable the feature gate. | Workloads and kube-apiserver are able to reach kubelet. |
| Upgrade only kube-apiserver to enable the feature gate. | Workloads and kube-apiserver are able to reach kubelet. |

We have tested the following rollback scenarios manually:

| Scenario | Result |
| -------- | ------ |
| Roll back both kubelet and kube-apiserver so that the feature gate is disabled in both. | Workloads and kube-apiserver are able to reach kubelet. |
| Roll back only kubelet to disable the feature gate. | Workloads and kube-apiserver are able to reach kubelet. |
| Roll back only kube-apiserver to disable the feature gate. | Workloads and kube-apiserver are able to reach kubelet. |

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

<!--
Even if applying deprecation policies, they may still surprise some users.
-->
No.

### Monitoring Requirements

Expand All @@ -822,6 +846,28 @@ checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->

Users can check if this feature is enabled in kube-apiserver by running the
following command:

```sh
kubectl get --raw /metrics | grep kubernetes_feature_enabled | grep KubeletFineGrainedAuthz
```
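
The `grep` pipeline prints the raw metric line; the value at the end of that line (`1` or `0`) is what indicates whether the gate is on. A minimal sketch of a check that turns this into an exit status, assuming the metric is exposed in the usual Prometheus text form `kubernetes_feature_enabled{name="...",stage="..."} <value>`. The helper name `feature_gate_enabled` is hypothetical and not part of any Kubernetes tooling.

```sh
# Hypothetical helper: reads Prometheus text-format metrics on stdin and
# succeeds (exit 0) only if the named feature gate reports a value of 1.
# Assumes lines of the form:
#   kubernetes_feature_enabled{name="<gate>",stage="<stage>"} <0|1>
feature_gate_enabled() {
  gate="$1"
  awk -v gate="$gate" '
    index($0, "kubernetes_feature_enabled{") == 1 &&
    index($0, "name=\"" gate "\"") > 0 {
      found = 1
      enabled = ($NF == 1)
    }
    END { exit (found && enabled) ? 0 : 1 }
  '
}

# Usage (requires access to the kube-apiserver metrics endpoint):
#   kubectl get --raw /metrics | feature_gate_enabled KubeletFineGrainedAuthz \
#     && echo "enabled" || echo "disabled or not reported"
```

A missing metric line is treated the same as a disabled gate, which is the conservative reading when a component does not report the gate at all.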

Users can check if this feature is enabled in the kubelet by running one of the
following commands in a pod that is running on the node:

If the read-only port is enabled:
```sh
curl http://$MY_NODE_IP:10255/metrics | grep kubernetes_feature_enabled | grep KubeletFineGrainedAuthz
```

If the read-only port is not enabled:
```sh
curl -k https://$MY_NODE_IP:10250/metrics | grep kubernetes_feature_enabled | grep KubeletFineGrainedAuthz
```

NOTE: For port 10250, the pod needs the right RBAC bindings (if RBAC is enabled) to view the metrics.
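
For the RBAC bindings to apply, the request to port 10250 has to identify the pod. A minimal sketch of one way to do that, assuming the pod has a service account token mounted at the standard in-pod path and that the kubelet accepts bearer tokens. The helper name `kubelet_metrics` and the `CURL` override are hypothetical and exist only for illustration.

```sh
# Hypothetical helper: fetch kubelet metrics over the authenticated port
# (10250) using the pod's mounted service account token.
# $1: node IP; $2 (optional): token file, defaulting to the standard
# in-pod service account token path.
# CURL can be overridden (for example CURL=echo) to dry-run the command.
kubelet_metrics() {
  node_ip="$1"
  token_file="${2:-/var/run/secrets/kubernetes.io/serviceaccount/token}"
  # -k skips verification of the kubelet serving certificate, as in the
  # command above; -s silences the progress meter.
  "${CURL:-curl}" -sk \
    -H "Authorization: Bearer $(cat "$token_file")" \
    "https://${node_ip}:10250/metrics"
}

# Usage from inside a pod, with MY_NODE_IP set (for example via the
# downward API field status.hostIP):
#   kubelet_metrics "$MY_NODE_IP" | grep kubernetes_feature_enabled | grep KubeletFineGrainedAuthz
```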

###### How can someone using this feature know that it is working for their instance?

<!--
Expand All @@ -838,8 +884,8 @@ Recall that end users cannot usually observe component logs or access metrics.
- [ ] API .status
- Condition name:
- Other field:
- [ ] Other (treat as last resort)
- Details:
- [x] Other (treat as last resort)
- Details: Replace the `nodes/proxy` permission in RBAC with the fine-grained permissions required by the workload, such as `nodes/metrics` and `nodes/pods`, then confirm that requests to the kubelet succeed without authorization errors.
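
A minimal sketch of what such a replacement could look like, assuming the workload only needs read access to the metrics and pods endpoints. The role name `kubelet-fine-grained-reader` and the helper `fine_grained_role_manifest` are placeholders, not names defined by this KEP.

```sh
# Hypothetical helper: prints a ClusterRole granting read access to two
# fine-grained kubelet subresources instead of the broad proxy permission.
# The role name is a placeholder; list only the subresources the workload
# actually needs.
fine_grained_role_manifest() {
  cat <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubelet-fine-grained-reader
rules:
- apiGroups: [""]
  resources: ["nodes/metrics", "nodes/pods"]
  verbs: ["get"]
EOF
}

# Usage:
#   fine_grained_role_manifest | kubectl apply -f -
#   kubectl auth can-i get nodes --subresource=metrics \
#     --as=system:serviceaccount:<namespace>:<service-account>
```

The role still has to be bound to the workload's service account with a ClusterRoleBinding before the `can-i` check or the kubelet requests can succeed.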

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Expand All @@ -858,6 +904,8 @@ These goals will help you determine what you need to measure (SLIs) in the next
question.
-->

Same SLOs as the kubelet API currently offers.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

<!--
Expand All @@ -871,13 +919,17 @@ Pick one more of these and delete the rest.
- [ ] Other (treat as last resort)
- Details:

Same SLIs as the kubelet API currently offers.

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

<!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->

No.

### Dependencies

<!--
Expand All @@ -901,6 +953,8 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
- Impact of its degraded performance or high-error rates on the feature:
-->

This feature only comes into play if the kubelet authorization mode is set to Webhook.
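
One way to confirm the mode on a running node is to read the kubelet's live configuration. A minimal sketch, assuming the kubelet `configz` endpoint is reachable through the API server and returns compact JSON containing an `"authorization":{"mode":"..."}` object. The helper name `kubelet_authz_mode` is hypothetical, and the text matching is a simplification that would break on pretty-printed JSON.

```sh
# Hypothetical helper: reads a kubelet configz JSON document on stdin and
# prints the configured authorization mode (for example "Webhook").
# Assumes compact JSON, i.e. no whitespace between keys and values.
kubelet_authz_mode() {
  grep -o '"authorization":{[^}]*' |
    grep -o '"mode":"[^"]*"' |
    head -n 1 |
    cut -d'"' -f4
}

# Usage (requires permission to proxy to the node):
#   kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | kubelet_authz_mode
```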

### Scalability

<!--
Expand Down Expand Up @@ -1024,6 +1078,9 @@ details). For now, we leave it here.

###### How does this feature react if the API server and/or etcd is unavailable?

No different from how the kubelet behaves without this feature. If kube-apiserver
is unavailable, any SubjectAccessReview (SAR) from the kubelet will fail.

###### What are other known failure modes?

<!--
Expand All @@ -1039,8 +1096,22 @@ For each of them, fill in the following information by copying the below templat
- Testing: Are there any tests for failure mode? If not, describe why.
-->

If requests to the kubelet API start failing due to authorization issues, users can
disable the feature gate.

Users can check the Kubernetes audit logs for SubjectAccessReview requests
created by `system:nodes:*` and check the reason they failed.
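
A minimal sketch of such a search, assuming the audit log backend writes one compact JSON event per line and that kubelet requests carry the `system:nodes` group. The helper name `node_sar_audit_events` is hypothetical. The failure reason is only present in the event if the audit policy records response bodies for these requests.

```sh
# Hypothetical helper: reads audit events (one JSON object per line) on
# stdin and prints those that record a SubjectAccessReview created by a
# member of the system:nodes group.
# Assumes the compact JSON encoding, i.e. no spaces after ':'.
node_sar_audit_events() {
  grep -F '"resource":"subjectaccessreviews"' |
    grep -F '"verb":"create"' |
    grep -F '"system:nodes"'
}

# Usage (the log path depends on the --audit-log-path setting):
#   node_sar_audit_events < /var/log/kubernetes/audit.log
```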

###### What steps should be taken if SLOs are not being met to determine the problem?

1. Check that the feature gate is enabled in kube-apiserver and kubelet.
2. Check that the workload has the right permissions. Requests are expected to
   fail if you are using fine-grained subresources but the feature gate is not enabled
   in kubelet.
3. Check the audit logs for SubjectAccessReview requests created by `system:nodes:*`
   and check the reason these requests failed.
4. Check the kubelet logs.

## Implementation History

<!--
Expand All @@ -1054,6 +1125,10 @@ Major milestones might include:
- when the KEP was retired or superseded
-->

- 2024-09-28: [KEP-2862](https://github.com/kubernetes/enhancements/pull/4760) merged as implementable and PRR approved for ALPHA.
- 2024-10-17: Alpha Code implementation [PR](https://github.com/kubernetes/kubernetes/pull/126347) merged.
- 2024-10-22: Alpha Documentation [PR](https://github.com/kubernetes/website/pull/48412) merged.

## Drawbacks

<!--
5 changes: 3 additions & 2 deletions keps/sig-node/2862-fine-grained-kubelet-authz/kep.yaml
Expand Up @@ -19,16 +19,17 @@ see-also:
replaces:

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.32"
latest-milestone: "v1.33"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v1.32"
beta: "v1.33"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled