KEP-3386: Kubelet Evented PLEG for Better Performance #3387
Conversation
### User Stories

- As a cluster administrator I want to enable `Evented PLEG` feature of the kubelet for better performance.
I think we should add one where a cluster admin wants their nodes to perform with as little infrastructure overhead as possible
## Summary

The purpose of this KEP is to outline changes to the Kubelet and Container Runtime Interface (CRI) that update the way the Kubelet monitors changes to pod and image state to an event-driven model, rather than a polling model. Specifically, the Kubelet will listen for [gRPC server streaming](https://grpc.io/docs/what-is-grpc/core-concepts/#server-streaming-rpc) events from the CRI implementation for events required for generating pod lifecycle events, as well as for keeping track of the state of images on the node.
"the Kubelet monitors changes to container and image state"?
## Design Details

### New Kubelet Parameter

A new kubelet parameter, `EventedPleg`, will be introduced that can be set via [a config file](https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/) or as a command line argument (e.g. `--evented-pleg=true`).
why not use a featuregate instead?
Yes, we should use a feature gate here 👍
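For illustration, here is a minimal sketch of how the opt-in might look if it is modeled as a feature gate rather than a dedicated flag, as suggested above. The gate name `EventedPLEG` and the helper below are assumptions for this sketch, not the final kubelet API.

```go
package main

import "fmt"

// featureGates is a stand-in for the kubelet's feature-gate map
// (normally populated from --feature-gates or the kubelet config file).
var featureGates = map[string]bool{
	"EventedPLEG": true, // hypothetical gate name for this sketch
}

// eventedPLEGEnabled reports whether the event-driven PLEG should be used.
// When the gate is off, the kubelet keeps the existing relisting-only PLEG.
func eventedPLEGEnabled() bool {
	return featureGates["EventedPLEG"]
}

func main() {
	if eventedPLEGEnabled() {
		fmt.Println("starting evented PLEG alongside a low-frequency relist")
	} else {
		fmt.Println("falling back to the existing relisting-based PLEG")
	}
}
```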
```protobuf
    CONTAINER_DELETED_EVENT = 3;
}
```

### Image Service Changes
I think we should narrow down the scope for the first iteration to just the container lifecycle events.
###### Does enabling the feature change any default behavior?

This feature does not introduce any user facing changes, although users should notice increased performance of the kubelet after enabling this feature.
increased performance -> reduced overhead of kubelet and the CRI runtime.
## Summary

The purpose of this KEP is to outline changes to the Kubelet and Container Runtime Interface (CRI) that update the way the Kubelet monitors changes to pod and image state to an event-driven model, rather than a polling model. Specifically, the Kubelet will listen for [gRPC server streaming](https://grpc.io/docs/what-is-grpc/core-concepts/#server-streaming-rpc) events from the CRI implementation for events required for generating pod lifecycle events, as well as for keeping track of the state of images on the node.
We can say List/Watch model instead of evented.
### Goals

- Reduce unnecessary work during inactivity (no spec/state changes)
  - In other words, Reduce steady-state CPU usage of Kubelet and CRI implementation by reducing frequent polling.
nit: Reduce -> reduce
### Non-Goals

- Completely elimanitate polling altogether.
typo: eliminate
### Non-Goals

- Completely elimanitate polling altogether.
  - This proposal does not advocate completely removing the polling. We cannot solely rely on the upstream container events due to the possibility of missing events. PLEG should relist reduced frequency to ensure no events are missed.
relist at reduced...
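To make the intended behavior concrete, here is a rough Go sketch of a relist loop whose period is stretched while the event stream is healthy and restored to the current 1s default when it is not. The function names and the stretched period are illustrative assumptions, not the KEP's final design.

```go
package main

import (
	"fmt"
	"time"
)

const (
	defaultRelistPeriod = 1 * time.Second // today's Generic PLEG period
	eventedRelistPeriod = 5 * time.Minute // assumed stretched period while events flow
)

// relistLoop periodically relists containers as a safety net for missed events.
// healthy() reports whether the CRI event stream is currently usable.
func relistLoop(healthy func() bool, relist func(), stop <-chan struct{}) {
	for {
		period := defaultRelistPeriod
		if healthy() {
			period = eventedRelistPeriod
		}
		select {
		case <-time.After(period):
			relist()
		case <-stop:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	go relistLoop(
		func() bool { return true },              // pretend the event stream is healthy
		func() { fmt.Println("relisting pods") }, // stand-in for the real relist
		stop,
	)
	time.Sleep(100 * time.Millisecond)
	close(stop)
}
```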
As the number of pods on a node increases, the amount of time the Kubelet and CRI implementation take to generate and read this list increases linearly. What is needed is a way for the Kubelet to be notified when a container or image changes state in a way it did not trigger.

Luckily, there should only be two such cases, and in normal operation, only one would happen frequently:
nit: Drop Luckily
PLEG is core to the container status handling in the kubelet. Hence any miscalculation there would result in unpredictable behaviour not just for the node but for an entire cluster.

To reduce the risk of regression, this feature will initially be available only as an opt-in. As discussed in the section [New Kubelet Parameter](###New-Kubelet-Parameter), unless this parameter is explicitly set, the kubelet PLEG will continue to function using frequent relisting as it does today. If, after enabling this feature, users notice any discrepancy in container statuses, they can restart the kubelet either omitting that option or setting it to `false` explicitly.
We can ensure that the existing tests function and add any additional tests as needed to mitigate risks.
- CRI Runtime
  - CRI runtimes that are capable of emitting CRI events must be installed and running.
  - Impact of its outage on the feature: Kubelet will update pod statuses with the increased relisting period.
if we do the fallback method described above, this will need updating
  - Detection: If the user notices that the container statuses reported by the kubelet are not consistent with the container statuses reported by the CRI runtime (e.g. using `crictl`), then we are running into the failure of this feature.
  - Mitigations: They will have to disable this feature and open an issue for further investigation.
  - Diagnostics: CRI runtime logs (such as `cri-o` or `containerd`) may not be consistent with the kubelet logs on container statuses.
what about missed events? it's not clear to me grpc does multiplexing, so the CRI will have to keep track of which client it has sent which events to. if crictl gets an events endpoint, and the server doesn't support multiplexing, kubelet may not get an event
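As a rough illustration of the concern above, a CRI implementation serving several event subscribers would need to fan each event out to every open stream rather than hand it to a single client. A minimal sketch of such a broadcaster follows; the types are hypothetical and not any runtime's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// ContainerEvent is a stand-in for the CRI container event message.
type ContainerEvent struct {
	ContainerID string
	Type        string
}

// Broadcaster fans every event out to all subscribers (e.g. the kubelet and a
// debugging client such as crictl), so no single client "consumes" an event.
type Broadcaster struct {
	mu   sync.Mutex
	subs map[chan ContainerEvent]struct{}
}

func NewBroadcaster() *Broadcaster {
	return &Broadcaster{subs: make(map[chan ContainerEvent]struct{})}
}

func (b *Broadcaster) Subscribe() chan ContainerEvent {
	ch := make(chan ContainerEvent, 16)
	b.mu.Lock()
	b.subs[ch] = struct{}{}
	b.mu.Unlock()
	return ch
}

func (b *Broadcaster) Publish(ev ContainerEvent) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for ch := range b.subs {
		select {
		case ch <- ev: // deliver to this subscriber
		default: // subscriber too slow; it must recover via relist
		}
	}
}

func main() {
	b := NewBroadcaster()
	kubelet := b.Subscribe()
	b.Publish(ContainerEvent{ContainerID: "abc123", Type: "CONTAINER_STARTED_EVENT"})
	fmt.Println(<-kubelet)
}
```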
- Any instability with the CRI runtime events stream that results in an error can be detected by the kubelet. Such an error will result in the kubelet falling back to the current default relisting period to make sure the pod statuses are updated in time.
- If the instability is only in the form of degraded performance but does not result in an error, then the kubelet will not fall back to the current default relisting period and will continue to use the CRI runtime events stream. This will result in the kubelet updating the pod statuses with either the CRI runtime events or the increased relisting period, whichever arrives first.

Without a stable stream of CRI events this feature will suffer, and the kubelet will fall back to relisting with the current default relisting period.
How can a cluster-admin know this is happening in the cluster at scale? Is there something exposed which shows a difference between the state anticipated based on the event stream and state produced when the re-list happens? I think that's computable and necessary to know the watch has failed, even if it is a significantly trailing indicator.
We can add an alert when we observe that there is a delta that crosses a threshold, to detect events missed when compared to a relist.
I have added some notes about the `kube_pod_missed_events` metric the kubelet should emit.
the PRR lgtm, looks like the sig still has some review to do, but I can approve just the PRR /approve
/lgtm
/lgtm
/approve
/lgtm
I think the write-up is good to pursue alpha.
As we go into implementation, the exact metric name will need to be revisited.
@vinaykul you may want to track this as well for in-place resize awareness.
- To reduce the risk of regression, this feature initially will be available only as an opt-in.
  - Users can disable this feature to make the kubelet use the existing relisting based PLEG.
- Another risk is that the CRI implementation could have a buggy event emitting system, and miss pod lifecycle events.
  - A mitigation is a `kube_pod_missed_events` metric, which the Kubelet could report when a lifecycle event is registered that wasn't triggered by an event, but rather by changes of state between lists.
we may want to refine this metric name during implementation, but i agree with concept.
we have a metric today that is `pleg_discard_events` for awareness, so it may be worth a pass to see if the name will confuse with existing metrics or not.
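To make the proposed mitigation concrete, here is a minimal sketch of how such a counter could be registered and incremented in the kubelet. The metric name `kube_pod_missed_events` comes from the discussion above, but the registration path and the label are assumptions for illustration only.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// missedEvents counts lifecycle changes that were discovered by a relist
// rather than delivered on the CRI event stream. The name follows the
// discussion above; the label is purely illustrative.
var missedEvents = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "kube_pod_missed_events",
		Help: "Number of pod lifecycle events detected via relist instead of the CRI event stream.",
	},
	[]string{"event_type"},
)

func init() {
	prometheus.MustRegister(missedEvents)
}

func main() {
	// A relist noticed a container that stopped without a corresponding event.
	missedEvents.WithLabelValues("ContainerDied").Inc()
	fmt.Println("recorded one missed ContainerDied event")
}
```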
```protobuf
enum ContainerEventType {
```
the kubelet internal pleg has an event for ContainerChanged and PodSync. we seem to ignore ContainerChanged now in the code base, and as long as any container modification is synchronous and atomic, it is fine. the one use case that we need to be careful with is resource requirement changes. if the update container resources call changes cpu and memory requirements and only cpu is applied but memory cannot be, we may find ourselves in a partial state.
Can we leave it as a future discussion that we may see a need to support `CONTAINER_CHANGED_EVENT` as we work through edge cases encountered with in-place resource resizing or any other interaction that may change the state of a running container?
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: deads2k, derekwaynecarr, harche.
LGTM in general to get started..
issues: I'm not convinced of the benefit of supporting N-different subscriptions per each container/pod to this partial set of PLEG events. Seems more likely that we need one subscription for all events for the k8s.io namespace? The filter id is in error.
You can't combine container and sandbox ids like this. Same on the response messages.. pod lifecycle events should not be overloaded with container lifecycle events.
@derekwaynecarr Yes this is on my radar - supporting pod resize use case is currently a stretch goal but hopefully we can bring it in for beta. I'd like to throw out some of the PLEG changes I currently have.
Although not explicitly mentioned in the KEP, I think it is technically possible that the kubelet subscribes to events for all containers without using any filter, and the kubelet figures out how to update the status for each container. On the container runtime side, the container runtime should only return events in the k8s.io namespace to the `GetContainerEvents` CRI. Does that make sense? Or should namespace be a parameter of this API? @harche @mrunalp please correct me if I am assuming too much here.
Can you elaborate why they cannot be mixed? More specifically, why ...
It does make sense.. we (containerd) can scope our events to just the ones for the k8s.io namespace for this default CRI connection. If we wanted to allow for non-default subs for monitoring other container events outside the ones allocated in the k8s.io namespace this would be the place.. but I'm fine with punting that to some future item or just not doing it that way.
Sure, we support truncated index lookups for containers and pods. The podsandbox id passed in has to be full or if "truncated", has to match one and only one pod. Same for containers. We keep separate lists of sandboxes and containers each with their own index lookup. You can't request status of a pod using the pause container id of the pod.
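As a side note, here is a minimal sketch of the "match one and only one" truncated-ID lookup described above; it is a generic illustration, not containerd's actual index code.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// resolveID returns the single full ID that begins with the given prefix.
// A full ID matches itself; a truncated ID must match exactly one entry,
// otherwise the lookup is rejected as ambiguous or unknown.
func resolveID(ids []string, prefix string) (string, error) {
	var match string
	for _, id := range ids {
		if strings.HasPrefix(id, prefix) {
			if match != "" {
				return "", errors.New("ambiguous id prefix: " + prefix)
			}
			match = id
		}
	}
	if match == "" {
		return "", errors.New("no id matches prefix: " + prefix)
	}
	return match, nil
}

func main() {
	sandboxes := []string{"6e8f0a1b2c3d", "6ea94c5d7f10"}
	id, err := resolveID(sandboxes, "6e")
	fmt.Println(id, err) // ambiguous: both sandboxes share the "6e" prefix
	id, err = resolveID(sandboxes, "6ea")
	fmt.Println(id, err) // resolves to exactly one sandbox
}
```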
nod.. suspected the filter part wasn't necessary... probably more useful to crictl/debugging scenarios.
.. is there a proposal for synchronizing the PLEG event subscription notifications and the macro/sync cycle CRI status verification? Trying to figure out if we also need to add timestamps regarding time of status in the status response for each pod/container, thus allowing one to identify/resolve pleg events with respect to the sync cycle status timings.
There is no other enhancement around evented PLEG apart from this one at this point in time.
- The second, and less likely, case is when another entity comes and changes the state of the node.
  - For container related events (such as a container creating, starting, stopping or being killed), this can appear as a user calling [crictl](https://github.com/kubernetes-sigs/cri-tools/blob/master/docs/crictl.md) manually, or even using the runtime directly.

The Kubelet currently covers each of these cases quite easily: by listing all of the resources on the node, it will have an accurate picture after the amount of time of its [poll interval](https://github.com/kubernetes/kubernetes/blob/release-1.24/pkg/kubelet/kubelet.go#L162). For each of these cases, a new CRI-based events API can be made, using [gRPC server streaming](https://grpc.io/docs/what-is-grpc/core-concepts/#server-streaming-rpc). This way, the entity closest to the activity of the containers and pods (the CRI implementation) can be responsible for informing the Kubelet of their behavior directly.
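For illustration, here is a minimal sketch of what consuming such a server-streaming events RPC from the kubelet side might look like. The message and helper names are placeholders for this sketch, not the final CRI API.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"time"
)

// ContainerEventResponse is a stand-in for the streamed CRI event message.
type ContainerEventResponse struct {
	ContainerID string
	EventType   string
	CreatedAt   int64 // nanosecond timestamp assigned by the runtime
}

// eventStream is a stand-in for the client side of a gRPC server stream.
type eventStream interface {
	Recv() (*ContainerEventResponse, error)
}

// watchContainerEvents reads events until the stream breaks, then returns so
// the caller can force a relist and re-establish the stream.
func watchContainerEvents(ctx context.Context, stream eventStream, events chan<- *ContainerEventResponse) error {
	for {
		ev, err := stream.Recv()
		if err != nil {
			return err // io.EOF or a transport error: fall back to relisting
		}
		select {
		case events <- ev:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// fakeStream emits one canned event and then ends the stream.
type fakeStream struct{ pending []*ContainerEventResponse }

func (f *fakeStream) Recv() (*ContainerEventResponse, error) {
	if len(f.pending) == 0 {
		return nil, io.EOF
	}
	ev := f.pending[0]
	f.pending = f.pending[1:]
	return ev, nil
}

func main() {
	events := make(chan *ContainerEventResponse, 8)
	stream := &fakeStream{pending: []*ContainerEventResponse{
		{ContainerID: "abc", EventType: "CONTAINER_STARTED_EVENT", CreatedAt: time.Now().UnixNano()},
	}}
	err := watchContainerEvents(context.Background(), stream, events)
	log.Printf("stream ended: %v; falling back to relist", err)
	close(events)
	for ev := range events {
		fmt.Println(ev.ContainerID, ev.EventType)
	}
}
```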
When a gRPC connection is closed (due to a kubelet failure), how does the Kubelet "catch up" to the latest events?
I would assume that we would have to "ListContainerStatuses", then start the events channel for all events that occur after the "List". That requires coordination and buffering on the CRI side - a way to find all container statuses newer than the time the List was returned. That ensures that in the event of a Kubelet failure we don't miss any events.
I expected to see it described here how that would be solved, but I'm missing it. Is it because we're assuming the `created_at` attribute of the Container Event can be filtered on? If so, how are we filtering that in the CRI and how is the Kubelet comparing the results of the List call to the start point of the events? CRI `GetContainerEvents` does not have an argument that allows us to get "events after time T".
Can someone point me to where this is described?
The current implementation in Generic PLEG uses `computeEvents`, which is called upon periodic relisting and generates PLEG events based upon the last updated pod status present in the cache. Since, in the case of Evented PLEG, we aim to still re-use the relisting functionality of Generic PLEG (though with relisting at a longer period than the usual 1s), and since the pod status(es) are persisted to a common cache, that func should still be able to generate the events since the last update / gRPC connection failure (which forces a relist as well).
I'd like to see that explicitly described in the KEP so that in the future we can refer to it. Right now I'm not confident in the design, and rather than debate implementation in the review (which might get lost) I'd just want to be sure that the KEP captures "how we expect to reliably ensure ordered delivery of events". Doesn't have to be as detailed as pseudocode, but I want to make sure the expectation is right.
I.e. a section in the KEP:
> The pod worker expects to receive status updates in order, where no events are missed or delivered out of order (receiving a `delete` before a `create`, for instance). We also expect to bound latency observing container updates as part of the motivation for this design. To achieve that, we will: ... . If the Kubelet restarts, or if the container runtime restarts, we expect to ... .
If necessary, just describing it in comments here first or in a proposed change to the KEP would be enough for me to review and then we can avoid iterating too much on implementation.
+1 to what Clayton suggests.
A little bit more on "how we expect to reliably ensure ordered delivery of events": We should be clear on how CRI should buffer container events. For example, if kubelet stops reading from the grpc stream for any reason, we can't expect CRI to buffer the events indefinitely. Some events will be discarded if kubelet stops for too long. Kubelet should be able to handle this case.
Also, on the same front, regarding @swghosh's comment:
> that func should still be able to generate the events since the last update / gRPC connection failure (which forces a relist as well)
Will kubelet need to reconcile the relist result and the container events, because relist always gives the latest status but events could be stale? For example, consider the following:
- time-0: kubelet starts having trouble reading from grpc stream. No container events can be read by kubelet.
- time-1: containerA starts on container runtime. CRI tries to send event {ContainerA starts} but fails. The event is buffered by CRI.
- time-2: containerA stops on container runtime. CRI tries to send event {ContainerA stops} but fails. The event is buffered by CRI.
- time-3: kubelet relists and knows that containerA has stopped.
- time-4: kubelet resumes reading from grpc stream, reads {ContainerA starts}, {ContainerA stops}, and thinks containerA starts again and then stops.
Is the sequence above possible? Maybe this should be discussed on the PR, so I added a comment there as well (https://github.com/kubernetes/kubernetes/pull/111384/files#r974723039)
Thanks @ruiwen-zhao and @smarterclayton for these important points. We experimented with a fix that should address these issues, and will update the KEP and the implementation as soon as possible.
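Purely as an illustration of the problem described above (not the actual fix referenced in the previous comment), one common approach is to timestamp both cached statuses and incoming events and drop any event older than the status already in the cache. Every name in the sketch below is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// cachedStatus is the container status the kubelet last recorded, stamped
// with the runtime's notion of when that state became true.
type cachedStatus struct {
	State     string
	Timestamp time.Time
}

// containerEvent is a CRI event carrying the runtime-assigned timestamp.
type containerEvent struct {
	ContainerID string
	State       string
	Timestamp   time.Time
}

// applyEvent updates the cache only when the event is newer than what a
// relist already recorded, so replayed or stale events cannot roll state back.
func applyEvent(cache map[string]cachedStatus, ev containerEvent) bool {
	cur, ok := cache[ev.ContainerID]
	if ok && !ev.Timestamp.After(cur.Timestamp) {
		return false // stale: a relist (or a later event) already superseded it
	}
	cache[ev.ContainerID] = cachedStatus{State: ev.State, Timestamp: ev.Timestamp}
	return true
}

func main() {
	now := time.Now()
	cache := map[string]cachedStatus{
		// time-3 in the example above: a relist already saw containerA as stopped.
		"containerA": {State: "stopped", Timestamp: now},
	}
	// time-4: the buffered "started" event from time-1 finally arrives.
	applied := applyEvent(cache, containerEvent{"containerA", "started", now.Add(-3 * time.Second)})
	fmt.Println("stale start applied:", applied, "state:", cache["containerA"].State)
}
```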
if it helps... state changes of a container occur at points in time that can easily be recorded into the container runtime state machine of the container and provided along with container status, and container events, ... createdOn, runningAt/unknownAsOf, exitedAt, deletedAt, updatedOn... see possible state transitions as of today: https://github.com/containerd/containerd/blob/main/pkg/cri/store/container/status.go#L31-L61
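A sketch of what carrying those transition timestamps alongside status might look like; the field names loosely follow the containerd status linked above, but this struct is illustrative only.

```go
package main

import (
	"fmt"
	"time"
)

// ContainerStatusTimestamps records when each lifecycle transition happened,
// so a consumer can order events against a point-in-time status snapshot.
// Zero values mean the transition has not occurred yet.
type ContainerStatusTimestamps struct {
	CreatedAt  time.Time
	StartedAt  time.Time
	FinishedAt time.Time
	DeletedAt  time.Time
}

// NewerThan reports whether any recorded transition happened after t, which is
// the kind of comparison a relist/event reconciler would make.
func (s ContainerStatusTimestamps) NewerThan(t time.Time) bool {
	for _, ts := range []time.Time{s.CreatedAt, s.StartedAt, s.FinishedAt, s.DeletedAt} {
		if ts.After(t) {
			return true
		}
	}
	return false
}

func main() {
	now := time.Now()
	st := ContainerStatusTimestamps{CreatedAt: now.Add(-time.Minute), StartedAt: now.Add(-30 * time.Second)}
	fmt.Println("changed since last relist:", st.NewerThan(now.Add(-10*time.Second)))
}
```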
both the kubelet and the container runtime may be restarted, resulting in unexpected/unknown states of shim-managed containers, esp. while the container runtime is "restarting": https://github.com/containerd/containerd/blob/main/pkg/cri/server/restart.go