Adds Health gRPC Server and Refactors Main() #148

Merged
merged 2 commits into from
Jan 10, 2025

Conversation

danehans
Contributor

@danehans danehans commented Jan 4, 2025

Adds a health gRPC server and refactors main() for better lifecycle management:

  • Introduced a health gRPC server to handle liveness and readiness probes.
  • Refactored main() to manage server goroutines using sync.WaitGroup.
  • Added graceful shutdown for servers and controller manager.
  • Improved logging consistency and ensured datastore readiness checks.

Fixes #96
Fixes #175

@k8s-ci-robot k8s-ci-robot requested a review from ahg-g January 4, 2025 01:02
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 4, 2025
@k8s-ci-robot k8s-ci-robot requested a review from kfswain January 4, 2025 01:02
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 4, 2025
@kfswain
Collaborator

kfswain commented Jan 6, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 6, 2025
}
ready = true
return false
})
Contributor

At startup, I think we want to ensure that the extension has synced with the API server and fetched the models, but readiness should not depend on at least one model being defined.

Contributor Author

The health probe now uses a client to check the API server for the configured InferencePool and that at least one InferenceModel exists in the same namespace. Should this probe also check that at least one InferenceModel references the configured InferencePool?

Contributor

I don't think that the health check needs to block on at least one InferenceModel. On the other hand, since the extension is currently 1:1 with InferencePool, I think it makes sense to ensure that the extension successfully initialized the assigned InferencePool.

Contributor Author

> I think it makes sense to ensure that the extension successfully initialized the assigned InferencePool.

That is the approach I took in the initial PR: check whether the InferencePool is nil in the data store. It also checked that at least one InferenceModel referencing the configured InferencePool was stored, but that check can be removed.

Contributor

Oh, sorry, I missed that. I agree with the original approach! I would still remove the check on InferenceModel.
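
The readiness rule agreed on in this thread can be sketched as below. This is a hedged, dependency-free illustration rather than the code that was merged: `datastore`, `inferencePool`, `check`, and the `servingStatus` type are hypothetical stand-ins, with `servingStatus` only mirroring the SERVING / NOT_SERVING values of the standard gRPC health checking protocol.

```go
package main

import (
	"fmt"
	"sync"
)

// servingStatus is a local stand-in for the gRPC health protocol's
// SERVING / NOT_SERVING values, so the sketch needs no dependencies.
type servingStatus int

const (
	notServing servingStatus = iota
	serving
)

func (s servingStatus) String() string {
	if s == serving {
		return "SERVING"
	}
	return "NOT_SERVING"
}

// inferencePool is a hypothetical, trimmed-down pool object.
type inferencePool struct{ name string }

// datastore is a hypothetical cache that a reconciler fills in once it
// has synced the configured InferencePool from the API server.
type datastore struct {
	mu   sync.RWMutex
	pool *inferencePool
}

func (d *datastore) setPool(p *inferencePool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.pool = p
}

func (d *datastore) hasPool() bool {
	d.mu.RLock()
	defer d.mu.RUnlock()
	return d.pool != nil
}

// check reports serving only once the pool is present in the datastore.
// It deliberately does not look at InferenceModels, per the discussion.
func check(d *datastore) servingStatus {
	if !d.hasPool() {
		return notServing
	}
	return serving
}

func main() {
	ds := &datastore{}
	fmt.Println(check(ds)) // pool not synced yet

	ds.setPool(&inferencePool{name: "example-pool"})
	fmt.Println(check(ds)) // pool synced
}
```

Reading the pool under a lock matters because the probe handler and the reconciler would run on different goroutines.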

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 7, 2025
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jan 9, 2025
@danehans danehans requested a review from ahg-g January 9, 2025 16:04
@danehans danehans changed the title Adds Health gRPC Server and Refactor Main() Adds Health gRPC Server and Refactors Main() Jan 9, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 10, 2025
@danehans danehans mentioned this pull request Jan 10, 2025
12 tasks
- Introduced a health gRPC server to handle liveness and readiness probes.
- Refactored main() to manage server goroutines.
- Added graceful shutdown for servers and controller manager.
- Improved logging consistency and ensured datastore readiness checks.
- Validates CLI flags.

Signed-off-by: Daneyon Hansen <[email protected]>
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 10, 2025
@@ -124,7 +106,7 @@ func main() {
},
Record: mgr.GetEventRecorderFor("InferencePool"),
}).SetupWithManager(mgr); err != nil {
- klog.Error(err, "Error setting up InferencePoolReconciler")
+ klog.Fatalf("Failed setting up InferencePoolReconciler: %v", err)
Contributor Author

Note the switch to Fatalf in several places where an error is critical to the extension's startup.

@danehans danehans requested a review from ahg-g January 10, 2025 22:12
@ahg-g
Contributor

ahg-g commented Jan 10, 2025

/lgtm
/approve

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 10, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, danehans

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 10, 2025
@k8s-ci-robot k8s-ci-robot merged commit 1b1d139 into kubernetes-sigs:main Jan 10, 2025
4 checks passed
@danehans danehans deleted the issue_96 branch January 10, 2025 23:28
kfswain pushed a commit to kfswain/llm-instance-gateway that referenced this pull request Jan 15, 2025
* Adds health gRPC server and refactors main()

- Introduced a health gRPC server to handle liveness and readiness probes.
- Refactored main() to manage server goroutines.
- Added graceful shutdown for servers and controller manager.
- Improved logging consistency and ensured datastore readiness checks.
- Validates CLI flags.

Signed-off-by: Daneyon Hansen <[email protected]>

* Refactors health server to use data store

Signed-off-by: Daneyon Hansen <[email protected]>

---------

Signed-off-by: Daneyon Hansen <[email protected]>
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
lgtm: "Looks good to me", indicates that a PR is ready to be merged.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Validate EPP Flags
Add a liveness/readiness endpoint to the extension
4 participants