Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build: build process polling healthcheck #11870

Open
agjohnson opened this issue Dec 23, 2024 · 3 comments
Open

Build: build process polling healthcheck #11870

agjohnson opened this issue Dec 23, 2024 · 3 comments
Labels
Needed: design decision A core team decision is required

Comments

@agjohnson
Copy link
Contributor

agjohnson commented Dec 23, 2024

When a build process goes unresponsive, we currently wait a period of time before calling the build timed out. This protects against builds that have an excessive build time and are never going to finish, but it also protects against scaling group events that cause instances to suddenly disappear and also don't terminate builds.

However, when builds do suddenly terminate, we have to wait for hours for the builds to finally be marked as terminated. This is a not great UX, but also affects our scaling metrics for hours. In the case of a mass instance termination event, this can break our ASG scaling until the builds are terminated.

Instead of using a timeout approach, we could instead use a healthcheck poll in each build process:

  • Async process in each build task starts up and polls a build healthcheck API once per minute
  • If a build hasn't had one healthcheck for 5 minutes, the build process is likely dead
  • The build is marked as finished/aborted

There are probably some side effects to consider and plan around:

  • Healthcheck fails to post to the API but the build doesn't fail. This seems unlikely
  • CPU usage could delay a health check. This is maybe more likely, but a wider window should solve this
  • ??

This would reduce the timeout window from hours to a few minutes and would help avoid ghost builds from affecting scaling group scaling.

@agjohnson agjohnson added the Needed: design decision A core team decision is required label Dec 23, 2024
@humitos
Copy link
Member

humitos commented Jan 2, 2025

Async process in each build task starts up and polls a build healthcheck API once per minute

I think this is the hardest part to do it correctly. What would you check for from the builder to know it's still running? I don't remember what is the builder state when the build has stopped and we can check for. Would it be enough to check for a running Docker container? Should we run the async process from inside the Docker container that runs the build process?

IIRC, we run sleep 3600 as the initial Docker container command and that's the time limit the build has available to run. So, should we run python healthcheck.py instead that runs forever sending healthcheck to web server API? In that case, if the Docker container fails for any reason, we will stop hitting the API and we can consider it dead.

@agjohnson
Copy link
Contributor Author

What would you check for from the builder to know it's still running?

I'm describing the build process just emitting this call to the API. Just the presence of this healthcheck is all that the application needs to satisfy the healthcheck.

It's just a canary check, we don't check anything specific. If the builder goes away or dies, this health check will fail.

There might be more cases to cover here though -- long builds or a critical process being OOM killed? But at very least this seems like it wouldn't handle any cases worse

So, should we run python healthcheck.py instead that runs forever sending healthcheck to web server API?

Yeah exactly 👍

@humitos
Copy link
Member

humitos commented Jan 7, 2025

In that case, the check that hits the API should run inside the Docker container as a separate process.

We should call our healthcheck.py script immediately after creating the container at

command=(
'/bin/sh -c "sleep {time}; exit {exit}"'.format(
time=self.container_time_limit,
exit=DOCKER_TIMEOUT_EXIT_CODE,
)
),

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needed: design decision A core team decision is required
Projects
None yet
Development

No branches or pull requests

2 participants