Build: build process polling healthcheck #11870
Comments
I think this is the hardest part to get right. What would you check on the builder to know it's still running? I don't remember what state the builder is in once the build has stopped, or whether that's something we can check for. Would it be enough to check for a running Docker container? Should we run the async process from inside the Docker container that runs the build process? IIRC, we run
I'm describing the build process simply emitting this call to the API. The presence of the call is all the application needs to consider the healthcheck satisfied. It's just a canary check, we don't check anything specific. If the builder goes away or dies, this healthcheck will fail. There might be more cases to cover here though -- long builds or a critical process being OOM killed? But at the very least this seems like it wouldn't handle any cases worse
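To make the canary idea concrete, here is a minimal sketch of a builder-side heartbeat. The `Heartbeat` class, the `ping` callable, and the interval are all hypothetical; in practice `ping` would be whatever call the build process makes to the API. A thread is used only to keep the example short, and whether this should be a thread or a separate process is left open.

```python
import threading
import time


class Heartbeat:
    """Call `ping` every `interval` seconds until stopped (hypothetical helper)."""

    def __init__(self, ping, interval=30.0):
        self._ping = ping          # assumption: a callable that reports to the API
        self._interval = interval  # assumption: seconds between reports
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait() returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            try:
                self._ping()
            except Exception:
                # A failed report must never break the build itself.
                pass

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


# Usage: record timestamps instead of calling a real API.
pings = []
heartbeat = Heartbeat(lambda: pings.append(time.monotonic()), interval=0.05)
heartbeat.start()
time.sleep(0.3)  # stand-in for the actual build
heartbeat.stop()
print(f"sent {len(pings)} pings")
```

Because the report carries no payload beyond "still here", the server only has to remember when it last heard from each build.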
Yeah exactly 👍
In that case, the check that hits the API should run inside the Docker container as a separate process. We should call our readthedocs.org/readthedocs/doc_builder/environments.py Lines 815 to 820 in 882ebdb
When a build process goes unresponsive, we currently wait a period of time before marking the build as timed out. This protects against builds that have an excessive build time and are never going to finish, but it also protects against scaling group events that cause instances to suddenly disappear without terminating their builds.
However, when builds do suddenly terminate, we have to wait for hours for the builds to finally be marked as terminated. This is not a great UX, and it also affects our scaling metrics for hours. In the case of a mass instance termination event, this can break our ASG scaling until the builds are terminated.
Instead of using a timeout approach, we could use a healthcheck poll in each build process:
There are probably some side effects to consider and plan around:
This would reduce the timeout window from hours to a few minutes and would help keep ghost builds from affecting scaling group scaling.
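As a rough illustration of the application side, the sketch below flags builds whose last healthcheck is older than a cutoff. The function name, the `last_seen` mapping, and the five-minute default are assumptions made for the example, not existing Read the Docs code.

```python
from datetime import datetime, timedelta, timezone


def find_stale_builds(last_seen, now, max_silence=timedelta(minutes=5)):
    """Return ids of builds whose last healthcheck is older than `max_silence`."""
    return sorted(
        build_id
        for build_id, seen in last_seen.items()
        if now - seen > max_silence
    )


# Hypothetical data: build id -> time of the last healthcheck received.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = {
    101: now - timedelta(seconds=40),  # still reporting
    102: now - timedelta(minutes=12),  # instance has probably gone away
}
print(find_stale_builds(last_seen, now))  # [102]
```

A periodic task could run a check like this and mark the returned builds as terminated, instead of waiting for the full build timeout to expire.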