Build: build process polling healthcheck #11870

agjohnson · 2024-12-23T18:32:10Z

When a build process goes unresponsive, we currently wait a fixed period of time before marking the build as timed out. This protects against builds that run excessively long and are never going to finish, and also against scaling group events that make instances suddenly disappear without terminating their builds.

However, when builds do suddenly terminate, we have to wait hours for them to finally be marked as terminated. This is not a great UX, and it also skews our scaling metrics for hours. In the case of a mass instance termination event, this can break our ASG scaling until the builds are terminated.

Instead of relying on a timeout, we could use a healthcheck poll in each build process:

- Async process in each build task starts up and polls a build healthcheck API once per minute
- If a build hasn't had a healthcheck in 5 minutes, the build process is likely dead
- The build is marked as finished/aborted
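
The server side of these steps could look roughly like the sketch below. This is not Read the Docs code: the `Build` dataclass, its `last_healthcheck` field, the state strings, and both function names are hypothetical, and the 5-minute window is just the figure from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Window taken from the proposal above; the right value is still open.
# Timestamps are assumed to be timezone-aware (e.g. timezone.utc).
HEALTHCHECK_WINDOW = timedelta(minutes=5)


@dataclass
class Build:
    """Hypothetical stand-in for a build record."""

    id: int
    state: str  # hypothetical values: "building" or "finished"
    last_healthcheck: datetime


def find_dead_builds(builds, now, window=HEALTHCHECK_WINDOW):
    """Return running builds whose last healthcheck is older than `window`."""
    return [
        build
        for build in builds
        if build.state == "building" and now - build.last_healthcheck > window
    ]


def reap_dead_builds(builds, now, window=HEALTHCHECK_WINDOW):
    """Mark stale builds as finished and return the ids that were reaped."""
    dead = find_dead_builds(builds, now, window)
    for build in dead:
        build.state = "finished"  # the proposal says "finished/aborted"
    return [build.id for build in dead]
```

Something like `reap_dead_builds` would need to run periodically, for example from a scheduled task; how often it runs adds to the worst-case detection delay on top of the window itself.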

There are probably some side effects to consider and plan around:

- Healthcheck fails to post to the API but the build doesn't fail. This seems unlikely.
- CPU usage could delay a healthcheck. This is maybe more likely, but a wider window should solve this.
- ??

This would reduce the timeout window from hours to a few minutes and would help keep ghost builds from affecting scaling group scaling.

humitos · 2025-01-02T15:30:37Z

> Async process in each build task starts up and polls a build healthcheck API once per minute

I think this is the hardest part to do correctly. What would you check for from the builder to know it's still running? I don't remember what state the builder is in once the build has stopped, or whether that is something we can check for. Would it be enough to check for a running Docker container? Should we run the async process from inside the Docker container that runs the build process?

IIRC, we run sleep 3600 as the initial Docker container command, and that's the time limit the build has available to run. So, should we instead run python healthcheck.py, which runs forever sending healthchecks to the web server API? In that case, if the Docker container fails for any reason, we will stop hitting the API and we can consider it dead.

agjohnson · 2025-01-02T17:56:55Z

> What would you check for from the builder to know it's still running?

I'm describing the build process simply emitting this call to the API. The presence of the call is all the application needs in order to consider the healthcheck satisfied.

It's just a canary check; we don't check anything specific. If the builder goes away or dies, the calls stop arriving and the healthcheck fails.
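
As a rough illustration of the canary idea, here is a minimal sketch of a background thread that does nothing except call a `send` function on a fixed interval. The class name, the `send` callable, and the choice of a thread (rather than a separate process) are all assumptions made for the example, not a description of the builder code.

```python
import threading


class HealthcheckCanary:
    """Call `send()` every `interval` seconds from a daemon thread until stopped."""

    def __init__(self, send, interval=60.0):
        self._send = send  # hypothetical callable that pings the healthcheck API
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            try:
                self._send()
            except Exception:
                # A failed ping must never fail the build; the next beat retries.
                pass
            # wait() returns early as soon as stop() sets the event.
            self._stop.wait(self._interval)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

The build task would call `start()` when it begins and `stop()` when it finishes normally. If the process or instance disappears, the thread disappears with it and the pings stop, which is the whole signal.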

There might be more cases to cover here though: long builds, or a critical process being OOM killed? But at the very least, this doesn't seem like it would handle any case worse than the current timeout does.

> So, should we instead run python healthcheck.py, which runs forever sending healthchecks to the web server API?

Yeah exactly 👍

humitos · 2025-01-07T14:30:26Z

In that case, the check that hits the API should run inside the Docker container as a separate process.
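
To make the discussion concrete, a healthcheck.py along these lines might be enough. This is only a sketch under assumptions: the `HEALTHCHECK_URL` environment variable, the idea that the URL already identifies the build, the empty POST body, and the 10-second request timeout are all made up for the example, and it assumes Python is available inside the build image.

```python
"""healthcheck.py: hypothetical canary script run inside the build container."""
import os
import time
import urllib.request


def build_request(url):
    """Build an empty-bodied POST; the server only needs to see that it arrived."""
    return urllib.request.Request(url, data=b"", method="POST")


def run(url, interval=60.0, max_beats=None,
        opener=urllib.request.urlopen, sleep=time.sleep):
    """POST to `url` every `interval` seconds; `max_beats=None` means forever."""
    beats = 0
    while max_beats is None or beats < max_beats:
        try:
            opener(build_request(url), timeout=10).close()
        except OSError:
            # URLError and HTTPError are OSError subclasses. Ignore them:
            # a missed beat is absorbed by the server-side window.
            pass
        beats += 1
        sleep(interval)
    return beats


if __name__ == "__main__":
    # HEALTHCHECK_URL is a hypothetical variable the builder would set,
    # pointing at an endpoint that already identifies the build.
    target = os.environ.get("HEALTHCHECK_URL")
    if target:
        run(target)
```

The `opener`, `sleep`, and `max_beats` parameters exist only so the loop can be exercised without a network or a real delay; a production script would not need them.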

We should call our healthcheck.py script immediately after creating the container at

readthedocs.org/readthedocs/doc_builder/environments.py, lines 815 to 820 in 882ebdb:

    command=(
        '/bin/sh -c "sleep {time}; exit {exit}"'.format(
            time=self.container_time_limit,
            exit=DOCKER_TIMEOUT_EXIT_CODE,
        )
    ),
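
One way the container command could change is to start the healthcheck in the background and keep the existing sleep/exit timer as it is. The helper below is a hypothetical sketch: `container_command` is not an existing function, and both the `python healthcheck.py` invocation and the script's location inside the image are assumptions.

```python
def container_command(time_limit, timeout_exit_code,
                      healthcheck="python healthcheck.py"):
    """Start the healthcheck in the background, then run the usual sleep/exit timer."""
    # In sh, `a & b; c` backgrounds `a`, runs `b` in the foreground, then runs `c`.
    return '/bin/sh -c "{healthcheck} & sleep {time}; exit {exit}"'.format(
        healthcheck=healthcheck,
        time=time_limit,
        exit=timeout_exit_code,
    )
```

In `environments.py` this would be called with `self.container_time_limit` and `DOCKER_TIMEOUT_EXIT_CODE`. Because the healthcheck is a child of the container's initial shell, it stops when the container stops, which is the behavior described above.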

agjohnson added the Needed: design decision A core team decision is required label Dec 23, 2024
