[TRI-4519] Slow start times in Cloud #1685

Open · 1 of 3 tasks
matt-aitken opened this issue Feb 9, 2025 · 1 comment

matt-aitken (Member) commented Feb 9, 2025

Start times in the Cloud service have been slower than normal for the past few days.

This is caused by a combination of slow Docker image pull times and a large increase in executing runs. Before we can execute your code, we need to pull its image onto the server that will run it.

The pull times are slower for two reasons:

  1. The DigitalOcean Container Registry has been slower than usual. We're trying to get to the bottom of why.
  2. Our container caching system doesn't work well now that we have a lot of worker servers. It relied on local caching on each server, which gives very fast results when it hits. However, there's limited disk space on the servers executing your code, so as the service has become more popular the cache hit ratio has gone down. Growth over the past month has been significant enough that this cache is now only effective for customers doing a high volume of runs (so their images remain in the cache). See the sketch after this list for the fast/slow path involved.
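
For context on what a cache hit or miss looks like on a worker, here is a minimal TypeScript sketch of the fast/slow path. The `hasLocalImage` and `ensureImage` helpers are hypothetical and only illustrate the lookup flow, not our actual worker code.

```ts
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const exec = promisify(execFile);

// Hypothetical helper: a cache hit means the image already exists on this
// worker's local Docker daemon, so no registry pull is needed.
async function hasLocalImage(imageRef: string): Promise<boolean> {
  try {
    await exec("docker", ["image", "inspect", imageRef]);
    return true;
  } catch {
    return false;
  }
}

// Make sure the deploy image is present before starting a run. A cache miss
// falls back to a registry pull, which is the slow path described above.
async function ensureImage(imageRef: string): Promise<void> {
  if (await hasLocalImage(imageRef)) return; // fast path: local disk cache hit
  await exec("docker", ["pull", imageRef]); // slow path: pull from the registry
}
```

When worker disk space is limited and the set of distinct deploy images grows, more runs take the `docker pull` branch, which is exactly the drop in hit ratio described above.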

Solutions

  • Launch a cluster-wide Docker image cache. We will cache Docker images on dedicated servers with large disks inside the cluster. This will scale well with more worker servers and concurrent runs/orgs. After a deploy image has been pulled once (on the first run), it will stay cached as long as runs keep happening occasionally (e.g. a few per day). It will be a Least Recently Used (LRU) cache; see the sketch after this list. We expect this to get run start times back to what they were a couple of weeks ago (a few seconds on average).
  • Run Engine 2 will ship later this month, which will bring warm starts. This will improve start times when a run has just finished and we can grab a queued run from the same deployment. Expect start times of 1 second or less in this situation. It will not speed up cold start times. This optimization is beneficial when you have executing runs AND runs that are queued, i.e. you're doing a lot of runs.
  • Super fast cold starts using CPU/memory snapshot/restore. We already use this technology at scale for our wait functions. It will work by snapshotting the running container as part of the deploy, just before it would start executing, then doing a fast restore of that snapshot for every run (even the very first run after a deploy). This will give consistently fast start times, but it is significantly more complex and won't ship until at least April.
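
To illustrate the LRU policy from the first bullet, here is a minimal sketch of a disk-budgeted LRU keyed by image reference. The `LruImageCache` class and its byte-based capacity are assumptions for illustration only; the real cluster-wide cache is a dedicated service, not this in-memory structure.

```ts
// Illustrative LRU keyed by image reference, with eviction driven by a disk
// budget (in bytes) rather than an entry count.
class LruImageCache {
  private entries = new Map<string, number>(); // imageRef -> size in bytes
  private usedBytes = 0;

  constructor(private readonly capacityBytes: number) {}

  // Record a hit: re-insert so the Map's iteration order reflects recency.
  touch(imageRef: string): boolean {
    const size = this.entries.get(imageRef);
    if (size === undefined) return false;
    this.entries.delete(imageRef);
    this.entries.set(imageRef, size);
    return true;
  }

  // Add an image, evicting least recently used images until the new one fits.
  add(imageRef: string, sizeBytes: number): void {
    if (this.touch(imageRef)) return; // already cached
    while (this.usedBytes + sizeBytes > this.capacityBytes && this.entries.size > 0) {
      const [oldestRef, oldestSize] = this.entries.entries().next().value!;
      this.entries.delete(oldestRef);
      this.usedBytes -= oldestSize;
    }
    this.entries.set(imageRef, sizeBytes);
    this.usedBytes += sizeBytes;
  }
}
```

A JavaScript `Map` preserves insertion order, so re-inserting on every hit keeps the least recently used images at the front, where eviction removes them first; images pulled even a few times per day keep getting refreshed and stay cached.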

From SyncLinear.com | TRI-4519

matt-aitken changed the title from "Slow start times in Cloud" to "[TRI-4519] Slow start times in Cloud" on Feb 9, 2025
matt-aitken (Member, Author) commented

The Docker image cache has been live for several days now and has made a dramatic difference to pull times for run starts.
