DISCLAIMER: This infrastructure was put together in late 2020, when many DVC features around experiments had not yet been introduced. As such, some parts of it are likely obsolete and can be replaced with much simpler calls to DVC itself (e.g. `dvc exp branch` and `dvc exp push`). The `utility-scripts` folder in particular can probably be thrown in the trash.
- Write a CloudFormation template with an ECR repo, Batch job definition, and any other AWS resources needed for the new project. See `cloudformation.example-project.yml` for an example.
  - The ECR repo should be called `acme/$REPO_NAME`, where `$REPO_NAME` is the name of the Bitbucket repo.
  - Similarly, the job definition should be called `acme-${Environment}-ml-${REPO_NAME}`. Make sure to specify sane defaults for the compute resources.
- Deploy the CloudFormation stack.
- Give the acme-machine Bitbucket user write permission on the new repo.
- Create a new webhook for the repo:
  - URL: the `/bitbucket-webhook` path of the API deployed as part of this repo's production CloudFormation stack
  - Enable request history collection. It'll be useful for the next step.
  - Triggers: "Repository push" should be fine for most use cases.
- Bitbucket should now fire off a "hello world" request to the API (click View Requests). It'll be denied because the webhook is not authorized. Open the request details and get the value of the "X-Hook-UUID" header of the request.
- Go to AWS SecretsManager and add the webhook UUID to the list of authorized UUIDs in the "acme/production/bitbucket-webhook-uuids" secret.
- Wait a few minutes for AWS to dispose of the old Lambda container (otherwise it'll have the old list of authorized webhooks).
- Hit retry or push a new commit and watch the build happen.
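The "Deploy the CloudFormation stack" step can be done from the AWS CLI. This is a sketch; the template filename, stack name, and parameter names here are illustrative and should be adapted to your project (the stack name below just mirrors the `acme-${Environment}-ml-${REPO_NAME}` job definition convention):

```shell
# Hypothetical deploy command -- adjust names to match your template.
aws cloudformation deploy \
  --template-file cloudformation.my-new-project.yml \
  --stack-name acme-production-ml-my-new-project \
  --parameter-overrides Environment=production \
  --capabilities CAPABILITY_NAMED_IAM
```

`--capabilities CAPABILITY_NAMED_IAM` is only needed if the template creates named IAM resources; drop it otherwise.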
To trigger a build in the cloud, make and push a commit with one or more of the below keywords in the message. Each keyword must be surrounded by colons (:) and be at the start of its own line. For more information on specifying resource requirements, see the AWS Batch documentation.
| Keyword | Arguments | Description |
|---|---|---|
| `ACME_RUN` | OPTIONAL | Triggers the job, and passes arguments to `dvc repro` |
| `ACME_RUN_EXP` | NO | Run `dvc exp run` instead of `dvc repro` |
| `ACME_GPUS` | OPTIONAL | The number of GPUs for the job, or 1 if no argument is provided |
| `ACME_MEMORY` | YES | The amount of RAM (in MiB) for the job. Requires an argument |
| `ACME_VCPUS` | YES | The number of vCPUs for the job. Requires an argument |
Trigger a job for a single pipeline stage with no GPUs but a lot of memory:

```
Added new features to data

:ACME_RUN: --single-item featurize_data
:ACME_GPUS: 0
:ACME_MEMORY: 32768
```
Trigger an experiment with 1 GPU (the default if no argument is passed):

```
Hyperparameter scanning experiment

:ACME_RUN: --params learning_rate=0.01
:ACME_RUN_EXP:
:ACME_GPUS:
```
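The first example above can be authored and pushed from the command line. This is a sketch; the `--allow-empty` flag is only there so the snippet works without staged changes, and the stage name `featurize_data` is taken from the example:

```shell
# Commit with keyword lines in the message body; each keyword must be
# surrounded by colons and start at the beginning of its own line.
git commit --allow-empty -m 'Added new features to data

:ACME_RUN: --single-item featurize_data
:ACME_GPUS: 0
:ACME_MEMORY: 32768'
# git push   # pushing fires the Bitbucket webhook, which submits the Batch job
```

Note the quoting: a single `-m` argument with embedded newlines keeps each keyword at the start of its own line, which is required for the webhook to pick them up.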