
mlos_bench_service #732

Open · 7 tasks
bpkroth opened this issue May 10, 2024 · 11 comments

Comments

bpkroth (Contributor) commented May 10, 2024

  • (Storage) APIs to
  • new script (mlos_benchd) to manage those actions (see the sketch below)
    • it would run in a tight loop on the "runner VM(s)"
    • as Experiments become runnable in the queue, it would create an mlos_bench process for each of them and monitor the child process, updating that Experiment's state in the database based on the child process's exit code
  • notifications on errors and/or a monitoring dashboard for Experiment status, interacting mostly with the Storage APIs
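
A minimal sketch of what such an mlos_benchd loop could look like (the Storage helper methods and the exact mlos_bench CLI arguments here are illustrative assumptions, not the actual APIs):

```python
# Hypothetical sketch of the mlos_benchd runner loop.
# storage.claim_next_runnable() and storage.update_experiment_status() are
# assumed helpers for illustration, not the actual mlos_bench Storage APIs.
import subprocess
import time


def mlos_benchd_loop(storage, poll_interval_s: float = 10.0) -> None:
    """Poll the Storage backend and run Experiments as they become runnable."""
    while True:
        experiment = storage.claim_next_runnable()  # atomically marks it "Running"
        if experiment is None:
            time.sleep(poll_interval_s)
            continue
        # Launch mlos_bench as a child process for the claimed Experiment.
        proc = subprocess.Popen(
            ["mlos_bench", "--environment", experiment.root_env_config]
        )
        exit_code = proc.wait()
        # Reflect the child process outcome back into the database.
        new_status = "Succeeded" if exit_code == 0 else "Failed"
        storage.update_experiment_status(experiment.experiment_id, new_status)
```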
bpkroth (Contributor, Author) commented May 10, 2024

@eujing

bpkroth (Contributor, Author) commented May 10, 2024

May want to split some of these tasks out into separate issues later on.

yshady commented Aug 2, 2024

@bpkroth This should be easy. I have examples of interactive notebooks in my internal repo, and the Streamlit app is quite transferable to a notebook experience!

yshady commented Aug 2, 2024

We can follow a similar workflow to the side menu.

yshady commented Aug 2, 2024

The requirement would be that a user can run at least one experiment manually first.

eujing (Contributor) commented Oct 3, 2024

From the MySQL side, we currently have something similar that I can work on generalizing. It is basically a FastAPI app (we have been calling it a "runner") with the following endpoints.

Experiment-related:

  • GET /experiments -> Listing experiments. A combination of docker ps --filter "name=mlos-experiment-" and listing generated experiment config files
  • POST /experiments/start -> JSON body (and associated pydantic model for validation) to create experiment config files, and essentially does docker run {mlos_bench image} with relevant arguments
  • POST /experiments/stop/{experiment_id} -> No body, but essentially does docker stop {image name}

Front-end related, mainly for populating the JSON body to POST /experiments/start:

  • GET /options/mlos_configs -> List CLI config files for use with mlos_bench --config's value.
  • GET /options/benchbase_configs -> List benchbase XML config files
  • GET /options/tunable_groups -> List tunable group names, for selection to include in an experiment
  • GET /options/client_skus -> List available client VM SKUs for the subscription (makes an az API call)
  • GET /options/server_skus -> List available server SKUs (read off a curated CSV file)
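
To make the shape of this concrete, here is a rough sketch of how the experiment endpoints might be wired up in FastAPI (the request-model fields and the docker invocation details are illustrative assumptions, not the actual runner code):

```python
# Illustrative sketch only; field names and the docker invocation are assumptions.
import subprocess

from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()


class StartExperimentRequest(BaseModel):
    """Hypothetical request body for starting an experiment."""
    experiment_id: str
    mlos_config: str  # path to the mlos_bench CLI config file
    tunable_groups: list[str] = Field(default_factory=list)


@app.post("/experiments/start")
def start_experiment(req: StartExperimentRequest) -> dict:
    # Generate the experiment config files, then launch mlos_bench in a container.
    subprocess.run(
        ["docker", "run", "-d", "--name", f"mlos-experiment-{req.experiment_id}",
         "mlos_bench_image", "--config", req.mlos_config],
        check=True,
    )
    return {"experiment_id": req.experiment_id, "status": "started"}


@app.post("/experiments/stop/{experiment_id}")
def stop_experiment(experiment_id: str) -> dict:
    subprocess.run(["docker", "stop", f"mlos-experiment-{experiment_id}"], check=True)
    return {"experiment_id": experiment_id, "status": "stopped"}
```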

We have three Docker images: one for this runner, one for the dashboard that uses it, and one for mlos_bench itself.
The first two are started with docker compose, along with nginx to help with some MSAL auth and HTTPS access to the dashboard.
Mounts for the runner container:

  • The base directory of the repo for our MLOS configs. This is also mounted to each "child" mlos_bench container that is started, including the relevant generated experiment config file.
  • The host's docker socket, to allow management of multiple mlos_bench containers.

yshady commented Oct 3, 2024

I'm really glad you guys are still using the FastAPI work from the summer, and even building it out. Happy to see it!

bpkroth (Contributor, Author) commented Oct 3, 2024

Portions of this make sense, but I'd rather have it do more interaction with the storage backend, particularly on the runner side of things.

Right now there's basically an assumption that POST /experiments/start/{experiment_id} directly invokes a docker run {mlos_bench image}, but that will cause scaling issues, especially if training the model remains local to the mlos_bench process as it is now.

If instead, the POST /experiments/start/{experiment_id} simply changes the state of the Experiment to "Runnable", then any number of Runners polling the storage backend can attempt to run a transaction to grab the Experiment, assign itself to it, and change its state to "Running", and then invoke it. If the transaction fails, it can retry the polling operation and either see that another Runner "won" and started the Experiment, or else that something failed.
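
A minimal sketch of that claim-by-transaction step, assuming a SQL storage backend (the table and column names here are placeholders, not the actual mlos_bench schema):

```python
# Illustrative sketch of a runner atomically claiming a "Runnable" Experiment;
# table/column names are assumptions, not the actual mlos_bench storage schema.
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///mlos_bench.sqlite")  # placeholder connection


def try_claim_experiment(runner_id: str, exp_id: str) -> bool:
    """Return True if this runner won the race to run the given Experiment."""
    with engine.begin() as conn:  # single transaction
        result = conn.execute(
            text(
                "UPDATE experiment "
                "SET status = 'Running', runner_id = :runner_id "
                "WHERE exp_id = :exp_id AND status = 'Runnable'"
            ),
            {"runner_id": runner_id, "exp_id": exp_id},
        )
        # rowcount == 0 means another runner already grabbed it (or the update failed).
        return result.rowcount == 1
```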

With that change, all of the REST operations can happen on the frontend(s) (which can also be more than one), and all of the execution operations can happen elsewhere.

The Storage layer becomes the only source of truth and everything else can scale by communicating with it.

Also note that the frontends could continue to be notebooks in this case as well.

It basically frees us to implement different things in the web UI (#838).

Does it make sense?

yshady commented Oct 3, 2024

"Also note that the frontends could continue to be notebooks in this case as well."

This is very true. Most of the code can be directly used in a notebook in the same way. I prefer a clean frontend, but I know some people who are super hacky and science-y will stick to notebooks. We've discussed this a lot this summer.

Anyways, this is great stuff; it made my day that the team(s) are building on my work from the summer. Democratizing autotuning will probably mean turning MLOS into a simple frontend web application, but this is again just my opinion.

eujing (Contributor) commented Oct 3, 2024

> Portions of this make sense, but I'd rather have it do more interaction with the storage backend, particularly on the runner side of things.
>
> Right now there's basically an assumption that POST /experiments/start/{experiment_id} directly invokes a docker run {mlos_bench image}, but that will cause scaling issues, especially if training the model remains local to the mlos_bench process as it is now.
>
> If instead, the POST /experiments/start/{experiment_id} simply changes the state of the Experiment to "Runnable", then any number of Runners polling the storage backend can attempt to run a transaction to grab the Experiment, assign itself to it, and change its state to "Running", and then invoke it. If the transaction fails, it can retry the polling operation and either see that another Runner "won" and started the Experiment, or else that something failed.
>
> With that change, all of the REST operations can happen on the frontend(s) (which can also be more than one), and all of the execution operations can happen elsewhere.
>
> The Storage layer becomes the only source of truth and everything else can scale by communicating with it.

I see, I think I understand. In this case, would we queue up experiments into storage via its API in the "Runnable" state, with a pool of runners treating this as a queue and invoking experiments in some order, for scalability across multiple runners?

This would require us to store all the information needed to execute an experiment into the storage for runners to access. The current schema for ExperimentData has git_repo and git_commit fields, so the data flow of using git to access the configs seems the most direct.

My issue with this is having to push potentially sensitive experiment parameters (usually in the global files) to a git repo in order to run an experiment (e.g. connection string info). Should we consider expanding the ExperimentData schema to serialize this data?

I could imagine a first pass might be adding JSON fields representing the values for mlos_bench --config <config_json> --globals <globals_json> <maybe key/values for other simpler CLI args> (see the sketch below).
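
For illustration, such an extension might look roughly like this in SQLAlchemy (column names and types are assumptions, not the actual mlos_bench schema definitions):

```python
# Sketch of extending the experiment table with JSON config columns;
# names and types are illustrative assumptions, not the real schema.
from sqlalchemy import JSON, Column, DateTime, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Experiment(Base):
    __tablename__ = "experiment"

    exp_id = Column(String(255), primary_key=True)
    root_env_config = Column(String(1024), nullable=False)
    status = Column(String(16), nullable=False, default="Runnable")
    ts_start = Column(DateTime, nullable=True)
    # Proposed additions: serialized CLI option values so runners don't need
    # to pull sensitive globals from a git repo.
    cli_config = Column(JSON, nullable=True)      # value for --config
    cli_globals = Column(JSON, nullable=True)     # value for --globals
    cli_extra_args = Column(JSON, nullable=True)  # other simple CLI args as key/value
```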

bpkroth added a commit that referenced this issue Jan 17, 2025
# Pull Request

## Title

Schema changes for mlos_benchd service.

______________________________________________________________________

## Description

Schema changes for mlos_benchd service.

Storage APIs to adjust these will come in a future PR.

- See #732

______________________________________________________________________

## Type of Change

- ✨ New feature

______________________________________________________________________

## Testing

Local, CI

______________________________________________________________________

## Additional Notes (optional)

@eujing have a look at the [commit history in the
PR](https://github.com/microsoft/MLOS/pull/931/commits) for a sense of
what's going on.
This can probably be merged as is. Happy to discuss further though.

______________________________________________________________________
eujing (Contributor) commented Jan 24, 2025

@bpkroth @motus based on #931, I have been trying to work out a minimal set of changes for:

  1. Registering new experiments: This looks possible via the existing Storage.experiment method, with some extensions for specifying the new details like ts_start
  2. Polling for an experiment ready to run: Following the pseudo-code in #931 (Schema changes example for mlos_benchd service) makes sense
  3. Launching a given experiment with the right 'CLI options'

I was thinking 1, 2, and a basic version of 3 can be done in the same PR, and would only be testable by a very minimal experiment with no CLI options. Essentially running with default CLI options, only specifying environment.

For 3, probably in a separate PR, I feel there is additional information that we might want to extend the Experiment schema with. We briefly discussed this with regard to global overrides in the above PR.

Right now, the only info stored from there in Experiment is the environment or root_env_config.
Important ones we probably will need for meaningful experiments are:

  • config_path
  • services
  • optimizer
  • storage
  • globals

Given how these are all CLI options, maybe it makes sense to break these off into a separate table or JSON field in Experiment for flexibility. However, it becomes a bit inconsistent that environment is part of Experiment.

These values are all usually file paths, so maybe we can discuss further later on how they would actually be made available. But for now we can assume they are already available on the VM mlos_benchd is executing in.
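
For instance, a runner could assemble the child invocation from those stored values roughly as follows (a sketch only; the attribute names on the experiment record and the exact CLI flags are assumptions):

```python
# Sketch of assembling an mlos_bench invocation from stored Experiment fields;
# the attribute names (root_env_config, config_path, globals_file, etc.) and the
# exact CLI flag spellings are illustrative assumptions about the extended schema.
import subprocess


def build_mlos_bench_cmd(exp) -> list[str]:
    """Build the mlos_bench command line for an Experiment record."""
    cmd = ["mlos_bench", "--environment", exp.root_env_config]
    if exp.config_path:
        cmd += ["--config-path", exp.config_path]
    if exp.cli_config:
        cmd += ["--config", exp.cli_config]
    if exp.globals_file:
        cmd += ["--globals", exp.globals_file]
    return cmd


def launch_experiment(exp) -> int:
    """Run the experiment as a child process and return its exit code."""
    return subprocess.run(build_mlos_bench_cmd(exp)).returncode
```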

Does this approach make sense?
