Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basic implementation for mlos_benchd service #949

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

eujing
Copy link
Contributor

@eujing eujing commented Feb 5, 2025

Pull Request

Basic implementation for mlos_benchd service

This PR introduces:

  • Creation of Experiment via the storage API, separate from running via CLI.
  • A new mlos_benchd script that polls storage for runnable experiments, then executes them.

This allows for the separation of experiment creation (e.g. scheduling), and their execution (on potentially multiple hosts).
The new mlos_benchd script can run on any host to poll for new experiments, as long as they monitor the right storage backend.

Builds upon schema changes from #931

Example usage

Create new experiment, via notebook or any python environment:

from mlos_bench.storage import from_config
storage = from_config(config="mlos_bench/mlos_bench/config/storage/sqlite.jsonc")
exp = storage.experiment(experiment_id="test-exp-01", trial_id=0, root_env_config="mlos_bench/mlos_bench/config/environments/apps/fake/test_local_env.jsonc", description="description", tunables={}, opt_targets={"score": "min"})
exp.save()

mlos_benchd:

/workspaces/MLOS (main) $ python mlos_bench/mlos_bench/mlos_benchd.py --storage "mlos_bench/mlos_bench/config/storage/sqlite.jsonc"
...
No runnable experiment found. Sleeping for 1 second.
No runnable experiment found. Sleeping for 1 second.
No runnable experiment found. Sleeping for 1 second.
...
2025-02-05 20:35:06,208 launcher.py:51 __init__ INFO Launch: mlos_bench
...
2025-02-05 20:35:08,270 base_scheduler.py:275 get_best_observation INFO Env: Local Shell Test Environment best score: {'score': 123.4}
2025-02-05 20:35:08,271 run.py:74 _main INFO Final score: {'score': 123.4}
No runnable experiment found. Sleeping for 1 second.

Limitations:

  • Assumes all experiment-related config files are already available on the host
  • Does not support per-experiment global overrides yet

Description


Type of Change

Indicate the type of change by choosing one (or more) of the following:

  • ✨ New feature

Testing

Manual testing, unit tests to come.


Additional Notes (optional)

Add any additional context or information for reviewers.


@@ -159,6 +175,11 @@ def __init__( # pylint: disable=too-many-arguments
)
self._description = description
self._opt_targets = opt_targets
self._ts_start = ts_start or datetime.now(UTC)
self._ts_end: datetime | None = None
self._status = Status.PENDING
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should match what was stored in the backend for resumable Experiments, right?

@@ -209,6 +230,9 @@ def _setup(self) -> None:

This method is called by `Storage.Experiment.__enter__()`.
"""
self._status = Status.RUNNING
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add some asserts on expected status to check for invalid state transitions.
Wouldn't be terrible to document the expected state transitions in a README.md or docstring either.

@@ -209,6 +230,9 @@ def _setup(self) -> None:

This method is called by `Storage.Experiment.__enter__()`.
"""
self._status = Status.RUNNING
self._driver_name = platform.node()
self._driver_pid = os.getpid()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seem ok to initialize the values from None, but might be problematic for resuming an Experiment.

There are a few cases I can think of:

  1. An individual mlos_bench driver process fails and needs to be restarted.
  2. The mlos_benchd service dies, but the mlos_bench process is still running.
  3. The whole driver host fails and either needs to be restarted on that same backend or else picked up by a new one (for simplicity, let's assume we only support the former for now. We can add a heartbeat mechanism later to support the latter).

)
self._engine = engine
self._schema = schema

def _setup(self) -> None:
super()._setup()
def _ensure_persisted(self) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _ensure_persisted(self) -> None:
def _save(self) -> None:

or

Suggested change
def _ensure_persisted(self) -> None:
def _try_save(self) -> None:

or

Suggested change
def _ensure_persisted(self) -> None:
def _persist(self) -> None:

def _setup(self) -> None:
super()._setup()
self._ensure_persisted()
with self._engine.begin() as conn:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might need to separate that out to an _update method, or just incorporate it into _save

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also rename _save to _create to separate the INSERT from UPDATE calls

except exc.SQLAlchemyError:
# probably a conflict
trans.rollback()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change



if __name__ == "__main__":
parser = argparse.ArgumentParser(description="mlos_benchd")
Copy link
Contributor

@bpkroth bpkroth Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a long text description, similar to mlos_bench.launcher

help="Number of workers to use. Default is 1.",
)
parser.add_argument(
"--poll_interval",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also provide poll-interval variants, like mlos_bench.launcher

"--environment",
root_env_config,
"--experiment_id",
exp_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll eventually need to include other things here too:

  • cli config
  • globals (could be subsumed by cli config)
  • working dir (maybe we adjust the storage backend to store the target directory and cli-config and optionally globals instead of root_env_config)

Copy link
Contributor

@bpkroth bpkroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start! Thanks so much ❤️

I left a few comments for initial changes.

We'll also need tests.

This script is responsible for polling the storage for runnable experiments and
executing them in parallel.

See the current ``--help`` `output for details.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
See the current ``--help`` `output for details.
See the current ``--help`` output for details.

"""
mlos_bench background execution daemon.

This script is responsible for polling the storage for runnable experiments and
Copy link
Contributor

@bpkroth bpkroth Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This script is responsible for polling the storage for runnable experiments and
This script is responsible for polling the :py:mod:`~mlos_bench.storage` for runnable :py:class:`.Experiment`s and

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor tweaks to help make all of the docstring generation cross referencing. Might need some tweaks.

default=1,
help="Polling interval in seconds. Default is 1.",
)
_main(parser.parse_args())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For testing, may also want some sort of hidden argument or environment variable used to set the MAX_ITERATIONS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants