-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Basic implementation for mlos_benchd service #949
base: main
Are you sure you want to change the base?
Conversation
@@ -159,6 +175,11 @@ def __init__( # pylint: disable=too-many-arguments | |||
) | |||
self._description = description | |||
self._opt_targets = opt_targets | |||
self._ts_start = ts_start or datetime.now(UTC) | |||
self._ts_end: datetime | None = None | |||
self._status = Status.PENDING |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should match what was stored in the backend for resumable Experiments, right?
@@ -209,6 +230,9 @@ def _setup(self) -> None: | |||
|
|||
This method is called by `Storage.Experiment.__enter__()`. | |||
""" | |||
self._status = Status.RUNNING |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add some asserts on expected status to check for invalid state transitions.
Wouldn't be terrible to document the expected state transitions in a README.md or docstring either.
@@ -209,6 +230,9 @@ def _setup(self) -> None: | |||
|
|||
This method is called by `Storage.Experiment.__enter__()`. | |||
""" | |||
self._status = Status.RUNNING | |||
self._driver_name = platform.node() | |||
self._driver_pid = os.getpid() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These seem ok to initialize the values from None
, but might be problematic for resuming an Experiment.
There are a few cases I can think of:
- An individual
mlos_bench
driver process fails and needs to be restarted. - The
mlos_benchd
service dies, but themlos_bench
process is still running. - The whole driver host fails and either needs to be restarted on that same backend or else picked up by a new one (for simplicity, let's assume we only support the former for now. We can add a heartbeat mechanism later to support the latter).
) | ||
self._engine = engine | ||
self._schema = schema | ||
|
||
def _setup(self) -> None: | ||
super()._setup() | ||
def _ensure_persisted(self) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def _ensure_persisted(self) -> None: | |
def _save(self) -> None: |
or
def _ensure_persisted(self) -> None: | |
def _try_save(self) -> None: |
or
def _ensure_persisted(self) -> None: | |
def _persist(self) -> None: |
def _setup(self) -> None: | ||
super()._setup() | ||
self._ensure_persisted() | ||
with self._engine.begin() as conn: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might need to separate that out to an _update
method, or just incorporate it into _save
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also rename _save
to _create
to separate the INSERT
from UPDATE
calls
except exc.SQLAlchemyError: | ||
# probably a conflict | ||
trans.rollback() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
|
||
if __name__ == "__main__": | ||
parser = argparse.ArgumentParser(description="mlos_benchd") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a long
text description, similar to mlos_bench.launcher
help="Number of workers to use. Default is 1.", | ||
) | ||
parser.add_argument( | ||
"--poll_interval", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also provide poll-interval
variants, like mlos_bench.launcher
"--environment", | ||
root_env_config, | ||
"--experiment_id", | ||
exp_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll eventually need to include other things here too:
- cli config
- globals (could be subsumed by cli config)
- working dir (maybe we adjust the storage backend to store the target directory and cli-config and optionally globals instead of root_env_config)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great start! Thanks so much ❤️
I left a few comments for initial changes.
We'll also need tests.
This script is responsible for polling the storage for runnable experiments and | ||
executing them in parallel. | ||
|
||
See the current ``--help`` `output for details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the current ``--help`` `output for details. | |
See the current ``--help`` output for details. |
""" | ||
mlos_bench background execution daemon. | ||
|
||
This script is responsible for polling the storage for runnable experiments and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This script is responsible for polling the storage for runnable experiments and | |
This script is responsible for polling the :py:mod:`~mlos_bench.storage` for runnable :py:class:`.Experiment`s and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor tweaks to help make all of the docstring generation cross referencing. Might need some tweaks.
default=1, | ||
help="Polling interval in seconds. Default is 1.", | ||
) | ||
_main(parser.parse_args()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For testing, may also want some sort of hidden argument or environment variable used to set the MAX_ITERATIONS
.
Pull Request
Basic implementation for mlos_benchd service
This PR introduces:
Experiment
via the storage API, separate from running via CLI.mlos_benchd
script that polls storage for runnable experiments, then executes them.This allows for the separation of experiment creation (e.g. scheduling), and their execution (on potentially multiple hosts).
The new
mlos_benchd
script can run on any host to poll for new experiments, as long as they monitor the right storage backend.Builds upon schema changes from #931
Example usage
Create new experiment, via notebook or any python environment:
mlos_benchd:
Limitations:
Description
Type of Change
Indicate the type of change by choosing one (or more) of the following:
Testing
Manual testing, unit tests to come.
Additional Notes (optional)
Add any additional context or information for reviewers.