Schema changes example for mlos_benchd service #931

Merged

@bpkroth bpkroth commented Jan 14, 2025

Pull Request

Title

Schema changes for mlos_benchd service.


Description

Schema changes for mlos_benchd service.

Storage APIs to adjust these new fields will come in a future PR.


Type of Change

  • ✨ New feature

Testing

Local, CI


Additional Notes (optional)

@eujing have a look at the commit history in the PR for a sense of what's going on.
This can probably be merged as is. Happy to discuss further though.


@bpkroth bpkroth requested a review from a team as a code owner January 14, 2025 17:14
@bpkroth bpkroth added the ready for review label Jan 14, 2025

bpkroth commented Jan 14, 2025

Rough idea here is in the comments in the schema.py file.

At a high level, imagine that we have 1+ VMs running mlos_benchd as a service, continually polling the backend storage to see if there are new Experiments submitted that are PENDING.

Note: Here a "Worker" runs mlos_benchd to spawn multiple mlos_bench processes, one per Experiment, whereas within each mlos_bench process 1+ TrialRunners may run multiple Trials for that Experiment in parallel (WIP, #380).
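As a rough illustration of that Worker/process relationship, here is a minimal sketch that tracks one child process per Experiment. The `Worker` class, its method names, and the command passed to `spawn` are all hypothetical and not part of this PR; the real `mlos_benchd` may look quite different.

```python
import subprocess
import sys


class Worker:
    """Hypothetical Worker that tracks one child process per Experiment."""

    def __init__(self) -> None:
        # Maps experiment_id -> the background process running it.
        self.children: dict[str, subprocess.Popen] = {}

    def spawn(self, experiment_id: str, cmd: list[str]) -> subprocess.Popen:
        """Start `cmd` in the background for the given Experiment."""
        if experiment_id in self.children:
            raise ValueError(f"{experiment_id} is already running on this Worker")
        proc = subprocess.Popen(cmd)
        self.children[experiment_id] = proc
        return proc

    def reap(self) -> dict[str, int]:
        """Collect exit codes of finished children and stop tracking them."""
        finished: dict[str, int] = {}
        for experiment_id, proc in list(self.children.items()):
            code = proc.poll()  # None while the process is still running
            if code is not None:
                finished[experiment_id] = code
                del self.children[experiment_id]
        return finished


if __name__ == "__main__":
    worker = Worker()
    # Stand-in command; a real Worker would launch mlos_bench here.
    child = worker.spawn("exp-1", [sys.executable, "-c", "pass"])
    child.wait()
    print(worker.reap())
```

Keeping the mapping keyed by `experiment_id` is one way to enforce the "one process per Experiment" rule on a single host; the database claim is what enforces it across hosts.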

A Worker starts a transaction to lay claim to an Experiment and then forks off a new mlos_bench process to run it in the background.

# quick pseudo code (SQLAlchemy-like, not runnable as-is)
while True:
    sleep(1)
    with engine.connect() as conn:
        try:
            # look for an unclaimed Experiment that is due to start
            experiment_row = conn.execute(
                select(schema.experiments)
                .where(
                    schema.experiments.status == Status.PENDING.name,
                    schema.experiments.worker_name.is_(None),
                    schema.experiments.start_ts <= datetime.utcnow(),
                )
                .limit(1)
            ).fetchone()
            if experiment_row:
                # try to grab it; the worker_name check guards against races
                result = conn.execute(
                    update(schema.experiments)
                    .values({
                        schema.experiments.worker_name: socket.gethostname(),
                        schema.experiments.status: Status.READY.name,
                    })
                    .where(
                        schema.experiments.worker_name.is_(None),
                        schema.experiments.experiment_id == experiment_row.experiment_id,
                    )
                )
                if result.rowcount == 1:
                    # succeeded, commit the transaction and return
                    conn.commit()
                    # return this to calling code to spawn a new `mlos_bench`
                    # process to fork and execute this Experiment on this host
                    # in the background
                    return experiment_row.experiment_id
                else:
                    # someone else probably grabbed it
                    conn.rollback()
        except SQLAlchemyError as e:
            # probably a conflict
            conn.rollback()
    # try again in a moment
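
For anyone who wants to try the claim step in isolation, here is a small self-contained sketch using the standard library's `sqlite3` instead of the real storage backend. The table layout is a simplified assumption (only `experiment_id`, `status`, and `worker_name`; `start_ts` is left out), and the helper names `find_pending` and `claim` are hypothetical, not part of this PR.

```python
import sqlite3
from typing import Optional


def make_db() -> sqlite3.Connection:
    """Create an in-memory DB with a simplified, assumed experiments table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE experiments ("
        " experiment_id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " worker_name TEXT)"
    )
    conn.execute("INSERT INTO experiments VALUES ('exp-1', 'PENDING', NULL)")
    conn.commit()
    return conn


def find_pending(conn: sqlite3.Connection) -> Optional[str]:
    """Return the id of one PENDING, unclaimed Experiment, or None."""
    row = conn.execute(
        "SELECT experiment_id FROM experiments"
        " WHERE status = 'PENDING' AND worker_name IS NULL LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def claim(conn: sqlite3.Connection, experiment_id: str, worker_name: str) -> bool:
    """Try to claim the Experiment; succeeds only if nobody owns it yet."""
    cur = conn.execute(
        "UPDATE experiments SET worker_name = ?, status = 'READY'"
        " WHERE experiment_id = ? AND worker_name IS NULL",
        (worker_name, experiment_id),
    )
    if cur.rowcount == 1:
        conn.commit()
        return True
    conn.rollback()  # someone else grabbed it first
    return False


if __name__ == "__main__":
    conn = make_db()
    # Two Workers both see the same candidate before either claims it.
    seen_by_a = find_pending(conn)
    seen_by_b = find_pending(conn)
    print(claim(conn, seen_by_a, "worker-a"))
    print(claim(conn, seen_by_b, "worker-b"))
```

The guard that matters is the `worker_name IS NULL` condition in the `UPDATE`: the second Worker's statement matches zero rows, so it can detect from the row count that it lost the race and try again later.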

@bpkroth bpkroth added and removed the ready for review label Jan 14, 2025

@motus motus left a comment

Approved, but let's have a short discussion before merging it in.

mlos_bench/mlos_bench/storage/sql/schema.py (review thread resolved)
@bpkroth bpkroth enabled auto-merge (squash) January 17, 2025 00:06
@bpkroth bpkroth disabled auto-merge January 17, 2025 00:13
@bpkroth bpkroth enabled auto-merge (squash) January 17, 2025 00:13
@bpkroth bpkroth merged commit 0cab884 into microsoft:main Jan 17, 2025
16 checks passed
@bpkroth bpkroth deleted the schema-changes-example-for-mlos-benchd-service branch January 17, 2025 00:13