Support for the execution policy API in JobSet #672
Comments
/kind feature |
SGTM! (The description shouldn’t be a design but I think that we don’t need suspended. And you also declare ReadyStatus for the succeeded state) |
This would be great! I agree to not try to create a workflow tool, but allowing this basic dependency structure is something that workflow tools can use. +1 from me. |
That sounds like a really cool feature! I'm curious about something though. Currently, in the JobSet, we have multiple policies like StartupPolicy and FailurePolicy. As we keep adding more features to our policies, we might want to think about whether we should make them work together or whether we need to set some restrictions (for example, can StartupPolicy and ExecutionPolicy coexist?). This way, users can avoid making a lot of mistakes when using them. |
I think the feature makes sense; executing "Initializer -> Trainer -> Post-Processor" stages sequentially for LLM fine-tuning is an interesting concrete use case. @andreyvelich can you educate me on why the fine-tuning steps must be decomposed into separate jobs? Why can't the same job perform initialization, training, then post-processing?
@googs1025 we can add validation to the JobSet webhook which ensures the spec configurations are compatible with each other. |
@danielvegamyhre Sure. Given the size of today's models (e.g. >100b params) and datasets, downloading the pre-trained model and dataset takes time, e.g. 1-2 hours. If we define these initialization steps as |
The current JobSet API is alpha. So, I think it would be great to consider how some policies can work together and how we can consolidate some fields into one once we graduate to the beta stage. This is one of the reasons why we keep staying at the alpha stage, as we discussed in the previous meeting. |
If the goal is not to lock the GPUs during other stages, it is important to note what your expectations are for the future when queuing in Kueue.
My concern is where do we stop. You add this feature now, then someone comes and asks "what if we add X", and suddenly we have a workflow tool. |
Good point. Actually, I guess that the JobSet already has the same problem in the StartupPolicy. The problem is limited compared with executionPolicy, though. So, I'm wondering if we should investigate a solid approach to support the Jobs with ordering and steps.
I have the same concern. So, if we introduce this feature, I think we need to consider an API that prevents conditional parameters.
|
I think that the separate "suspend" field and 3 different objects are hard to maintain and not straightforward. |
+1
Are you referring to suspending each of the 3 jobs (Initializer job, Trainer job, Post-Processor job) and resuming them one at a time as the prior job's execution completes? |
You are starting to sound like me! The answer is that we stop there with a "depends on." If you look at workload managers, it's fairly common to be able to say:
And that's it. There is no further orchestration, suspend or waiting, etc. It's just a flat level depends on that can provide a simple API for actual workflow tools (that represent a dag) to use. |
You are saying that you want to support arbitrary DAGs already, whereas the initial proposal is just to support a sequence of jobs. That's the kind of leap I was rather against. An arbitrary DAG is already a workflow manager. Why aren't we using a workflow manager as opposed to adding workflow capabilities to JobSet? |
Where did I say that? |
Unless I misinterpreted what you were trying to say. |
Depends on (in and of itself) does not create what most would consider substantial enough for a workflow DAG. It provides a minimal API in the workload manager so that workflow tools can more easily create complex logic.
The workload manager has no knowledge of checking state, submission, or custom action logic beyond the very simple "A depends on B." It's not really enough to be called a workflow but it enables workflow tools to better interact with it. I am absolutely not saying that I want to support arbitrary DAGs. I am pointing out a common pattern in the workload manager ecosystem for HPC that has existed for decades, and made it possible for workflow tools to better integrate. |
Do you have a minimal proposal for what jobset could provide? Are you saying that you want to say "jobset X depends on jobset Y"? I ask that because the current proposal is along the lines of "this jobset has multiple steps within it, each of which can be started after the other". Those are two very different approaches. |
No I don’t think depends on would fall on the level of the jobset, more likely the job. |
I think having the "depends on" relationship operate on the ReplicatedJob level may be better - for example, for something like the "initializer -> trainer -> post-processor" use case described in this issue, I think we may want to have each stage be a separate ReplicatedJob (since for example, the "trainer" stage may be composed of several concurrent jobs running on different groups of infrastructure, but they all depend on the "initializer" stage being completed). |
Exactly, that is what I had in mind when I suggested this in the slack channel, similar to StartupPolicy which is executed at the ReplicatedJob level too. ExecutionPolicy and StartupPolicy are similar, but one is based on readiness status, the other based on completion status. |
Also, the suggestion here is not to have an API that explicitly defines the chain; the chain is based on the order in the replicatedJob list, again similar to StartupPolicy. |
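To make the readiness-versus-completion distinction concrete, here is a minimal Python sketch of order-based gating, where the next entry in the list is created only once the previous one reaches the required condition. The function name `creatable_jobs`, the status strings, and the dict-based status input are hypothetical and not part of the JobSet API. The sketch also assumes that a completed job counts as having been ready.

```python
from typing import Dict, List

# Hypothetical status ordering: a completed job is assumed to have passed "ready".
_RANK = {"pending": 0, "ready": 1, "completed": 2}


def creatable_jobs(order: List[str], status: Dict[str, str], gate: str) -> List[str]:
    """Return the replicated jobs that may be created, following list order.

    gate="ready" mimics a readiness-based policy (StartupPolicy-like);
    gate="completed" mimics a completion-based policy (ExecutionPolicy-like).
    """
    if gate not in ("ready", "completed"):
        raise ValueError(f"unknown gate: {gate!r}")
    allowed: List[str] = []
    for name in order:
        allowed.append(name)
        # Stop at the first job that has not yet reached the gate condition.
        if _RANK[status.get(name, "pending")] < _RANK[gate]:
            break
    return allowed


order = ["initializer", "trainer", "post-processor"]
status = {"initializer": "completed", "trainer": "ready"}
print(creatable_jobs(order, status, gate="ready"))
print(creatable_jobs(order, status, gate="completed"))
```

With the readiness gate, all three entries may be created because the trainer is already ready. With the completion gate, the post-processor is held back until the trainer finishes.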
Awesome, well if we are all aligned on this I can implement it. |
The only thing we probably want to be configurable is how many entries to run sequentially. The default (expressed as nil) is all, but the user could set it to 1, for example, to run the first entry (which does some initialization for the whole workload) so that the rest of the entries can execute in parallel. |
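A rough Python sketch of that knob, under the assumption that the first N entries each form their own sequential stage and all remaining entries run together in one final parallel stage. The name `plan_stages` and the parameter `sequential_count` are hypothetical; `None` stands in for the nil default.

```python
from typing import List, Optional


def plan_stages(
    replicated_jobs: List[str], sequential_count: Optional[int] = None
) -> List[List[str]]:
    """Group replicated jobs into stages that run one after another.

    Jobs inside a stage run in parallel. None (nil) means every entry is
    sequential; otherwise only the first `sequential_count` entries are.
    """
    if sequential_count is None:
        sequential_count = len(replicated_jobs)
    if sequential_count < 0:
        raise ValueError("sequential_count must be non-negative")
    n = min(sequential_count, len(replicated_jobs))
    # Each of the first n entries is its own stage.
    stages = [[name] for name in replicated_jobs[:n]]
    # Whatever is left runs together as a single parallel stage.
    rest = replicated_jobs[n:]
    if rest:
        stages.append(rest)
    return stages


jobs = ["initializer", "trainer-a", "trainer-b"]
print(plan_stages(jobs))      # default: fully sequential
print(plan_stages(jobs, 1))   # initializer first, trainers in parallel
```

Setting the value to 1 yields two stages: the initializer alone, then both trainers together.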
How should we calculate quota usage for such a sequence? |
Yes, this proposal is only for a Job sequence, and as @vsoch mentioned, this concept has existed in the HPC space for a long time.
@ahg-g @danielvegamyhre Do we need a small KEP for it, or can we jump directly to the implementation?
@alculquicondor I think that in the first iteration we should calculate the quota as for a normal JobSet (e.g. admit the JobSet if the resources for all replicated Jobs are available). However, it would be nice to admit/suspend steps in the Job sequence separately. |
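The two quota options can be compared with a small Python sketch. It assumes resources are plain integer counts keyed by name; the helper names (`total_request`, `peak_stage_request`, `fits`) and the example numbers are made up for illustration and do not reflect how Kueue computes admission.

```python
from typing import Dict, List

Resources = Dict[str, int]


def total_request(stages: List[List[Resources]]) -> Resources:
    """Sum requests over every job in every stage (whole-JobSet admission)."""
    total: Resources = {}
    for stage in stages:
        for job in stage:
            for name, qty in job.items():
                total[name] = total.get(name, 0) + qty
    return total


def peak_stage_request(stages: List[List[Resources]]) -> Resources:
    """Largest per-resource request of any single stage (per-step admission)."""
    peak: Resources = {}
    for stage in stages:
        for name, qty in total_request([stage]).items():
            peak[name] = max(peak.get(name, 0), qty)
    return peak


def fits(request: Resources, available: Resources) -> bool:
    """True if every requested quantity is within the available quota."""
    return all(qty <= available.get(name, 0) for name, qty in request.items())


# Illustrative numbers only: initializer, two parallel trainers, post-processor.
stages = [
    [{"cpu": 8}],
    [{"cpu": 16, "gpu": 8}, {"cpu": 16, "gpu": 8}],
    [{"cpu": 4}],
]
available = {"cpu": 40, "gpu": 16}
print(fits(total_request(stages), available))
print(fits(peak_stage_request(stages), available))
```

In this made-up example the summed request (44 CPUs) exceeds the quota, while the largest single stage (32 CPUs, 16 GPUs) fits, which is the gap between the first-iteration approach and per-step admission.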
We should have a small KEP for this. |
I agree that we can leave implementation details for a second iteration. However, there are too many moving pieces here to completely ignore the question until later. We at least need a high level plan of how it's going to work.
Not really. We need to figure out if the proposed design for Argo will also work for Jobset. I don't think it currently does, as the proposal relies on the pod group integration, which wouldn't be ideal for jobset. |
/assign @andreyvelich |
What would you like to be added:
The new executionPolicy API, which allows replicated Jobs to be submitted in order. When the first replicated Jobs have reached the required condition, the next replicated Jobs are created.
Note: complex DAG workflow capability is out of scope for this API, since we don't want to implement workflow functionality as part of this KEP. Users should consider using Argo Workflows or Tekton Pipelines if they need it.
The initial API design:
Why is this needed:
More context in this Kubernetes wg-batch thread: https://kubernetes.slack.com/archives/C032ZE66A2X/p1725400839102729
As part of the Kubeflow Training V2 APIs, we want to implement the LLM runtimes for LLM fine-tuning: kubeflow/training-operator#2170
That will require JobSet to orchestrate the sequence of 2-3 Jobs: Initializer -> Trainer -> Post-Processor.
Capacity for such a workload should be allocated for all Jobs combined and be controlled by Kueue.
When TrainJob is suspended, we will suspend all underlying Jobs.
I think we might have more use cases from the HPC side. Any thoughts @vsoch @alculquicondor?
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.
cc @tenzen-y @kannon92 @ahg-g @johnugeorge @akshaychitneni @shravan-achar