This repository has been archived by the owner on Jan 6, 2023. It is now read-only.
Currently, when a cluster membership change occurs, the agent kills all the workers running the user's script, performs rank redistribution, and spawns them again. Since there is neither a feedback mechanism nor a communication protocol between the workers and the agent, the user can lose all computational work done since the last checkpoint.
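For reference, the only mitigation available today is the usual checkpoint/resume pattern: because the agent restarts the script from the top after a membership change, anything done after the last checkpoint is lost, so the checkpoint interval bounds the waste. Below is a minimal sketch of that pattern; the `save_checkpoint`/`load_checkpoint` helpers and the checkpoint path are hypothetical, not part of torchelastic.

```python
# Sketch: periodic checkpointing so a restarted worker resumes instead of
# starting over. Only assumption about torchelastic: the agent re-runs the
# script from scratch after a membership change.
import os
import torch

CKPT_PATH = "/shared/ckpt.pt"  # hypothetical path on storage visible to all nodes

def load_checkpoint(model, optimizer):
    """Resume from the latest checkpoint if one exists; otherwise start at epoch 0."""
    if os.path.exists(CKPT_PATH):
        state = torch.load(CKPT_PATH, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["epoch"] + 1
    return 0

def save_checkpoint(model, optimizer, epoch):
    """Write atomically so a worker killed mid-save never leaves a torn file."""
    tmp = CKPT_PATH + ".tmp"
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, tmp)
    os.replace(tmp, CKPT_PATH)

def train(model, optimizer, loader, num_epochs):
    start_epoch = load_checkpoint(model, optimizer)
    for epoch in range(start_epoch, num_epochs):
        for batch in loader:
            ...  # forward / backward / optimizer step
        save_checkpoint(model, optimizer, epoch)  # work after this point is at risk
```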
Hi,
I'm running distributed training with torchelastic (thanks a lot for the amazing work, btw!), and I have very long epochs, so any change in the number of workers (or any preemption when using preemptible nodes) wastes a lot of computation since the last checkpoint.
Is there any update on this issue, or any hint at a workaround for now?
Would it be possible to detect when a worker group is about to stop? See the sketch below for the kind of hook I have in mind.
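One possible workaround, assuming the agent delivers SIGTERM to the worker processes with some grace time before tearing them down (worth verifying for your launcher and version, since this is not a documented torchelastic guarantee), is to install a signal handler that asks the training loop to checkpoint at the next safe point. A sketch of that idea:

```python
# Sketch: checkpoint on SIGTERM before the worker group is torn down.
# Assumption (not guaranteed by torchelastic): the worker actually receives
# SIGTERM, and with enough grace time to finish the current step and save.
import signal

class ShutdownFlag:
    """Set when the process is asked to terminate; polled by the training loop."""
    def __init__(self):
        self.triggered = False
        signal.signal(signal.SIGTERM, self._handler)

    def _handler(self, signum, frame):
        self.triggered = True  # keep the handler trivial; do real work in the loop

def train(model, optimizer, loader, save_checkpoint):
    # `save_checkpoint` is a user-provided helper, as in the sketch above.
    flag = ShutdownFlag()
    for step, batch in enumerate(loader):
        ...  # forward / backward / optimizer step
        if flag.triggered:
            save_checkpoint(model, optimizer, step)  # safe point: between steps
            break
```

Note that this does nothing if the process is killed with SIGKILL, so it complements rather than replaces periodic checkpointing.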