This repository has been archived by the owner on Jan 6, 2023. It is now read-only.

User loss of work if the cluster change occurs in the middle of the epoch #98

Open
aivanou opened this issue Apr 28, 2020 · 1 comment

Comments

@aivanou
Contributor

aivanou commented Apr 28, 2020

Description

Currently, when a cluster membership change occurs, the agent kills all the workers running the user's script, performs rank redistribution, and spawns them again. Since there is neither a feedback mechanism nor a communication protocol between the workers and the agent, the user can lose all computational work done since the last checkpoint.
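For context, here is a minimal sketch of the kind of training loop this affects. The helper names and structure are illustrative only, not part of torchelastic's API: if the loop checkpoints only at epoch boundaries, a restart triggered by a membership change rolls the workers back to the last completed epoch.

```python
import os
import torch

# Hypothetical helpers; the names and checkpoint layout are placeholders.
def save_checkpoint(path, model, optimizer, epoch):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "epoch": epoch,
    }, path)

def load_checkpoint(path, model, optimizer):
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"]

def train(model, optimizer, data_loader, ckpt_path="checkpoint.pt", max_epochs=10):
    # Resume from the last completed epoch if a checkpoint exists.
    start_epoch = 0
    if os.path.exists(ckpt_path):
        start_epoch = load_checkpoint(ckpt_path, model, optimizer) + 1

    for epoch in range(start_epoch, max_epochs):
        for batch in data_loader:
            loss = model(batch).sum()  # placeholder loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Checkpoint only at the epoch boundary: if the agent kills and
        # respawns the workers mid-epoch, everything after this point is lost
        # and must be recomputed on restart.
        save_checkpoint(ckpt_path, model, optimizer, epoch)
```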

@jnkl314

jnkl314 commented Mar 19, 2021

Hi,
I'm running distributed training with torchelastic (thanks a lot for the amazing work btw!), and I have very long epochs, so any change in the number of workers (or a preemption, when using preemptible nodes) results in a large amount of wasted computation since the last checkpoint.
Is there any update on this issue? Or any hint at a workaround for now?
Would it be possible to detect when a worker group is about to stop?
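One way to bound the loss without agent-side support is to checkpoint every N steps instead of once per epoch and resume at step granularity. The sketch below assumes the script can skip already-processed batches on restart; the helper names and paths are hypothetical, not an official torchelastic recommendation.

```python
import os
import torch

CKPT_PATH = "checkpoint.pt"       # hypothetical path
CHECKPOINT_EVERY_N_STEPS = 100    # tune to epoch length vs. checkpoint cost

def save_step_checkpoint(model, optimizer, epoch, step):
    # Write to a temp file and rename so a mid-write kill cannot corrupt
    # the checkpoint that the restarted workers will load.
    tmp = CKPT_PATH + ".tmp"
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch,
                "step": step}, tmp)
    os.replace(tmp, CKPT_PATH)

def load_step_checkpoint(model, optimizer):
    if not os.path.exists(CKPT_PATH):
        return 0, 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"], state["step"] + 1

def train(model, optimizer, data_loader, max_epochs=10):
    start_epoch, start_step = load_step_checkpoint(model, optimizer)
    for epoch in range(start_epoch, max_epochs):
        for step, batch in enumerate(data_loader):
            if epoch == start_epoch and step < start_step:
                continue  # skip batches already processed before the restart
            loss = model(batch).sum()  # placeholder loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            if step % CHECKPOINT_EVERY_N_STEPS == 0:
                save_step_checkpoint(model, optimizer, epoch, step)
        start_step = 0  # only the first resumed epoch needs the skip
```

With this pattern, a restart caused by a membership change or a preempted node costs at most N steps of recomputation instead of a full epoch.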
