This repository has been archived by the owner on Jan 6, 2023. It is now read-only.

User loss of work if the cluster change occurs in the middle of the epoch #98

Open
aivanou opened this issue Apr 28, 2020 · 1 comment

Comments

@aivanou
Contributor

aivanou commented Apr 28, 2020

Description

Currently, when a cluster membership change occurs, the agent kills all the workers running the user's script, performs rank redistribution, and spawns them again. Since there is neither a feedback mechanism nor a communication protocol between the workers and the agent, the user can lose all computational work done since the last checkpoint.
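For context, here is a minimal sketch of the kind of training loop this affects. The helper names and structure are illustrative only, not part of torchelastic's API: if the loop checkpoints only at epoch boundaries, a restart triggered by a membership change rolls the workers back to the last completed epoch.

```python
import os
import torch

# Hypothetical helpers; the names and checkpoint layout are placeholders.
def save_checkpoint(path, model, optimizer, epoch):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "epoch": epoch,
    }, path)

def load_checkpoint(path, model, optimizer):
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"]

def train(model, optimizer, data_loader, ckpt_path="checkpoint.pt", max_epochs=10):
    # Resume from the last completed epoch if a checkpoint exists.
    start_epoch = 0
    if os.path.exists(ckpt_path):
        start_epoch = load_checkpoint(ckpt_path, model, optimizer) + 1

    for epoch in range(start_epoch, max_epochs):
        for batch in data_loader:
            loss = model(batch).sum()  # placeholder loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Checkpoint only at the epoch boundary: if the agent kills and
        # respawns the workers mid-epoch, everything after this point is lost
        # and must be recomputed on restart.
        save_checkpoint(ckpt_path, model, optimizer, epoch)
```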

@jnkl314

jnkl314 commented Mar 19, 2021

Hi,
I'm running distributed training with torchelastic (thanks a lot for the amazing work btw!), and I have very long epochs, so any change in the number of workers (or a preemption, when using preemptible nodes) results in a large amount of wasted computation since the last checkpoint.
Is there any update on this issue? Or any hint at a workaround for now?
Would it be possible to detect when a worker group is about to stop?
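One way to bound the loss without agent-side support is to checkpoint every N steps instead of once per epoch and resume at step granularity. The sketch below assumes the script can skip already-processed batches on restart; the helper names and paths are hypothetical, not an official torchelastic recommendation.

```python
import os
import torch

CKPT_PATH = "checkpoint.pt"       # hypothetical path
CHECKPOINT_EVERY_N_STEPS = 100    # tune to epoch length vs. checkpoint cost

def save_step_checkpoint(model, optimizer, epoch, step):
    # Write to a temp file and rename so a mid-write kill cannot corrupt
    # the checkpoint that the restarted workers will load.
    tmp = CKPT_PATH + ".tmp"
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch,
                "step": step}, tmp)
    os.replace(tmp, CKPT_PATH)

def load_step_checkpoint(model, optimizer):
    if not os.path.exists(CKPT_PATH):
        return 0, 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"], state["step"] + 1

def train(model, optimizer, data_loader, max_epochs=10):
    start_epoch, start_step = load_step_checkpoint(model, optimizer)
    for epoch in range(start_epoch, max_epochs):
        for step, batch in enumerate(data_loader):
            if epoch == start_epoch and step < start_step:
                continue  # skip batches already processed before the restart
            loss = model(batch).sum()  # placeholder loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            if step % CHECKPOINT_EVERY_N_STEPS == 0:
                save_step_checkpoint(model, optimizer, epoch, step)
        start_step = 0  # only the first resumed epoch needs the skip
```

With this pattern, a restart caused by a membership change or a preempted node costs at most N steps of recomputation instead of a full epoch.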
