Push after running neuron_parallel_compile
#354
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
optimum/neuron/utils/misc.py (Outdated)

    from .require_utils import requires_safetensors

    logger = logging.get_logger()

    def should_log() -> bool:
This is very specific to neuronx_distributed, and not really related to logging. Can you move it under the neuronx_distributed directory, and choose a name that clearly states that this verifies we are in the process of the first worker (that's what I understood).
No, it's linked to the overall torch.distributed setup; it is used everywhere, like in the Trainer. Will rename it.
Maybe change the name, otherwise LGTM ...
optimum/neuron/utils/misc.py (Outdated)

    @@ -45,7 +45,7 @@
     logger = logging.get_logger()

    -def should_log() -> bool:
    +def should_current_worker_log() -> bool:
It still has nothing to do with logging. I would call it is_main_worker, wdyt?
Alright, doing it.
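For reference, a minimal sketch of what an is_main_worker helper could look like. This is not the actual optimum-neuron implementation (which relies on torch.distributed); here the idea is illustrated with the RANK environment variable that distributed launchers such as torchrun export, and the default-to-rank-0 fallback is an assumption:

```python
import os


def is_main_worker() -> bool:
    # Sketch only: distributed launchers (e.g. torchrun) export RANK for
    # each process; outside a distributed run no RANK is set, so we treat
    # the single process as the main worker (rank 0).
    return int(os.environ.get("RANK", "0")) == 0
```

A real implementation would instead check torch.distributed.is_initialized() and torch.distributed.get_rank() so the answer stays correct regardless of how the process group was launched.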
This PR adds support for pushing to the cache repository on the Hugging Face Hub after using neuron_parallel_compile. Basically, when running a training after doing neuron_parallel_compile, all the compilation files will be pushed to the Hub before starting the actual training.

cc @5cp