
Push after running neuron_parallel_compile #354

Merged
merged 16 commits from push_with_neuron_parallel_compile into main
Nov 24, 2023

Conversation

@michaelbenayoun (Member) commented on Nov 24, 2023

This PR adds support for pushing to the cache repository on the Hugging Face Hub after using neuron_parallel_compile.

Basically, when a training run is launched after neuron_parallel_compile has been run, all of the compilation files are pushed to the Hub before the actual training starts.

cc @5cp
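
For illustration, here is a minimal sketch of what the push step amounts to, using the huggingface_hub client. The cache directory and repository id below are placeholders (the real values are resolved by optimum-neuron's cache utilities), so treat this as a sketch of the idea rather than the merged implementation:

from huggingface_hub import HfApi

# Placeholder values: optimum-neuron resolves the actual cache location and repo id.
LOCAL_COMPILE_CACHE = "/var/tmp/neuron-compile-cache"
CACHE_REPO_ID = "my-org/neuron-cache"

api = HfApi()
# Upload every compilation artifact produced by neuron_parallel_compile
# to the cache repository on the Hugging Face Hub.
api.upload_folder(
    folder_path=LOCAL_COMPILE_CACHE,
    repo_id=CACHE_REPO_ID,
    repo_type="model",
    commit_message="Add Neuron compilation cache files",
)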

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

optimum/neuron/utils/cache_utils.py (outdated review thread)
from .require_utils import requires_safetensors


logger = logging.get_logger()


def should_log() -> bool:
Collaborator:
This is very specific to neuronx_distributed, and not really related to logging. Can you move it under the neuronx_distributed directory, and choose a name that clearly states that it verifies we are in the first worker's process (that's what I understood)?

@michaelbenayoun (Member, Author) replied on Nov 24, 2023:

No, it is linked to torch.distributed overall; it is used everywhere, for instance in the Trainer. I will rename it.

@dacorvo (Collaborator) left a comment:

Maybe change the name, otherwise LGTM ...

@@ -45,7 +45,7 @@
 logger = logging.get_logger()
 
 
-def should_log() -> bool:
+def should_current_worker_log() -> bool:
Collaborator:

It still has nothing to do with logging. I would call it is_main_worker, wdyt?

Member Author:

Alright, doing it.
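
For context, here is a minimal sketch of what the renamed helper could look like, assuming it relies only on torch.distributed as discussed above (an illustration, not necessarily the exact code merged in this PR):

import torch.distributed as dist


def is_main_worker() -> bool:
    # If torch.distributed is unavailable or uninitialized, there is a single
    # process, which is by definition the main worker.
    if not dist.is_available() or not dist.is_initialized():
        return True
    # Otherwise, only the process with global rank 0 counts as the main worker.
    return dist.get_rank() == 0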

@michaelbenayoun michaelbenayoun merged commit a8ae151 into main Nov 24, 2023
@michaelbenayoun michaelbenayoun deleted the push_with_neuron_parallel_compile branch November 24, 2023 17:23
3 participants