
Save function fix #329

Merged · 8 commits merged into main from fix_long_save on Dec 18, 2023
Conversation

michaelbenayoun (Member):

This PR aims to fix the save function.

Since transformers==4.35.0, the default saving function is safetensors.torch.save_file. This PR patches the saving mechanism to account for that.
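
For context, a minimal illustration of the behavior change (assumed usage, not code from this PR): once safetensors serialization became the default, the `save_function` argument of `save_pretrained` is ignored on the safetensors path, so routing saving through `xm.save` this way no longer takes effect.

```python
# Illustration only: assumed usage, not code from this PR.
import torch_xla.core.xla_model as xm
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

# Before transformers 4.35.0, `save_function=xm.save` routed serialization
# through torch_xla. With safetensors now the default, the weights are written
# by `safetensors.torch.save_file` and `save_function` is silently ignored.
model.save_pretrained("./checkpoint", save_function=xm.save)
```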

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@michaelbenayoun changed the title from "Save function taking a lot of time" to "Save function fix" on Nov 27, 2023
@michaelbenayoun marked this pull request as ready for review on November 27, 2023 at 16:31
JingyaHuang (Collaborator) left a comment:


LGTM, thanks for the fix!

I just left some very minor nits, mostly because I want to understand better.

```diff
             else:
                 logger.info("Trainer.model is not a `PreTrainedModel`, only saving its state dict.")
                 state_dict = self.model.state_dict()
                 xm.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME))
         else:
-            self.model.save_pretrained(output_dir, is_main_process=self.args.should_save, save_function=xm.save)
+            with safe_save_function_patcher:
```
Collaborator:

What is self.model in this case? If unwrap_model(self.model) is not an instance of PreTrainedModel, can save_pretrained() still be applied?

michaelbenayoun (Member Author):

self.model is the model we are currently training, and it is an instance of PreTrainedModel in this case.

No, if unwrap_model(self.model) is not an instance of PreTrainedModel, we cannot call save_pretrained().

@@ -133,14 +133,22 @@ def create_custom_cache_repo(repo_id: str = CACHE_REPO_NAME, private: bool = Tru


```python
def is_private_repo(repo_id: str) -> bool:
    """Tells whether `repo_id` is private."""
```
Collaborator:

Does this function check whether a repo is private to the general public? If so, why do we need to check in advance whether it's private for a particular user?

michaelbenayoun (Member Author):

This function checks whether the repo is private to the general public but visible to the current user.
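
A minimal sketch of such a check with `huggingface_hub` (assumed; not necessarily the exact implementation in `optimum/neuron/utils/cache_utils.py`): the repo is queried with the user's credentials, then anonymously; an anonymous failure means it is hidden from the general public even though the current user can access it.

```python
# A sketch, not necessarily the repo's exact implementation.
from huggingface_hub import HfApi
from huggingface_hub.utils import RepositoryNotFoundError


def is_private_repo(repo_id: str) -> bool:
    """Tells whether `repo_id` is private (hidden from the public, visible to the user)."""
    api = HfApi()
    # Succeeds with the current user's cached token, proving the user can see it.
    api.list_repo_files(repo_id=repo_id)
    try:
        # `token=False` forces an anonymous request; a private repo raises
        # RepositoryNotFoundError for the general public.
        api.list_repo_files(repo_id=repo_id, token=False)
    except RepositoryNotFoundError:
        return True
    return False
```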

optimum/neuron/utils/cache_utils.py (outdated; resolved)
```python
    global_master: bool = False,
):
    """
    Torch XLA compatible implementation of `safetensors.torch.save_file`.
```
Collaborator:

Could you elaborate on those arguments a bit? I am not familiar with what master_only and global_master indicate.

michaelbenayoun (Member Author):

So those two parameters are related to distributed training.
Basically, when master_only is True, only the master rank saves the file instead of all ranks, and global_master controls whether only the global master (across multiple nodes) should save, or the master of each node.
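
A minimal sketch of how such a function can look, assuming torch_xla's `xm.is_master_ordinal` semantics; it mirrors the snippet above but is illustrative, not necessarily the PR's exact code:

```python
# A sketch, not necessarily the PR's exact implementation.
import os
from typing import Dict, Optional, Union

import torch
import torch_xla.core.xla_model as xm
from safetensors.torch import save_file


def torch_xla_safe_save_file(
    tensors: Dict[str, torch.Tensor],
    filename: Union[str, os.PathLike],
    metadata: Optional[Dict[str, str]] = None,
    master_only: bool = True,
    global_master: bool = False,
):
    """
    Torch XLA compatible implementation of `safetensors.torch.save_file`.
    """
    # local=False checks the global master (rank 0 across all nodes);
    # local=True checks the master of each node.
    should_write = not master_only or xm.is_master_ordinal(local=not global_master)
    if should_write:
        # XLA tensors live on device; move them to CPU before serialization.
        cpu_tensors = {name: tensor.cpu() for name, tensor in tensors.items()}
        save_file(cpu_tensors, filename, metadata=metadata)
```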

michaelbenayoun (Member Author):

More information here

```diff
-                    save_function=xm.save,
-                )
+                with safe_save_function_patcher:
+                    unwrap_model(self.model).save_pretrained(
```
Collaborator:

  1. Could you explain what you want to avoid/achieve here?
  2. Could you explain what the exact chain of calls will be here?

Collaborator:

To be more specific, why couldn't you just continue with the typical transformers paradigm: pass the correct flag to tag the main process and provide the specific save function? Looking at the patch, it is not clear why it is not redundant.

michaelbenayoun (Member Author), Dec 18, 2023:

So basically the chain of calls here is:

  1. We enter a context manager that patches the transformers.modeling_utils.safe_save_file function (which is safetensors.torch.save_file) to be my torch_xla-compatible version of this function (see the sketch after this list).
  2. The model is unwrapped (not really important here; it's related to the Trainer and how the model is sometimes wrapped for features I am not sure we support).
  3. The model is saved, and since the last Transformers release the checkpoint is saved using safetensors. The issue is that when safetensors is used, the save_function parameter is ignored. That is why we do it this way instead of simply passing torch_xla_safe_save_file as the value of save_function.
  4. We exit the context manager, and everything that was patched is restored to its original value.
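
A minimal sketch of a patcher matching that description, under the hypothetical name `patch_safe_save_file`; the PR's actual `safe_save_function_patcher` may be built differently (e.g. on a generic patching utility):

```python
# A sketch matching the chain of calls described above; `patch_safe_save_file`
# is a hypothetical name, not necessarily the PR's `safe_save_function_patcher`.
from contextlib import contextmanager

import transformers.modeling_utils


@contextmanager
def patch_safe_save_file(replacement):
    """Temporarily replace `transformers.modeling_utils.safe_save_file` (step 1)."""
    original = transformers.modeling_utils.safe_save_file
    transformers.modeling_utils.safe_save_file = replacement
    try:
        yield
    finally:
        # Step 4: restore everything that was patched to its original value.
        transformers.modeling_utils.safe_save_file = original
```

Used as `with patch_safe_save_file(torch_xla_safe_save_file): unwrap_model(self.model).save_pretrained(...)`, the safetensors code path inside save_pretrained ends up calling the XLA-compatible writer even though save_function is ignored there.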

Collaborator:

Thanks for the explanation!

dacorvo (Collaborator) left a comment:

LGTM, thanks!

Collaborator:

Thanks for the explanation!

@michaelbenayoun merged commit c2367ae into main on Dec 18, 2023
7 checks passed
@michaelbenayoun deleted the fix_long_save branch on December 18, 2023 at 16:04