Neuron model NEFFs are dependent on the python path #99

Open
dacorvo opened this issue Oct 7, 2024 · 2 comments
dacorvo commented Oct 7, 2024

The same bug that was present in AWS Neuron SDK 2.19 and fixed in 2.19.1 (#91) is back in AWS Neuron SDK 2.20.

With AWS Neuron SDK 2.20, when exporting a model and saving the compiled artifacts, it is impossible to reload them afterwards if the python path is different.

This effectively makes shared serialization and caching impossible, since you cannot control the deployment environment (EC2 with a DLAMI, SageMaker, or ad-hoc end-user endpoints will all have different environments).
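To illustrate the failure mode (a hypothetical sketch, not the actual Neuron SDK code): if the key used to look up a serialized NEFF is derived from data that embeds absolute python paths, an identical model produces different keys in different venvs and the lookup fails.

# Hypothetical sketch of the failure mode, NOT the actual Neuron SDK code:
# if the NEFF lookup key hashes metadata that embeds the interpreter path,
# an identical HLO produces different keys in different venvs.
import hashlib

def neff_lookup_key(hlo_bytes: bytes, embedded_metadata: str) -> str:
    h = hashlib.sha256(hlo_bytes)          # the HLO is identical in both venvs...
    h.update(embedded_metadata.encode())   # ...but the embedded path is not
    return h.hexdigest()

hlo = b"identical HLO module"
key_foo = neff_lookup_key(hlo, "/home/user/foo_venv/bin/python")
key_bar = neff_lookup_key(hlo, "/home/user/bar_venv/bin/python")
assert key_foo != key_bar  # the saved NEFF can no longer be found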

Steps to reproduce

  1. download test_tnx_llama_export.py (a sketch of what such a script might look like is shown after these steps)

  2. export the model in a venv

$ python3 -m venv foo_venv
$ source foo_venv/bin/activate
$ export PIP_EXTRA_INDEX_URL=https://pip.repos.neuron.amazonaws.com
$ python -m pip install -U neuronx-cc torch_neuronx==2.* transformers-neuronx
$ python test_tnx_llama_export.py export meta-llama/Llama-3.1-8B-Instruct --save_dir ./llama-foo
  3. check the generated artifacts and verify the neuron model can be reloaded (no compilation should happen)
$ python test_tnx_llama_export.py run meta-llama/Llama-3.1-8B-Instruct --save_dir ./llama-foo
  4. deactivate the venv and try to reload the model in another venv
$ deactivate
$ python3 -m venv bar_venv
$ source bar_venv/bin/activate
$ export PIP_EXTRA_INDEX_URL=https://pip.repos.neuron.amazonaws.com
$ python -m pip install -U neuronx-cc torch_neuronx==2.* transformers-neuronx
$ python test_tnx_llama_export.py run meta-llama/Llama-3.1-8B-Instruct --save_dir ./llama-foo

You should get the following exception:

FileNotFoundError: Could not find a matching NEFF for your HLO in this directory. Ensure that the model you are trying to load is the same type and has the same parameters as the one you saved or call "save" on this model to reserialize it.
  5. export the model from the new venv
$ python test_tnx_llama_export.py export meta-llama/Llama-3.1-8B-Instruct --save_dir ./llama-bar

Now if you compare the NEFF files in the two save directories, you will see that one of them differs.
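For context, here is a minimal sketch of what an export/reload script like test_tnx_llama_export.py might look like, using the transformers-neuronx save/load serialization API. The model arguments (batch_size, tp_degree, amp) and the argument handling are assumptions, not the actual script:

# sketch_tnx_llama_export.py -- hypothetical sketch, not the actual test script
import sys

from transformers_neuronx import LlamaForSampling

def export(model_id: str, save_dir: str):
    # Compile the model to NEFFs and serialize the compiled artifacts.
    model = LlamaForSampling.from_pretrained(model_id, batch_size=1, tp_degree=2, amp='f16')
    model.to_neuron()
    model.save(save_dir)

def run(model_id: str, save_dir: str):
    # Reload the serialized artifacts: load() must find a NEFF matching the
    # recomputed HLO, otherwise it raises the FileNotFoundError shown above.
    model = LlamaForSampling.from_pretrained(model_id, batch_size=1, tp_degree=2, amp='f16')
    model.load(save_dir)
    model.to_neuron()

if __name__ == '__main__':
    action, model_id, _, save_dir = sys.argv[1:5]  # e.g. export <model> --save_dir <dir>
    if action == 'export':
        export(model_id, save_dir)
    else:
        run(model_id, save_dir)

One simple way to compare the two sets of artifacts is to checksum the NEFF files, e.g.:

$ find ./llama-foo ./llama-bar -name '*.neff' -exec md5sum {} +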


pagezyhf commented Oct 7, 2024

@jeffhataws could you take a look, as you helped with this last time (2.19 to 2.19.1)?

aws-patlange (Contributor) commented

Thank you for reporting this bug. We have reproduced and identified the issue and are working on a fix.
