Neuron model NEFFs are dependent on the python path #99

Open
dacorvo opened this issue Oct 7, 2024 · 2 comments
dacorvo commented Oct 7, 2024

The same bug that was present in AWS Neuron SDK 2.19 and fixed in 2.19.1 (#91) is back in AWS Neuron SDK 2.20.

With AWS Neuron SDK 2.20, when exporting a model and saving the compiled artifacts, it is impossible to reload them afterwards if the python path is different.

This effectively makes shared serialization and caching impossible, since you cannot control the deployment environment (EC2 with a DLAMI, SageMaker, or ad-hoc end-user endpoints will all have different environments).
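To illustrate the failure mode (a hypothetical sketch, not the actual Neuron SDK code): if the key used to look up a serialized NEFF is derived from data that embeds absolute python paths, an identical model produces different keys in different venvs and the lookup fails.

# Hypothetical sketch of the failure mode, NOT the actual Neuron SDK code:
# if the NEFF lookup key hashes metadata that embeds the interpreter path,
# an identical HLO produces different keys in different venvs.
import hashlib

def neff_lookup_key(hlo_bytes: bytes, embedded_metadata: str) -> str:
    h = hashlib.sha256(hlo_bytes)          # the HLO is identical in both venvs...
    h.update(embedded_metadata.encode())   # ...but the embedded path is not
    return h.hexdigest()

hlo = b"identical HLO module"
key_foo = neff_lookup_key(hlo, "/home/user/foo_venv/bin/python")
key_bar = neff_lookup_key(hlo, "/home/user/bar_venv/bin/python")
assert key_foo != key_bar  # the saved NEFF can no longer be found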

Steps to reproduce

  1. download test_tnx_llama_export.py (a sketch of what such a script might look like is shown after these steps)

  2. export the model in a venv

$ python3 -m venv foo_venv
$ source foo_venv/bin/activate
$ export PIP_EXTRA_INDEX_URL=https://pip.repos.neuron.amazonaws.com
$ python -m pip install -U neuronx-cc torch_neuronx==2.* transformers-neuronx
$ python test_tnx_llama_export.py export meta-llama/Llama-3.1-8B-Instruct --save_dir ./llama-foo
  3. check the generated artifacts and verify the neuron model can be reloaded (no compilation should happen)
$ python test_tnx_llama_export.py run meta-llama/Llama-3.1-8B-Instruct --save_dir ./llama-foo
  4. deactivate the venv and try to reload the model in another venv
$ deactivate
$ python3 -m venv bar_venv
$ source bar_venv/bin/activate
$ export PIP_EXTRA_INDEX_URL=https://pip.repos.neuron.amazonaws.com
$ python -m pip install -U neuronx-cc torch_neuronx==2.* transformers-neuronx
$ python test_tnx_llama_export.py run meta-llama/Llama-3.1-8B-Instruct --save_dir ./llama-foo

You should get the following exception:

FileNotFoundError: Could not find a matching NEFF for your HLO in this directory. Ensure that the model you are trying to load is the same type and has the same parameters as the one you saved or call "save" on this model to reserialize it.
  5. export the model from the new venv
$ python test_tnx_llama_export.py export meta-llama/Llama-3.1-8B-Instruct --save_dir ./llama-bar

Now if you compare the NEFF files in the two save directories, you will see that one of them differs.
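For context, here is a minimal sketch of what an export/reload script like test_tnx_llama_export.py might look like, using the transformers-neuronx save/load serialization API. The model arguments (batch_size, tp_degree, amp) and the argument handling are assumptions, not the actual script:

# sketch_tnx_llama_export.py -- hypothetical sketch, not the actual test script
import sys

from transformers_neuronx import LlamaForSampling

def export(model_id: str, save_dir: str):
    # Compile the model to NEFFs and serialize the compiled artifacts.
    model = LlamaForSampling.from_pretrained(model_id, batch_size=1, tp_degree=2, amp='f16')
    model.to_neuron()
    model.save(save_dir)

def run(model_id: str, save_dir: str):
    # Reload the serialized artifacts: load() must find a NEFF matching the
    # recomputed HLO, otherwise it raises the FileNotFoundError shown above.
    model = LlamaForSampling.from_pretrained(model_id, batch_size=1, tp_degree=2, amp='f16')
    model.load(save_dir)
    model.to_neuron()

if __name__ == '__main__':
    action, model_id, _, save_dir = sys.argv[1:5]  # e.g. export <model> --save_dir <dir>
    if action == 'export':
        export(model_id, save_dir)
    else:
        run(model_id, save_dir)

One simple way to compare the two sets of artifacts is to checksum the NEFF files, e.g.:

$ find ./llama-foo ./llama-bar -name '*.neff' -exec md5sum {} +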


pagezyhf commented Oct 7, 2024

@jeffhataws could you take a look, as you helped with this last time (2.19 to 2.19.1)?

aws-patlange (Contributor) commented

Thank you for reporting this bug. We have reproduced and identified the issue and are working on a fix.
