HF dataset initializer v2 fails with KeyError: 'tags' when downloading datasets with no tags #2378

Closed
astefanutti opened this issue Jan 7, 2025 · 1 comment · Fixed by #2379

@astefanutti
Contributor

What happened?

Create a TrainJob with a dataset config, e.g.:

client.train(
    runtime_ref="torch-distributed",
    dataset_config=HuggingFaceDatasetConfig(
        storage_uri="ylecun/mnist",
    ),
...

The dataset init container fails with:

2025-01-07T17:25:20Z INFO     [__main__.py:15] Starting dataset initialization
2025-01-07T17:25:20Z INFO     [huggingface.py:26] Downloading dataset: ylecun/mnist
2025-01-07T17:25:20Z INFO     [huggingface.py:27] ----------------------------------------
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/workspace/pkg/initializer_v2/dataset/__main__.py", line 28, in <module>
    hf.download_dataset()
  File "/workspace/pkg/initializer_v2/dataset/huggingface.py", line 32, in download_dataset
    huggingface_hub.snapshot_download(
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/_snapshot_download.py", line 164, in snapshot_download
    repo_info = api.repo_info(repo_id=repo_id, repo_type=repo_type, revision=revision, token=token)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 2491, in repo_info
    return method(
           ^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 2366, in dataset_info
    return DatasetInfo(**data)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 799, in __init__
    self.tags = kwargs.pop("tags")
                ^^^^^^^^^^^^^^^^^^
KeyError: 'tags'
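
The last frame of the traceback suggests that `tags` is popped from the API response without a default, so a response that lacks the key raises. Below is a minimal sketch of that failure mode and of one tolerant alternative. The classes `StrictInfo` and `TolerantInfo` and the payload shape are hypothetical stand-ins for illustration, not the actual `huggingface_hub` code or the fix that was merged.

```python
# Hypothetical stand-ins for illustration; not the real huggingface_hub classes.


class StrictInfo:
    """Mimics the failing pattern: a key popped without a default."""

    def __init__(self, **kwargs):
        self.id = kwargs.pop("id")
        self.tags = kwargs.pop("tags")  # raises KeyError when "tags" is absent


class TolerantInfo:
    """Same shape, but treats "tags" as optional."""

    def __init__(self, **kwargs):
        self.id = kwargs.pop("id")
        self.tags = kwargs.pop("tags", None)  # falls back to None


# Assumed payload shape: a response for a dataset that carries no tags.
payload = {"id": "ylecun/mnist"}

try:
    StrictInfo(**payload)
    strict_error = None
except KeyError as exc:
    strict_error = exc

tolerant = TolerantInfo(**payload)

print("strict:", repr(strict_error))
print("tolerant tags:", tolerant.tags)
```

With a default supplied to `dict.pop`, the missing key no longer aborts construction, and callers can decide how to treat `None`.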

What did you expect to happen?

Dataset initialization should complete successfully, including for datasets that have no tags.

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.31.1
Kustomize Version: v5.4.2
Server Version: v1.27.11+ec42b99

Training Operator Python SDK version:

$ pip show kubeflow-training
Version: 2.0.0

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

@andreyvelich
Member

/remove-label lifecycle/needs-triage
