Improve handling of missing vocab_file attribute in HFTokenizerConverter (#677)

This commit updates `HFTokenizerConverter` to handle cases where the `hf_tokenizer` object might not have a `vocab_file` attribute.

Changes:

* Uses `getattr(hf_tokenizer, "vocab_file", None)` to read the attribute, so a tokenizer without `vocab_file` yields `None` instead of raising `AttributeError`
* Stores the retrieved value in a local variable `vocab_file` for clarity
* Checks whether `vocab_file` is `None` before checking that the path exists

This ensures the converter works correctly even with tokenizers that don't define a `vocab_file` attribute.
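
The behavior described above can be sketched in isolation. This is a minimal illustration, not the converter's actual code: `resolve_model_dir` is a hypothetical helper name, and the `SimpleNamespace` objects are stand-ins for Hugging Face tokenizers that carry only the two attributes this logic reads (`vocab_file` and `name_or_path`).

```python
import os
import tempfile
from types import SimpleNamespace


def resolve_model_dir(hf_tokenizer):
    """Hypothetical helper mirroring the directory lookup in this commit."""
    # getattr with a default returns None when the attribute is absent,
    # where plain attribute access would raise AttributeError.
    vocab_file = getattr(hf_tokenizer, "vocab_file", None)
    if vocab_file is None or not os.path.exists(vocab_file):
        # No usable vocab file: fall back to the tokenizer's name or path.
        return hf_tokenizer.name_or_path
    # A vocab file exists on disk: use the directory that contains it.
    return os.path.dirname(vocab_file)


# Stand-in objects (assumptions for illustration, not real tokenizers).
no_attr = SimpleNamespace(name_or_path="some/model")
none_attr = SimpleNamespace(name_or_path="some/model", vocab_file=None)

print(resolve_model_dir(no_attr))    # attribute missing entirely
print(resolve_model_dir(none_attr))  # attribute present but None

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.json")
    with open(path, "w", encoding="utf-8") as f:
        f.write("{}")
    with_file = SimpleNamespace(name_or_path="some/model", vocab_file=path)
    print(resolve_model_dir(with_file) == tmp)  # directory of the vocab file
```

The first two calls take the fallback branch; before this change, the first case would have failed on the attribute access itself.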
kazssym authored Mar 26, 2024
1 parent 29a4b49 commit 31f129c
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions onnxruntime_extensions/_hf_cvt.py
@@ -43,10 +43,11 @@ def convert_json_vocab(hf_tokenizer):
                 f"{hf_tokenizer.__name__}: vocab_files_names is not found")
 
         tokenizer_file = filenames["tokenizer_file"]
-        if (hf_tokenizer.vocab_file is None) or (not os.path.exists(hf_tokenizer.vocab_file)):
+        vocab_file = getattr(hf_tokenizer, "vocab_file", None)
+        if (vocab_file is None) or (not os.path.exists(vocab_file)):
             model_dir = hf_tokenizer.name_or_path
         else:
-            model_dir = os.path.dirname(hf_tokenizer.vocab_file)
+            model_dir = os.path.dirname(vocab_file)
         tokenizer_json = json.load(
             open(os.path.join(model_dir, tokenizer_file), "r", encoding="utf-8"))
         # get vocab object from json file