Bug: Special Token Mapping Incorrect #1293

Open
farris opened this issue Mar 3, 2025 · 2 comments
farris commented Mar 3, 2025

Describe the bug
The tokenizer_stream.decode API produces an incorrect mapping when special tokens are included in the model's tokenizer. The emitted token IDs are correct (confirmed by checking directly against the torch API and computing argmax(softmax(logits))), but the mapping from each token ID to the final string is wrong.
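A quick way to check the mapping in isolation, outside of generation, is to decode each registered special token through both tokenizers and compare. This is a minimal sketch, not code from the report: the paths are placeholders, and get_added_vocab() is just a convenient way to enumerate the special tokens registered with the HF tokenizer.

import onnxruntime_genai as og
from transformers import AutoTokenizer

model = og.Model("/<path_to_onnx_model>")  # placeholder path
tokenizer_og = og.Tokenizer(model)
hf_tok = AutoTokenizer.from_pretrained("/<path_to_model_with_special_tokens>")

for text, token_id in sorted(hf_tok.get_added_vocab().items(), key=lambda kv: kv[1]):
    stream = tokenizer_og.create_stream()  # fresh stream per token: decoding is stateful
    oga_text = stream.decode(token_id)
    hf_text = hf_tok.decode([token_id])
    print(f"{token_id:>6} HF={hf_text!r} OGA={oga_text!r} match={oga_text == hf_text}")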

To Reproduce
Steps to reproduce the behavior:

import onnxruntime_genai as og
import torch
import torch.nn.functional as F
from tqdm import tqdm
from transformers import AutoTokenizer

def infer_onnx(target_model_path, user_queries, model_name):

    print("-" * 30)
    print("onnx eval")
    print("-" * 30)

    model = og.Model(target_model_path)
    tokenizer_og = og.Tokenizer(model)
    tokenizer_stream = tokenizer_og.create_stream()

    # HF tokenizer with the model's special tokens registered
    tokenizer = AutoTokenizer.from_pretrained("/<path_to_model_with_special_tokens>/phi3.5-mini/safetensors")

    search_options = {}
    function_store_target = []
    arg_store_target = []
    try:
        for query in tqdm(user_queries):
            prompt = format_query(query, model_name)  # format_query is defined elsewhere

            messages = [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": query},
            ]

            input_tokens = tokenizer.apply_chat_template(messages, tokenize=True)

            params = og.GeneratorParams(model)
            params.set_search_options(**search_options)
            generator = og.Generator(model, params)
            generator.append_tokens(input_tokens)

            response = ""
            while not generator.is_done():
                generator.generate_next_token()

                # cross-check the emitted token ID against torch's argmax(softmax(logits))
                logits = torch.from_numpy(generator.get_output("logits"))[:, -1, :]
                logits_scaled = F.softmax(logits, dim=-1)
                idx_torch = torch.argmax(logits_scaled)
                new_token_og = generator.get_next_tokens()[0]
                print(idx_torch == new_token_og)  # the IDs always match

                # ...but the string decoded for that ID is wrong
                response += tokenizer_stream.decode(new_token_og)

            function, args_ = extract_functions_and_args(response)  # defined elsewhere
            function_store_target.append(function)
            arg_store_target.append(args_)

    except KeyboardInterrupt:
        print("  --control+c pressed, aborting generation--")
        print()
        del generator

    return function_store_target, arg_store_target

Expected behavior
The token IDs from torch and onnxruntime-genai always match, yet the decoded output text is completely wrong.
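When the IDs match but the text diverges, it can help to narrow down where the divergence enters: the one-shot Tokenizer.decode, the incremental stream decode, or both. A hypothetical diagnostic along these lines (paths are placeholders, using the same onnxruntime-genai APIs as the repro):

import onnxruntime_genai as og
from transformers import AutoTokenizer

def compare_decodes(model_path, hf_path, token_ids):
    og_tok = og.Tokenizer(og.Model(model_path))
    stream = og_tok.create_stream()

    batch_text = og_tok.decode(token_ids)                       # one-shot decode
    stream_text = "".join(stream.decode(t) for t in token_ids)  # incremental decode
    hf_text = AutoTokenizer.from_pretrained(hf_path).decode(token_ids)

    print("batch  == HF:", batch_text == hf_text)
    print("stream == HF:", stream_text == hf_text)
    return batch_text, stream_text, hf_text

If only the streaming result disagrees, the problem is likely in the stream's handling of special tokens rather than in the vocabulary table itself.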


Desktop:

  • OS: macOS (Apple M3)
  • Version: OGA (onnxruntime-genai) nightly build
wenbingl (Member) commented Mar 4, 2025

@farris, can you share the tokenizer.json and tokenizer_config.json files with us for investigation?
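For anyone following along: the ID-to-string table in question lives in the added_tokens section of tokenizer.json, which can be dumped with a few lines (path is a placeholder):

import json

with open("/<path_to_model_with_special_tokens>/tokenizer.json") as f:
    tok = json.load(f)

# each entry maps a token ID to its string form and a "special" flag
for entry in tok.get("added_tokens", []):
    print(entry["id"], repr(entry["content"]), "special:", entry.get("special"))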

wenbingl (Member) commented Mar 7, 2025
