Bug: Special Token Mapping Incorrect #1293

Open
farris opened this issue Mar 3, 2025 · 2 comments
farris commented Mar 3, 2025

Describe the bug
The tokenizer_stream.decode API produces an incorrect mapping when special tokens are included in the model's tokenizer. The emitted token IDs are correct (confirmed by checking directly against the torch API and computing argmax(softmax(logits))), but the mapping from each token ID to the final string is wrong.
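A quick way to check the mapping in isolation, outside of generation, is to decode each registered special token through both tokenizers and compare. This is a minimal sketch, not code from the report: the paths are placeholders, and get_added_vocab() is just a convenient way to enumerate the special tokens registered with the HF tokenizer.

import onnxruntime_genai as og
from transformers import AutoTokenizer

model = og.Model("/<path_to_onnx_model>")  # placeholder path
tokenizer_og = og.Tokenizer(model)
hf_tok = AutoTokenizer.from_pretrained("/<path_to_model_with_special_tokens>")

for text, token_id in sorted(hf_tok.get_added_vocab().items(), key=lambda kv: kv[1]):
    stream = tokenizer_og.create_stream()  # fresh stream per token: decoding is stateful
    oga_text = stream.decode(token_id)
    hf_text = hf_tok.decode([token_id])
    print(f"{token_id:>6} HF={hf_text!r} OGA={oga_text!r} match={oga_text == hf_text}")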

To Reproduce
Steps to reproduce the behavior:

import onnxruntime_genai as og
import torch
import torch.nn.functional as F
from tqdm import tqdm
from transformers import AutoTokenizer

def infer_onnx(target_model_path, user_queries, model_name):

    print("-" * 30)
    print("onnx eval")
    print("-" * 30)

    model = og.Model(target_model_path)
    tokenizer_og = og.Tokenizer(model)
    tokenizer_stream = tokenizer_og.create_stream()

    # HF tokenizer with the model's special tokens registered
    tokenizer = AutoTokenizer.from_pretrained("/<path_to_model_with_special_tokens>/phi3.5-mini/safetensors")

    search_options = {}
    function_store_target = []
    arg_store_target = []
    try:
        for query in tqdm(user_queries):
            prompt = format_query(query, model_name)  # format_query is defined elsewhere

            messages = [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": query},
            ]

            input_tokens = tokenizer.apply_chat_template(messages, tokenize=True)

            params = og.GeneratorParams(model)
            params.set_search_options(**search_options)
            generator = og.Generator(model, params)
            generator.append_tokens(input_tokens)

            response = ""
            while not generator.is_done():
                generator.generate_next_token()

                # cross-check the emitted token ID against torch's argmax(softmax(logits))
                logits = torch.from_numpy(generator.get_output("logits"))[:, -1, :]
                logits_scaled = F.softmax(logits, dim=-1)
                idx_torch = torch.argmax(logits_scaled)
                new_token_og = generator.get_next_tokens()[0]
                print(idx_torch == new_token_og)  # the IDs always match

                # ...but the string decoded for that ID is wrong
                response += tokenizer_stream.decode(new_token_og)

            function, args_ = extract_functions_and_args(response)  # defined elsewhere
            function_store_target.append(function)
            arg_store_target.append(args_)

    except KeyboardInterrupt:
        print("  --control+c pressed, aborting generation--")
        print()
        del generator

    return function_store_target, arg_store_target

Expected behavior
The token IDs from torch and onnxruntime-genai always match, yet the decoded output text is completely wrong.
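When the IDs match but the text diverges, it can help to narrow down where the divergence enters: the one-shot Tokenizer.decode, the incremental stream decode, or both. A hypothetical diagnostic along these lines (paths are placeholders, using the same onnxruntime-genai APIs as the repro):

import onnxruntime_genai as og
from transformers import AutoTokenizer

def compare_decodes(model_path, hf_path, token_ids):
    og_tok = og.Tokenizer(og.Model(model_path))
    stream = og_tok.create_stream()

    batch_text = og_tok.decode(token_ids)                       # one-shot decode
    stream_text = "".join(stream.decode(t) for t in token_ids)  # incremental decode
    hf_text = AutoTokenizer.from_pretrained(hf_path).decode(token_ids)

    print("batch  == HF:", batch_text == hf_text)
    print("stream == HF:", stream_text == hf_text)
    return batch_text, stream_text, hf_text

If only the streaming result disagrees, the problem is likely in the stream's handling of special tokens rather than in the vocabulary table itself.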


Desktop:

  • OS: macOS (Apple M3)
  • Version: OGA (onnxruntime-genai) nightly build
wenbingl (Member) commented Mar 4, 2025

@farris, can you share the tokenizer.json and tokenizer_config.json files with us for investigation?
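For anyone following along: the ID-to-string table in question lives in the added_tokens section of tokenizer.json, which can be dumped with a few lines (path is a placeholder):

import json

with open("/<path_to_model_with_special_tokens>/tokenizer.json") as f:
    tok = json.load(f)

# each entry maps a token ID to its string form and a "special" flag
for entry in tok.get("added_tokens", []):
    print(entry["id"], repr(entry["content"]), "special:", entry.get("special"))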

wenbingl (Member) commented Mar 7, 2025
