-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add initial tiktoken and Phi3SmallTokenizer support #729
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is the unit test?
the unit test you included for pp_api_tests in your PR here: https://github.com/microsoft/onnxruntime-extensions/pull/726/files#diff-f4495557b563fdd392dbf72a0969d541e8288335dc2cd2e48b44d093de843cbfR121 will use this implementation as I moved the test data files into test/data2 so it will run that test, recognize there is no tokenizer.json and use the TikTokenizer. it works successfully as all pp_api_tests pass here: https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1398194&view=logs&j=719f0e07-ada6-5167-1399-95b4f6fab546&t=8a37696c-a395-5292-3877-9451da366bed&l=1653 |
that was testing a converted HF json file, instead of tiktoken file. Please create your own test. |
I thought this one was for the converted HF json file?: https://github.com/microsoft/onnxruntime-extensions/blob/main/test/pp_api_test/test_tokenizer.cc#L93 i remember when you added this and data/tiktoken even has the converted tokenizer.json file and when I asked you mentioned this was the converted HF json file. This one is different: https://github.com/microsoft/onnxruntime-extensions/blob/main/test/pp_api_test/test_tokenizer.cc#L120 maybe you added it twice accidentally or even for local testing? I had basically the same test locally but did not stage it since you already added it. It's not being run as it stands right now because there is nothing in test/data2 folder, so it is skipped. With this PR, it is being run because I added the necessary .tiktoken and tokenizer_config.json files in test/data2/phi-3-small. But regardless, I have no issues adding another UT - let me know if you would like me to. Just wanted to clarify about this benign test at the moment since Phi3_S_Tokenizer right now is not run (I thought you added it for this purpose). Sorry for the confusion. |
…sions into sayanshaw/tiktoken
e51b490
to
6456499
Compare
…sions into sayanshaw/tiktoken
…xruntime-extensions into sayanshaw/tiktoken
…sions into sayanshaw/tiktoken
…sions into sayanshaw/tiktoken
This PR adds official support in ORT Extensions for the new fast tiktoken format introduced by OpenAI for fast loading for tokenizers, as used in the Phi3SmallTokenizer. Currently, encoding is fully supported and tested. Will add decoder support in a separate PR.