Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add initial tiktoken and Phi3SmallTokenizer support #729

Merged
merged 25 commits into from
Aug 2, 2024

Conversation

sayanshaw24
Copy link
Contributor

@sayanshaw24 sayanshaw24 commented May 24, 2024

This PR adds official support in ORT Extensions for the new fast tiktoken format introduced by OpenAI for fast loading for tokenizers, as used in the Phi3SmallTokenizer. Currently, encoding is fully supported and tested. Will add decoder support in a separate PR.

Copy link
Member

@wenbingl wenbingl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the unit test?

@sayanshaw24
Copy link
Contributor Author

Where is the unit test?

the unit test you included for pp_api_tests in your PR here: https://github.com/microsoft/onnxruntime-extensions/pull/726/files#diff-f4495557b563fdd392dbf72a0969d541e8288335dc2cd2e48b44d093de843cbfR121 will use this implementation as I moved the test data files into test/data2 so it will run that test, recognize there is no tokenizer.json and use the TikTokenizer.

it works successfully as all pp_api_tests pass here: https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1398194&view=logs&j=719f0e07-ada6-5167-1399-95b4f6fab546&t=8a37696c-a395-5292-3877-9451da366bed&l=1653

@wenbingl
Copy link
Member

wenbingl commented Jun 4, 2024

Where is the unit test?

the unit test you included for pp_api_tests in your PR here: https://github.com/microsoft/onnxruntime-extensions/pull/726/files#diff-f4495557b563fdd392dbf72a0969d541e8288335dc2cd2e48b44d093de843cbfR121 will use this implementation as I moved the test data files into test/data2 so it will run that test, recognize there is no tokenizer.json and use the TikTokenizer.

it works successfully as all pp_api_tests pass here: https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1398194&view=logs&j=719f0e07-ada6-5167-1399-95b4f6fab546&t=8a37696c-a395-5292-3877-9451da366bed&l=1653

that was testing a converted HF json file, instead of tiktoken file. Please create your own test.

@sayanshaw24
Copy link
Contributor Author

Where is the unit test?

the unit test you included for pp_api_tests in your PR here: https://github.com/microsoft/onnxruntime-extensions/pull/726/files#diff-f4495557b563fdd392dbf72a0969d541e8288335dc2cd2e48b44d093de843cbfR121 will use this implementation as I moved the test data files into test/data2 so it will run that test, recognize there is no tokenizer.json and use the TikTokenizer.
it works successfully as all pp_api_tests pass here: https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1398194&view=logs&j=719f0e07-ada6-5167-1399-95b4f6fab546&t=8a37696c-a395-5292-3877-9451da366bed&l=1653

that was testing a converted HF json file, instead of tiktoken file. Please create your own test.

I thought this one was for the converted HF json file?: https://github.com/microsoft/onnxruntime-extensions/blob/main/test/pp_api_test/test_tokenizer.cc#L93 i remember when you added this and data/tiktoken even has the converted tokenizer.json file and when I asked you mentioned this was the converted HF json file.

This one is different: https://github.com/microsoft/onnxruntime-extensions/blob/main/test/pp_api_test/test_tokenizer.cc#L120 maybe you added it twice accidentally or even for local testing? I had basically the same test locally but did not stage it since you already added it. It's not being run as it stands right now because there is nothing in test/data2 folder, so it is skipped. With this PR, it is being run because I added the necessary .tiktoken and tokenizer_config.json files in test/data2/phi-3-small.

But regardless, I have no issues adding another UT - let me know if you would like me to. Just wanted to clarify about this benign test at the moment since Phi3_S_Tokenizer right now is not run (I thought you added it for this purpose).

Sorry for the confusion.

@sayanshaw24 sayanshaw24 force-pushed the sayanshaw/tiktoken branch from e51b490 to 6456499 Compare June 6, 2024 18:04
@sayanshaw24 sayanshaw24 changed the title Add initial tiktoken support Add initial tiktoken and Phi3SmallTokenizer support Aug 1, 2024
@sayanshaw24 sayanshaw24 marked this pull request as ready for review August 1, 2024 15:21
@sayanshaw24 sayanshaw24 requested a review from a team as a code owner August 1, 2024 15:21
@sayanshaw24 sayanshaw24 merged commit 7851b51 into main Aug 2, 2024
46 checks passed
@sayanshaw24 sayanshaw24 deleted the sayanshaw/tiktoken branch August 2, 2024 17:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants