Use tiktoken.el for token counting of OpenAI's models #14

Open

zkry opened this issue Jan 3, 2024 · 6 comments

Comments

zkry commented Jan 3, 2024

Hello!

I noticed that one of the methods for the providers is llm-count-tokens, which currently uses a simple heuristic. I recently wrote a port of tiktoken that could provide exact counts, at least for the OpenAI models. The implementation in llm-openai.el would essentially look like the following:

(require 'tiktoken)
(cl-defmethod llm-count-tokens ((provider llm-openai) text)
  ;; Look up the tiktoken encoding for the provider's chat model and
  ;; use it to count the tokens in TEXT.
  (let ((enc (tiktoken-encoding-for-model (llm-openai-chat-model provider))))
    (tiktoken-count-tokens enc text)))

There are some design questions, such as whether it should use the chat-model or the embedding-model. One option would be to count with the embedding-model if it is set and fall back to the chat-model otherwise, with some sensible default; a rough sketch of that fallback is below.
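A minimal sketch of that fallback, assuming an llm-openai-embedding-model accessor exists alongside llm-openai-chat-model (that accessor name is an assumption, not something confirmed here):

(cl-defmethod llm-count-tokens ((provider llm-openai) text)
  ;; Prefer the embedding model's encoding when one is configured;
  ;; otherwise fall back to the chat model's encoding.
  (let* ((model (or (llm-openai-embedding-model provider) ; assumed accessor
                    (llm-openai-chat-model provider)))
         (enc (tiktoken-encoding-for-model model)))
    (tiktoken-count-tokens enc text)))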

Let me know your thoughts, and I can put up a PR for it along with any other required work.


ahyatt commented Jan 4, 2024

Very interesting, thanks for sharing this! Before we go further, do you have FSF copyright assignment already, or if not, are you willing to get it? Since this is part of GNU ELPA, all contributions must be from those who have assigned copyright to the FSF.


zkry commented Jan 5, 2024

Yeah! I have the FSF copyright paperwork in, so I should be good there.


ahyatt commented Jan 5, 2024

Great. In that case, to use your encoder we could either put your library on ELPA (you would do this via the emacs-devel@ mailing list), which llm could then depend on, or include your encoder in the llm library directly.

What's the difference in accuracy, do you think? Is it worth it to include this code?

And as far as embedding vs. chat: from what I understand, they use the same encoder, cl100k_base, so for OpenAI it shouldn't matter. My library also doesn't make a distinction between tokens for embeddings and chat. Of the two, chat makes the most sense to have token counting for, so it should probably be thought of as providing token counting for chat.
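For example, assuming tiktoken-encoding-for-model recognizes both a chat and an embedding model name (the specific model names below are only illustrative), counting with either encoding should give the same result:

(let ((chat-enc  (tiktoken-encoding-for-model "gpt-3.5-turbo"))
      (embed-enc (tiktoken-encoding-for-model "text-embedding-ada-002"))
      (text "An example sentence to count."))
  ;; Both model names are expected to resolve to cl100k_base,
  ;; so the two counts should be equal.
  (= (tiktoken-count-tokens chat-enc text)
     (tiktoken-count-tokens embed-enc text)))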


zkry commented Jan 5, 2024

What's the difference in accuracy, do you think? Is it worth it to include this code?

Good question. I tested tiktoken against two different heuristics (one of which just divides the number of characters by 4) on a variety of code and text files, and these are the results I got:

[Figure: comparison of tiktoken counts vs. the two heuristics across all test files]

Zoomed in to the lower counts:
[Figure: the same comparison restricted to lower token counts]

And here are only the prose files (the outlier is non-ASCII text):
[Figure: the comparison restricted to prose files]

It looks like both heuristics perform really well for English prose. For code, (/ (buffer-size) 4.0) does seem to perform better.

So if we were to go with (/ (buffer-size) 4.0), here are the percentage differences we would expect to be off by:

[Figure: percentage error of the (/ (buffer-size) 4.0) heuristic per file]

So on average it looks like it would be about 10% off, whereas the current heuristic is about 30% off on average.

So with all that said, I'm not sure how worthwhile it would be. The more advanced the use case, the more an accurate count would be wanted, and non-ASCII text is where the heuristic is furthest off. But calculating the exact count isn't trivial, and (/ (buffer-size) 4.0) gets most of the way there... I'm not sure what would be best.
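For reference, here is the chars/4 heuristic applied to a string rather than a buffer (the function name is hypothetical, not part of llm.el):

(defun my-approx-token-count (text)
  "Estimate the number of tokens in TEXT as one token per four characters."
  (ceiling (length text) 4))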

Let me know if you think including it would be best, and I can either add the code here or put tiktoken.el on ELPA.
Edit: Maybe just adding the code to this repo would make the most sense, as tiktoken.el wouldn't really be useful as a standalone ELPA package.


ahyatt commented Jan 5, 2024

Great analysis, thank you so much for that!

Let's keep this issue open - it might become critical in the future, but there are other things I need to do before I think we'd need this, namely:

  1. get max token counts per provider / operation (in progress)
  2. develop a prompting system that can flexibly get content up to the max tokens, in ways that make sense for different operations. But how precise things need to be is unclear - do we even want to approach the max? There are disadvantages to doing so, since it should (in theory, at least) decrease conversation quality, which also needs those tokens. If we had a rule like trying to get to 2/3 of max, we wouldn't need to be so precise with the token counting (see the sketch after this list).
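A rough sketch of that 2/3-of-max idea, using the four-characters-per-token approximation; the function name and signature are hypothetical, not an llm.el API:

(defun my-trim-to-token-budget (text max-tokens)
  "Truncate TEXT so its estimated token count stays within 2/3 of MAX-TOKENS,
assuming roughly four characters per token."
  (let ((char-budget (* 4 (floor (* 2 max-tokens) 3))))
    (if (<= (length text) char-budget)
        text
      (substring text 0 char-budget))))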

Let's see where things take us. Thanks again for developing this library and reaching out about it.


zkry commented Jan 6, 2024

Sounds good! I agree that those would be best to tackle first.
