Use tiktoken.el for token counting of OpenAI's models #14

Open

zkry opened this issue Jan 3, 2024 · 6 comments

Comments

zkry commented Jan 3, 2024

Hello!

I noticed that one of the methods for the providers is llm-count-tokens, which currently uses a simple heuristic. I recently wrote a port of tiktoken that could provide exact counts, at least for the OpenAI models. The implementation in llm-openai.el would essentially look like the following:

(require 'tiktoken)
(cl-defmethod llm-count-tokens ((provider llm-openai) text)
  ;; Look up the tiktoken encoding for the provider's chat model and
  ;; use it to count the tokens in TEXT.
  (let ((enc (tiktoken-encoding-for-model (llm-openai-chat-model provider))))
    (tiktoken-count-tokens enc text)))

There are some design questions, such as whether it should use the chat-model or the embedding-model. One option would be to count with the embedding-model if it is set and fall back to the chat-model otherwise, with some sensible default; a rough sketch of that fallback is below.
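A minimal sketch of that fallback, assuming an llm-openai-embedding-model accessor exists alongside llm-openai-chat-model (that accessor name is an assumption, not something confirmed here):

(cl-defmethod llm-count-tokens ((provider llm-openai) text)
  ;; Prefer the embedding model's encoding when one is configured;
  ;; otherwise fall back to the chat model's encoding.
  (let* ((model (or (llm-openai-embedding-model provider) ; assumed accessor
                    (llm-openai-chat-model provider)))
         (enc (tiktoken-encoding-for-model model)))
    (tiktoken-count-tokens enc text)))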

Let me know your thoughts, and I can put up a PR for it along with any other required work.


ahyatt commented Jan 4, 2024

Very interesting, thanks for sharing this! Before we go further, do you have FSF copyright assignment already, or if not, are you willing to get it? Since this is part of GNU ELPA, all contributions must be from those who have assigned copyright to the FSF.


zkry commented Jan 5, 2024

Yeah! I have the FSF copyright paperwork in, so I should be good there.


ahyatt commented Jan 5, 2024

Great. In that case, to use your encoder we could either put your library on ELPA (you would do this via the emacs-devel@ mailing list), which llm could then depend on, or include your encoder in the llm library directly.

What's the difference in accuracy, do you think? Is it worth it to include this code?

And as far as embedding vs. chat: from what I understand, they use the same encoder, cl100k_base, so for OpenAI it shouldn't matter. My library also doesn't make a distinction between tokens for embeddings and chat. Of the two, chat makes the most sense to have token counting for, so it should probably be thought of as providing token counting for chat.
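For example, assuming tiktoken-encoding-for-model recognizes both a chat and an embedding model name (the specific model names below are only illustrative), counting with either encoding should give the same result:

(let ((chat-enc  (tiktoken-encoding-for-model "gpt-3.5-turbo"))
      (embed-enc (tiktoken-encoding-for-model "text-embedding-ada-002"))
      (text "An example sentence to count."))
  ;; Both model names are expected to resolve to cl100k_base,
  ;; so the two counts should be equal.
  (= (tiktoken-count-tokens chat-enc text)
     (tiktoken-count-tokens embed-enc text)))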


zkry commented Jan 5, 2024

What's the difference in accuracy, do you think? Is it worth it to include this code?

Good question. I tested tiktoken against two different heuristics (one of which just divides the number of characters by 4) on a variety of code and text files, and these are the results I got:

[Figure: comparison of tiktoken counts vs. the two heuristics across all test files]

Zoomed in to the lower counts:
[Figure: the same comparison restricted to lower token counts]

And here are only the prose files (the outlier is non-ASCII text):
[Figure: the comparison restricted to prose files]

It looks like both heuristics perform really well for English prose. For code, (/ (buffer-size) 4.0) does seem to perform better.

So if we were to go with (/ (buffer-size) 4.0), here are the percentage differences we would expect to be off by:

[Figure: percentage error of the (/ (buffer-size) 4.0) heuristic per file]

So on average it looks like it would be about 10% off, whereas the current heuristic is about 30% off on average.

So with all that said, I'm not sure how worthwhile it would be. The more advanced the use case, the more an accurate count would be wanted, and non-ASCII text is where the heuristic is furthest off. But calculating the exact count isn't trivial, and (/ (buffer-size) 4.0) gets most of the way there... I'm not sure what would be best.
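For reference, here is the chars/4 heuristic applied to a string rather than a buffer (the function name is hypothetical, not part of llm.el):

(defun my-approx-token-count (text)
  "Estimate the number of tokens in TEXT as one token per four characters."
  (ceiling (length text) 4))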

Let me know if you think including it would be best, and I can either add the code here or put tiktoken.el on ELPA.
Edit: Maybe just adding the code to this repo would make the most sense, as tiktoken.el wouldn't really be useful as a standalone ELPA package.


ahyatt commented Jan 5, 2024

Great analysis, thank you so much for that!

Let's keep this issue open - it might become critical in the future, but there are other things I need to do before I think we'd need this, namely:

  1. get max token counts per provider / operation (in progress)
  2. develop a prompting system that can flexibly get content up to the max tokens, in ways that make sense for different operations. But how precise things need to be is unclear - do we even want to approach the max? There are disadvantages to doing so, since it should (in theory, at least) decrease conversation quality, which also needs those tokens. If we had a rule like trying to get to 2/3 of max, we wouldn't need to be so precise with the token counting (see the sketch after this list).
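A rough sketch of that 2/3-of-max idea, using the four-characters-per-token approximation; the function name and signature are hypothetical, not an llm.el API:

(defun my-trim-to-token-budget (text max-tokens)
  "Truncate TEXT so its estimated token count stays within 2/3 of MAX-TOKENS,
assuming roughly four characters per token."
  (let ((char-budget (* 4 (floor (* 2 max-tokens) 3))))
    (if (<= (length text) char-budget)
        text
      (substring text 0 char-budget))))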

Let's see where things take us. Thanks again for developing this library and reaching out about it.


zkry commented Jan 6, 2024

Sounds good! I agree that those would be best to tackle first.
