-
In the benchmark tutorial, we can see that the hyper-parameter
-
@lxww302 Thanks for bringing this up. The parameter … The capacity of the KV cache is limited, so the input length and the output length together determine how many requests can run at the same time. Based on this, you can choose the number of prompts like this:
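As a rough sketch of that reasoning (not necessarily the exact formula meant in this reply): if the KV cache holds a fixed number of tokens and each request needs about `input_len + output_len` of them, the concurrency bound and a matching prompt count could be estimated as below. The names `kv_cache_tokens` and `rounds` are hypothetical, and the multiplier is an assumption chosen only for illustration.

```python
def max_concurrent_requests(kv_cache_tokens: int, input_len: int, output_len: int) -> int:
    """Rough upper bound on requests whose full sequences fit in the KV cache at once."""
    tokens_per_request = input_len + output_len
    if tokens_per_request <= 0:
        raise ValueError("input_len + output_len must be positive")
    return kv_cache_tokens // tokens_per_request


def suggest_num_prompts(kv_cache_tokens: int, input_len: int, output_len: int,
                        rounds: int = 4) -> int:
    """Pick enough prompts to keep the server full for several 'rounds' (assumed heuristic)."""
    concurrent = max(1, max_concurrent_requests(kv_cache_tokens, input_len, output_len))
    return concurrent * rounds


if __name__ == "__main__":
    # Hypothetical numbers: a cache of 100,000 tokens, 1024 input and 1024 output tokens.
    print(max_concurrent_requests(100_000, 1024, 1024))  # 48
    print(suggest_num_prompts(100_000, 1024, 1024))      # 192
```

The real cache capacity depends on the model, the hardware, and the server's memory settings, so the numbers above are placeholders.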
-
If the request rate exceeds the server's capacity, requests will queue up, so a larger number of prompts leads to higher E2E latency.
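A toy model can illustrate the queueing effect. This is only a sketch under simplifying assumptions: evenly spaced arrivals, and a server treated as a single queue that completes `capacity` requests per second. A real serving engine batches requests, so the absolute numbers are not meaningful, only the trend. All names here are hypothetical.

```python
def mean_e2e_latency(num_prompts: int, request_rate: float, capacity: float) -> float:
    """Mean end-to-end latency (seconds) in a simple single-queue model."""
    service_time = 1.0 / capacity
    prev_finish = 0.0
    total_latency = 0.0
    for i in range(num_prompts):
        arrival = i / request_rate
        # A request starts once it has arrived and the previous one is done.
        start = max(arrival, prev_finish)
        prev_finish = start + service_time
        total_latency += prev_finish - arrival
    return total_latency / num_prompts


if __name__ == "__main__":
    # Request rate above capacity: latency grows with the number of prompts.
    print(mean_e2e_latency(100, request_rate=10, capacity=5))
    print(mean_e2e_latency(1000, request_rate=10, capacity=5))
    # Request rate below capacity: latency stays near the service time.
    print(mean_e2e_latency(1000, request_rate=2, capacity=5))
```

When the rate is above capacity, each additional prompt waits behind a longer backlog, so the mean latency keeps rising with the prompt count. Below capacity, the queue never builds up.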