[HOWTO] how to set --num-prompts when benchmarking #816

lxww302 · 2024-07-29T22:10:15Z


In the benchmark tutorial, the hyper-parameter --num-prompts changes along with --random-input, --random-output and --request-rate, and the results differ a lot when --num-prompts changes. For example, if I increase --num-prompts from 300 to 3200, the Median E2E Latency goes up from 22682.12 ms to 105282.00 ms. Is there any intuition for how to choose this value?

Answered by Ying1123

Jul 29, 2024

If the request rate exceeds the server's capacity, requests queue up. A larger number of prompts then means a longer queue and therefore higher E2E latency.


hnyls2002 · 2024-07-29T22:40:26Z

Maintainer

@lxww302 Thanks for bringing this up. The right value for --num-prompts depends on several factors.

The capacity of the KV cache is limited, so the input length and the output length together determine how many requests can run at the same time. Based on that capacity, you can choose the number of prompts like this:

If you want to test an offline case, the number of prompts should be much larger than the number of requests your machine can run at the same time, as described above.
If you want to test an online case, you have to make sure the server spends most of the run in its normal serving state.

6 replies

hnyls2002 Jul 31, 2024
Maintainer

@lxww302 No, a larger --num-prompts does not necessarily mean higher throughput. Your result is expected. As for why the throughput decreases, it may be related to several factors, including the distribution of the random data, warm-up/cool-down variance, and scheduling overhead.

hnyls2002 Jul 31, 2024
Maintainer

@lxww302 You can try with --disable-radix-cache when launching the server.

lxww302 Jul 31, 2024
Author

@hnyls2002 I still cannot understand why throughput gets lower as the number of prompts gets larger. A larger number of prompts should batch better until it reaches the performance limit, so shouldn't throughput increase with num_prompts until it flattens?

hnyls2002 Jul 31, 2024
Maintainer

@lxww302 Have you disabled the radix cache?

lxww302 Jul 31, 2024
Author

@hnyls2002 Yes. The command I used to start the server is python3 -m sglang.launch_server --model-path mistralai/Mistral-7B-Instruct-v0.3 --disable-radix-cache, and the SGLang version is 0.2.7.

Ying1123 · 2024-07-29T22:41:20Z

Maintainer

If the request rate exceeds the server's capacity, requests queue up. A larger number of prompts then means a longer queue and therefore higher E2E latency.
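
The queueing effect can be seen in a toy model. The sketch below is not how SGLang schedules requests: it assumes a single first-come-first-served server that finishes `service_rate` requests per second, with requests arriving at a fixed `request_rate`. Both names and all numbers are hypothetical. Even so, it shows the pattern from the question: once the arrival rate exceeds capacity, median E2E latency keeps growing with the number of prompts.

```python
import statistics


def median_e2e_latency(num_prompts: int, request_rate: float, service_rate: float) -> float:
    """Median end-to-end latency (seconds) in a toy single-server FIFO queue.

    Requests arrive every 1/request_rate seconds and each takes
    1/service_rate seconds to serve. Real servers batch requests,
    so treat this only as a qualitative picture.
    """
    finish = 0.0
    latencies = []
    for i in range(num_prompts):
        arrival = i / request_rate
        start = max(arrival, finish)         # wait if the server is still busy
        finish = start + 1.0 / service_rate  # time this request completes
        latencies.append(finish - arrival)   # queueing delay + service time
    return statistics.median(latencies)


# Overloaded: arrivals (16/s) exceed capacity (8/s), so the queue keeps growing.
print(median_e2e_latency(300, request_rate=16, service_rate=8))
print(median_e2e_latency(3200, request_rate=16, service_rate=8))

# Under capacity: latency does not depend on the number of prompts.
print(median_e2e_latency(300, request_rate=4, service_rate=8))
print(median_e2e_latency(3200, request_rate=4, service_rate=8))
```

In the overloaded case each new request waits behind a longer backlog, so the median rises roughly in proportion to the number of prompts. Below capacity the backlog never forms and the median stays flat.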

0 replies
