Hi @npuichigo, thanks for this wonderful project. I am having an issue that has bugged me for a while; please kindly help, thanks a lot.
I am using llmperf to benchmark tritonserver with the tensorrt-llm backend, with openai_trtllm as the OpenAI-compatible proxy. When the benchmark runs under high concurrency (e.g. > 20 concurrent requests), llmperf fails with this error:
2024-08-12 23:57:47,133 INFO worker.py:1781 -- Started a local Ray instance.
0%|| 0/200 [00:00<?, ?it/s](OpenAIChatCompletionsClient pid=1151673) Warning Or Error: Expecting value: line 1 column 1 (char 0)
(OpenAIChatCompletionsClient pid=1151673) -1
Traceback (most recent call last):
  File "/data/inference_benchmark/llmperf/token_benchmark_ray.py", line 462, in <module>
    run_token_benchmark(
  File "/data/inference_benchmark/llmperf/token_benchmark_ray.py", line 303, in run_token_benchmark
    summary, individual_responses = get_token_throughput_latencies(
  File "/data/inference_benchmark/llmperf/token_benchmark_ray.py", line 122, in get_token_throughput_latencies
    request_metrics[common_metrics.REQ_OUTPUT_THROUGHPUT] = num_output_tokens / request_metrics[common_metrics.E2E_LAT]
ZeroDivisionError: division by zero
20%|██ | 40/200 [00:49<03:16, 1.23s/it]
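For what it's worth, the crash itself comes from dividing by the end-to-end latency, which is zero for the failed request (the `error_code: -1` entry shown further down). A minimal guard one could drop around that computation looks roughly like this; the dictionary keys mirror the metrics llmperf prints, but treat the exact names as my assumption:

```python
def output_throughput(request_metrics: dict, num_output_tokens: int) -> float:
    """Tokens/s for a single request; returns 0.0 instead of raising
    ZeroDivisionError when the request failed and its latency is zero."""
    e2e_latency = request_metrics.get("end_to_end_latency_s", 0)
    if request_metrics.get("error_code") == -1 or e2e_latency <= 0:
        return 0.0
    return num_output_tokens / e2e_latency


# The failed request shown below comes back with zero latency and error_code -1:
failed = {"error_code": -1, "end_to_end_latency_s": 0}
print(output_throughput(failed, 1))  # 0.0 instead of a crash
```

Of course this only hides the symptom; the real question is why the proxy returns empty bodies in the first place.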
I added prints in token_benchmark_ray.py around code lines 113/114 to show the response from openai_trtllm:
if not (iter % num_concurrent_requests):
    outs = req_launcher.get_next_ready()
    all_metrics = []
    for out in outs:
        print("-----------------out is :", out)
        request_metrics, gen_text, _ = out
        print("-----------------Gen text is :", gen_text)
        num_output_tokens = get_token_length(gen_text)
        if num_output_tokens:
It seems that the failed request returned an empty response body:
# error_code is "-1"
-----------------out is : ({'error_code': -1, 'error_msg': '', 'inter_token_latency_s': 0, 'ttft_s': 0, 'end_to_end_latency_s': 0, 'request_output_throughput_token_per_s': 0, 'number_total_tokens': 6001, 'number_output_tokens': 1, 'number_input_tokens': 6000},
'... which I will keep so chary\nThe ", 6000), sampling_params={'max_tokens': 500}, llm_api='openai', metadata=None))
# gen_text was empty
-----------------Gen text is :
So I used tcpdump to capture all the requests to openai_trtllm, and found that the failed request got a 200 response after exactly 15 seconds with maybe only one or two tokens (I don't know whether they are tokens from tritonserver or not), while tritonserver itself reported no errors. Please see the screenshots.
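In case it helps reproduce this without llmperf, a small script along these lines can hammer the proxy directly (the base URL, model name, and prompt are placeholders for my setup; openai_trtllm just needs to expose the usual /v1/chat/completions route) and watch for 200 responses that arrive after ~15 s with an empty body:

```python
import concurrent.futures
import time

import requests

# Placeholders: point these at your openai_trtllm instance and Triton model.
BASE_URL = "http://localhost:3000/v1/chat/completions"
MODEL = "ensemble"
PROMPT = "word " * 6000  # roughly mimics the ~6000-token prompts llmperf sends


def probe(i: int):
    start = time.time()
    resp = requests.post(
        BASE_URL,
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": PROMPT}],
            "max_tokens": 500,
        },
        timeout=120,
    )
    elapsed = round(time.time() - start, 1)
    try:
        content = resp.json()["choices"][0]["message"]["content"] or ""
    except (ValueError, KeyError, IndexError):  # empty or non-JSON body
        content = ""
    return i, resp.status_code, elapsed, len(resp.content), len(content)


# Mimic the concurrency where llmperf starts failing and look for 200 responses
# that come back after ~15 s with an empty or near-empty body.
with concurrent.futures.ThreadPoolExecutor(max_workers=30) as pool:
    for i, status, elapsed, body_bytes, n_chars in pool.map(probe, range(30)):
        print(f"req {i}: status={status} elapsed={elapsed}s "
              f"body_bytes={body_bytes} content_chars={n_chars}")
```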
I'm not sure if the input tokens have exceeded the max tokens. You can also check the postprocessing part in triton to debug the generated tokens if possible.
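For example, a quick check like the sketch below would show whether the prompts (plus max_tokens=500) fit within the engine's build limits; the tokenizer path and the limit values are placeholders for whatever your TensorRT-LLM engine was built with:

```python
from transformers import AutoTokenizer

# Placeholders: use the tokenizer your Triton preprocessing model uses and the
# limits your TensorRT-LLM engine was built with (max_input_len / max_seq_len).
tokenizer = AutoTokenizer.from_pretrained("/path/to/tokenizer")
MAX_INPUT_LEN = 6144
MAX_SEQ_LEN = 8192
MAX_OUTPUT_TOKENS = 500  # llmperf's sampling_params in the failing run

prompt = open("failed_prompt.txt").read()  # prompt of one failing request
n_input = len(tokenizer.encode(prompt))
print(f"input tokens: {n_input}, input + output budget: {n_input + MAX_OUTPUT_TOKENS}")
if n_input > MAX_INPUT_LEN or n_input + MAX_OUTPUT_TOKENS > MAX_SEQ_LEN:
    print("request exceeds the engine limits; this could explain empty generations")
```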