
openai_trtllm returns 200 directly to the client when TTFT is greater than 15 seconds #53

Open
mynameiskeen opened this issue Aug 22, 2024 · 1 comment

@mynameiskeen

Hi @npuichigo, thanks for this wonderful project. I've run into an issue that has bugged me for a while; please kindly help, thanks a lot.

I am using llmperf to benchmark tritonserver with the tensorrt-llm backend, with openai_trtllm as the OpenAI-compatible proxy. When the benchmark runs under high concurrency (more than about 20 concurrent requests), llmperf fails with this error:

2024-08-12 23:57:47,133 INFO worker.py:1781 -- Started a local Ray instance.
  0%|          | 0/200 [00:00<?, ?it/s](OpenAIChatCompletionsClient pid=1151673) Warning Or Error: Expecting value: line 1 column 1 (char 0)
(OpenAIChatCompletionsClient pid=1151673) -1
Traceback (most recent call last):
  File "/data/inference_benchmark/llmperf/token_benchmark_ray.py", line 462, in <module>
    run_token_benchmark(
  File "/data/inference_benchmark/llmperf/token_benchmark_ray.py", line 303, in run_token_benchmark
    summary, individual_responses = get_token_throughput_latencies(
  File "/data/inference_benchmark/llmperf/token_benchmark_ray.py", line 122, in get_token_throughput_latencies
    request_metrics[common_metrics.REQ_OUTPUT_THROUGHPUT] = num_output_tokens / request_metrics[common_metrics.E2E_LAT]
ZeroDivisionError: division by zero
 20%|██        | 40/200 [00:49<03:16,  1.23s/it]

I added print statements in token_benchmark_ray.py around lines 113/114 to print the response from openai_trtllm:

        if not (iter % num_concurrent_requests):
            outs = req_launcher.get_next_ready()
            all_metrics = []
            for out in outs:
                print("-----------------out is :", out)
                request_metrics, gen_text, _ = out
                print("-----------------Gen text is :", gen_text)
                num_output_tokens = get_token_length(gen_text)
                if num_output_tokens: 

It seems that the failed request came back with an empty response body. Note that end_to_end_latency_s in the metrics below is 0, which is what makes the division at line 122 blow up:

# error_code is "-1"
-----------------out is : ({'error_code': -1, 'error_msg': '', 'inter_token_latency_s': 0, 'ttft_s': 0, 'end_to_end_latency_s': 0, 'request_output_throughput_token_per_s': 0, 'number_total_tokens': 6001, 'number_output_tokens': 1, 'number_input_tokens': 6000}, 

'... which I will keep so chary\nThe ", 6000), sampling_params={'max_tokens': 500}, llm_api='openai', metadata=None))'

# gen_text was empty
-----------------Gen text is : 
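(As an aside, a guard along these lines, placed inside the "for out in outs:" loop shown above, would keep the benchmark itself from crashing on these failed requests. This is only a workaround sketch against the snippet above, using the key names visible in the printed request_metrics dict; it does not address why the requests fail.)

    for out in outs:
        request_metrics, gen_text, _ = out
        num_output_tokens = get_token_length(gen_text)

        # Workaround sketch (not upstream code): skip requests that report
        # error_code -1 or a zero end-to-end latency, so the throughput
        # division at line 122 cannot divide by zero.
        if request_metrics.get("error_code") == -1 or not request_metrics.get("end_to_end_latency_s"):
            continue  # failed request; nothing meaningful to aggregate

        # ... existing metric handling, including the division at line 122 ...
        request_metrics[common_metrics.REQ_OUTPUT_THROUGHPUT] = (
            num_output_tokens / request_metrics[common_metrics.E2E_LAT]
        )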

So I used tcpdump to capture all the requests to openai_trtllm, and found that the failed request received a 200 response after exactly 15 seconds, containing maybe only one or two tokens (I don't know whether they are tokens from tritonserver or not). Meanwhile, tritonserver showed no errors. Please see the screenshots:

[screenshots of the tcpdump capture]
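For reference, a minimal standalone client like the one below can time TTFT and end-to-end latency against openai_trtllm directly, without llmperf, and dump whatever body comes back. The URL, port, and model name are placeholders for this setup, not values taken from the issue.

    import time
    import requests

    # Placeholders: point this at wherever openai_trtllm listens and use the
    # model name the proxy expects.
    URL = "http://localhost:3000/v1/chat/completions"
    payload = {
        "model": "ensemble",
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "max_tokens": 500,
        "stream": True,
    }

    start = time.time()
    first_chunk_at = None
    lines = []

    with requests.post(URL, json=payload, stream=True, timeout=120) as resp:
        print("HTTP status:", resp.status_code)
        for line in resp.iter_lines():
            if not line:
                continue
            if first_chunk_at is None:
                first_chunk_at = time.time()
            lines.append(line.decode("utf-8", errors="replace"))

    print("TTFT (s):", None if first_chunk_at is None else first_chunk_at - start)
    print("End-to-end (s):", time.time() - start)
    print("SSE lines received:", len(lines))
    print("\n".join(lines[:5]))

If the proxy really does return 200 with an empty (or near-empty) body after about 15 seconds, this should show a TTFT of roughly 15 s with zero or very few SSE lines received.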

@npuichigo (Owner)

I'm not sure whether the input tokens have exceeded the model's max token limit. You could also check the postprocessing part in Triton to debug the generated tokens, if possible.
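If it helps, a quick check along these lines compares the prompt length plus max_tokens against the sequence-length budget the TensorRT-LLM engine was built with. The tokenizer name, the budget figure, and the prompt file are placeholders, not values from this setup.

    from transformers import AutoTokenizer

    # Placeholders: use the tokenizer matching the deployed model and the
    # max_input_len / max_seq_len the TRT-LLM engine was actually built with.
    MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
    ENGINE_SEQ_BUDGET = 8192
    REQUESTED_OUTPUT = 500  # max_tokens from the benchmark's sampling_params

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    prompt = open("failing_prompt.txt").read()  # the ~6000-token prompt llmperf sent

    n_input = len(tokenizer.encode(prompt))
    print(f"input tokens: {n_input}, requested output: {REQUESTED_OUTPUT}")
    if n_input + REQUESTED_OUTPUT > ENGINE_SEQ_BUDGET:
        print("prompt + max_tokens exceeds the engine's sequence-length budget")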
