
openai_trtllm returns 200 directly to the client when TTFT is greater than 15 seconds #53

Open
mynameiskeen opened this issue Aug 22, 2024 · 1 comment

@mynameiskeen

Hi @npuichigo, thanks for this wonderful project. I've run into an issue that has bugged me for a while; please kindly help, thanks a lot.

I am using llmperf to benchmark tritonserver with the tensorrt-llm backend, with openai_trtllm as the OpenAI-compatible proxy. When the benchmark runs under high concurrency (more than about 20 concurrent requests), llmperf fails with this error:

2024-08-12 23:57:47,133 INFO worker.py:1781 -- Started a local Ray instance.
  0%|          | 0/200 [00:00<?, ?it/s](OpenAIChatCompletionsClient pid=1151673) Warning Or Error: Expecting value: line 1 column 1 (char 0)
(OpenAIChatCompletionsClient pid=1151673) -1
Traceback (most recent call last):
  File "/data/inference_benchmark/llmperf/token_benchmark_ray.py", line 462, in <module>
    run_token_benchmark(
  File "/data/inference_benchmark/llmperf/token_benchmark_ray.py", line 303, in run_token_benchmark
    summary, individual_responses = get_token_throughput_latencies(
  File "/data/inference_benchmark/llmperf/token_benchmark_ray.py", line 122, in get_token_throughput_latencies
    request_metrics[common_metrics.REQ_OUTPUT_THROUGHPUT] = num_output_tokens / request_metrics[common_metrics.E2E_LAT]
ZeroDivisionError: division by zero
 20%|██        | 40/200 [00:49<03:16,  1.23s/it]

I added print statements in token_benchmark_ray.py around lines 113/114 to print the response from openai_trtllm:

        if not (iter % num_concurrent_requests):
            outs = req_launcher.get_next_ready()
            all_metrics = []
            for out in outs:
                print("-----------------out is :", out)
                request_metrics, gen_text, _ = out
                print("-----------------Gen text is :", gen_text)
                num_output_tokens = get_token_length(gen_text)
                if num_output_tokens: 

It seems that the failed request came back with an empty response body. Note that end_to_end_latency_s in the metrics below is 0, which is what makes the division at line 122 blow up:

# error_code is "-1"
-----------------out is : ({'error_code': -1, 'error_msg': '', 'inter_token_latency_s': 0, 'ttft_s': 0, 'end_to_end_latency_s': 0, 'request_output_throughput_token_per_s': 0, 'number_total_tokens': 6001, 'number_output_tokens': 1, 'number_input_tokens': 6000}, 

'... which I will keep so chary\nThe ", 6000), sampling_params={'max_tokens': 500}, llm_api='openai', metadata=None))'

# gen_text was empty
-----------------Gen text is : 
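(As an aside, a guard along these lines, placed inside the "for out in outs:" loop shown above, would keep the benchmark itself from crashing on these failed requests. This is only a workaround sketch against the snippet above, using the key names visible in the printed request_metrics dict; it does not address why the requests fail.)

    for out in outs:
        request_metrics, gen_text, _ = out
        num_output_tokens = get_token_length(gen_text)

        # Workaround sketch (not upstream code): skip requests that report
        # error_code -1 or a zero end-to-end latency, so the throughput
        # division at line 122 cannot divide by zero.
        if request_metrics.get("error_code") == -1 or not request_metrics.get("end_to_end_latency_s"):
            continue  # failed request; nothing meaningful to aggregate

        # ... existing metric handling, including the division at line 122 ...
        request_metrics[common_metrics.REQ_OUTPUT_THROUGHPUT] = (
            num_output_tokens / request_metrics[common_metrics.E2E_LAT]
        )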

So I used tcpdump to capture all the requests to openai_trtllm, and found that the failed request received a 200 response after exactly 15 seconds, containing maybe only one or two tokens (I don't know whether they are tokens from tritonserver or not). Meanwhile, tritonserver showed no errors. Please see the screenshots:

[screenshots of the tcpdump capture]
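For reference, a minimal standalone client like the one below can time TTFT and end-to-end latency against openai_trtllm directly, without llmperf, and dump whatever body comes back. The URL, port, and model name are placeholders for this setup, not values taken from the issue.

    import time
    import requests

    # Placeholders: point this at wherever openai_trtllm listens and use the
    # model name the proxy expects.
    URL = "http://localhost:3000/v1/chat/completions"
    payload = {
        "model": "ensemble",
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "max_tokens": 500,
        "stream": True,
    }

    start = time.time()
    first_chunk_at = None
    lines = []

    with requests.post(URL, json=payload, stream=True, timeout=120) as resp:
        print("HTTP status:", resp.status_code)
        for line in resp.iter_lines():
            if not line:
                continue
            if first_chunk_at is None:
                first_chunk_at = time.time()
            lines.append(line.decode("utf-8", errors="replace"))

    print("TTFT (s):", None if first_chunk_at is None else first_chunk_at - start)
    print("End-to-end (s):", time.time() - start)
    print("SSE lines received:", len(lines))
    print("\n".join(lines[:5]))

If the proxy really does return 200 with an empty (or near-empty) body after about 15 seconds, this should show a TTFT of roughly 15 s with zero or very few SSE lines received.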

@npuichigo (Owner)

I'm not sure whether the input tokens have exceeded the model's max token limit. You could also check the postprocessing part in Triton to debug the generated tokens, if possible.
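If it helps, a quick check along these lines compares the prompt length plus max_tokens against the sequence-length budget the TensorRT-LLM engine was built with. The tokenizer name, the budget figure, and the prompt file are placeholders, not values from this setup.

    from transformers import AutoTokenizer

    # Placeholders: use the tokenizer matching the deployed model and the
    # max_input_len / max_seq_len the TRT-LLM engine was actually built with.
    MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
    ENGINE_SEQ_BUDGET = 8192
    REQUESTED_OUTPUT = 500  # max_tokens from the benchmark's sampling_params

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    prompt = open("failing_prompt.txt").read()  # the ~6000-token prompt llmperf sent

    n_input = len(tokenizer.encode(prompt))
    print(f"input tokens: {n_input}, requested output: {REQUESTED_OUTPUT}")
    if n_input + REQUESTED_OUTPUT > ENGINE_SEQ_BUDGET:
        print("prompt + max_tokens exceeds the engine's sequence-length budget")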
