Whisper Triton server inference on Jetson is failing with: Assertion failed: Encoder tokens are not given #688
Comments
@sasatte We didn't test on Jetson. Please also share more info about your setup, e.g. the docker image and trt-llm version.
Me too.
@sh1man999 I believe your issue was caused by running out of GPU memory on the 4080. While you might get large-v3 to work by adjusting some parameters during the build, such as decreasing the maxBlocksPerSeq-related settings, I recommend using Whisper Large Turbo on GPUs like the 4080 or NVIDIA Jetson. You could also try int8 weight-only quantization to reduce VRAM usage.
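For reference, a rough sketch of what an int8 weight-only build could look like, assuming the `convert_checkpoint.py` + `trtllm-build` flow from the TensorRT-LLM Whisper example; the exact flags, script arguments, and paths vary across trt-llm versions, so verify everything against the README you followed rather than treating this as the confirmed recipe:

```bash
# Sketch only: --use_weight_only / --weight_only_precision follow the
# convention of TensorRT-LLM's convert scripts; paths are placeholders.
python3 convert_checkpoint.py \
  --use_weight_only \
  --weight_only_precision int8 \
  --output_dir tllm_checkpoint_int8

# Build the encoder and decoder engines from the quantized checkpoint
# (the Whisper example writes separate encoder/ and decoder/ checkpoints).
trtllm-build --checkpoint_dir tllm_checkpoint_int8/encoder \
  --output_dir whisper_int8/encoder
trtllm-build --checkpoint_dir tllm_checkpoint_int8/decoder \
  --output_dir whisper_int8/decoder
```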
I followed the instructions at https://github.com/k2-fsa/sherpa/tree/master/triton/whisper on a Jetson Orin NX. The build went well, but inference fails with the following exception:
```
[TensorRT-LLM][ERROR] Encountered an error when fetching new request: [TensorRT-LLM][ERROR] Assertion failed: Encoder tokens are not given (/home/nvidia/disk/workspace/jetson_llm/tekit/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:353)
1  0xfffe90dc11f8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 104
2  0xfffe92133920 tensorrt_llm::executor::Executor::Impl::fetchNewRequests[abi:cxx11](int, std::optional) + 4128
3  0xfffe921348bc tensorrt_llm::executor::Executor::Impl::executionLoop() + 652
4  0xffff96e231fc /lib/aarch64-linux-gnu/libstdc++.so.6(+0xd31fc) [0xffff96e231fc]
5  0xffff96bed5c8 /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8) [0xffff96bed5c8]
6  0xffff96c55edc /lib/aarch64-linux-gnu/libc.so.6(+0xe5edc) [0xffff96c55edc]

E1229 08:04:47.723718 1454721 pb_stub.cc:714] Failed to process the request(s) for model 'whisper_0_0', message: RuntimeError: Encountered an error when fetching new request: [TensorRT-LLM][ERROR] Assertion failed: Encoder tokens are not given (/home/nvidia/disk/workspace/jetson_llm/tekit/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:353)
1  0xfffe90dc11f8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 104
2  0xfffe92133920 tensorrt_llm::executor::Executor::Impl::fetchNewRequests[abi:cxx11](int, std::optional) + 4128
3  0xfffe921348bc tensorrt_llm::executor::Executor::Impl::executionLoop() + 652
4  0xffff96e231fc /lib/aarch64-linux-gnu/libstdc++.so.6(+0xd31fc) [0xffff96e231fc]
5  0xffff96bed5c8 /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8) [0xffff96bed5c8]
6  0xffff96c55edc /lib/aarch64-linux-gnu/libc.so.6(+0xe5edc) [0xffff96c55edc]

At:
  /home/psasatte/.local/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py(578): _fill_output
  /home/psasatte/.local/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py(540): _initialize_and_fill_output
  /home/psasatte/.local/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py(485): generate
  /home/psasatte/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper/whisper/1/model.py(103): execute
```
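The assertion in `llmRequest.h` means the executor received a request for an encoder-decoder model with no encoder input attached, which usually points to a mismatch between the built engine and the installed trt-llm runtime, or to the mel features not being forwarded by `generate`. For orientation, here is a minimal sketch of the call shape the Whisper example expects, assuming a TensorRT-LLM version whose `ModelRunnerCpp.generate` accepts `encoder_input_features`; the kwarg names, mel shape, and token IDs are assumptions to check against the `model_runner_cpp.py` shipped with your trt-llm, not a confirmed fix:

```python
import torch
from tensorrt_llm.runtime import ModelRunnerCpp

# Sketch only: kwarg names follow TensorRT-LLM's examples/whisper recipe;
# verify them against the model_runner_cpp.py of your installed version.
runner = ModelRunnerCpp.from_dir("whisper_engine_dir",  # placeholder path
                                 is_enc_dec=True)

mel = torch.randn(128, 3000, dtype=torch.float16, device="cuda")  # dummy features
prompt = torch.tensor([50258, 50259, 50359, 50363],  # <|startoftranscript|>... prefix
                      dtype=torch.int32)

outputs = runner.generate(
    batch_input_ids=[prompt],
    encoder_input_features=[mel],   # if this ends up empty/None, the executor
                                    # raises "Encoder tokens are not given"
    encoder_output_lengths=[mel.shape[1] // 2],  # Whisper encoder downsamples 2x
    max_new_tokens=96,
    end_id=50257,                   # <|endoftext|>
    pad_id=50257,
)
```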
```
E1229 08:04:47.724023 1454721 pb_stub.cc:714] Failed to process the request(s) for model 'infer_bls_0_1', message: TritonModelException: Failed to open the cudaIpcHandle. error: invalid argument

At:
  /home/psasatte/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper/infer_bls/1/model.py(75): process_batch
  /home/psasatte/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper/infer_bls/1/model.py(106): execute
```
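The `infer_bls` failure looks Jetson-specific: Triton's Python backend passes GPU tensors between models via CUDA IPC handles, and CUDA IPC is not supported on Tegra/Jetson, which would make opening the `cudaIpcHandle` fail with `invalid argument`. Two things worth trying (assumptions to verify, not confirmed fixes): move the whisper model's output tensors to CPU before building the `pb_utils.InferenceResponse`, and/or set the Python backend's documented `FORCE_CPU_ONLY_INPUT_TENSORS` parameter in the Python models' `config.pbtxt` so input tensors stay on the CPU:

```
# config.pbtxt of the Python models (e.g. whisper, infer_bls); this parameter
# is documented in the triton-inference-server/python_backend README.
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: { string_value: "yes" }
}
```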