
Whisper Triton server inference on Jetson is failing with: Assertion failed: Encoder tokens are not given #688

Open
sasatte opened this issue Dec 29, 2024 · 4 comments

Comments


sasatte commented Dec 29, 2024

I am following the instructions at https://github.com/k2-fsa/sherpa/tree/master/triton/whisper on a Jetson Orin NX. Everything went well, but inference fails with the following exception.

[TensorRT-LLM][ERROR] Encountered an error when fetching new request: [TensorRT-LLM][ERROR] Assertion failed: Encoder tokens are not given (/home/nvidia/disk/workspace/jetson_llm/tekit/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:353)
1 0xfffe90dc11f8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 104
2 0xfffe92133920 tensorrt_llm::executor::Executor::Impl::fetchNewRequests[abi:cxx11](int, std::optional) + 4128
3 0xfffe921348bc tensorrt_llm::executor::Executor::Impl::executionLoop() + 652
4 0xffff96e231fc /lib/aarch64-linux-gnu/libstdc++.so.6(+0xd31fc) [0xffff96e231fc]
5 0xffff96bed5c8 /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8) [0xffff96bed5c8]
6 0xffff96c55edc /lib/aarch64-linux-gnu/libc.so.6(+0xe5edc) [0xffff96c55edc]
E1229 08:04:47.723718 1454721 pb_stub.cc:714] "Failed to process the request(s) for model 'whisper_0_0', message: RuntimeError: Encountered an error when fetching new request: [TensorRT-LLM][ERROR] Assertion failed: Encoder tokens are not given (/home/nvidia/disk/workspace/jetson_llm/tekit/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:353)\n1 0xfffe90dc11f8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 104\n2 0xfffe92133920 tensorrt_llm::executor::Executor::Impl::fetchNewRequests[abi:cxx11](int, std::optional) + 4128\n3 0xfffe921348bc tensorrt_llm::executor::Executor::Impl::executionLoop() + 652\n4 0xffff96e231fc /lib/aarch64-linux-gnu/libstdc++.so.6(+0xd31fc) [0xffff96e231fc]\n5 0xffff96bed5c8 /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8) [0xffff96bed5c8]\n6 0xffff96c55edc /lib/aarch64-linux-gnu/libc.so.6(+0xe5edc) [0xffff96c55edc]\n\nAt:\n /home/psasatte/.local/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py(578): _fill_output\n /home/psasatte/.local/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py(540): _initialize_and_fill_output\n /home/psasatte/.local/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py(485): generate\n /home/psasatte/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper/whisper/1/model.py(103): execute\n"
E1229 08:04:47.724023 1454721 pb_stub.cc:714] "Failed to process the request(s) for model 'infer_bls_0_1', message: TritonModelException: Failed to open the cudaIpcHandle. error: invalid argument\n\nAt:\n /home/psasatte/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper/infer_bls/1/model.py(75): process_batch\n /home/psasatte/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper/infer_bls/1/model.py(106): execute\n"

@sasatte sasatte changed the title Whisper triton server inference is failing with : Assertion failed: Encoder tokens are not given Whisper triton server inference on jetson is failing with : Assertion failed: Encoder tokens are not given Dec 29, 2024
@yuekaizhang
Collaborator

@sasatte We didn't test on Jetson. Please share more info about your setup, e.g. the Docker image and TRT-LLM version. Also:

  1. Does the Whisper Triton server work on normal GPUs with your setup?
  2. Can you get other TRT-LLM recipes, e.g. llama or qwen, to work on Jetson?


sh1man999 commented Jan 20, 2025

Me too.
I'm using a regular PC with an RTX 4080 video card.

asr-1  | E0120 22:47:46.156718 2634 backend_model.cc:692] "ERROR: Failed to create instance: RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: maxTokensInPagedKvCache (1856) must be large enough to process at least 1 sequence to completion (i.e. must be larger than beam_width (1) * tokensPerBlock (64) * maxBlocksPerSeq (47)) (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:769)\n1       0x7f03651c6cd7 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 82\n2       0x7f036520780c /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x7a680c) [0x7f036520780c]\n3       0x7f03674100b1 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::createKvCacheManager(tensorrt_llm::batch_manager::kv_cache_manager::KvCacheConfig const&, tensorrt_llm::batch_manager::kv_cache_manager::CacheType) + 1377\n4       0x7f0367411295 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 3061\n5       0x7f03673bc6e4 tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 756\n6       0x7f036744296d tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 125\n7       0x7f0367442f2a tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::path> const&, 
std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::string, tensorrt_llm::executor::Tensor, std::less<std::string>, std::allocator<std::pair<std::string const, tensorrt_llm::executor::Tensor> > > > const&) + 954\n8       0x7f0367443f67 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional<std::filesystem::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2135\n9       0x7f036742f733 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 99\n10      0x7f03d5e76419 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xdd419) [0x7f03d5e76419]\n11      0x7f03d5df992f /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x6092f) [0x7f03d5df992f]\n12      0x7f04c6751023 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x128023) [0x7f04c6751023]\n13      0x7f04c6708adc _PyObject_MakeTpCall + 140\n14      0x7f04c670b43a /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe243a) [0x7f04c670b43a]\n15      0x7f04c6778172 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x14f172) [0x7f04c6778172]\n16      0x7f04c676c88e /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x14388e) [0x7f04c676c88e]\n17      0x7f03d5df375b /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x5a75b) [0x7f03d5df375b]\n18      0x7f04c6708adc _PyObject_MakeTpCall + 140\n19      0x7f04c66a4a1c _PyEval_EvalFrameDefault + 40380\n20      0x7f04c67eb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f04c67eb3af]\n21      0x7f04c670b3d8 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe23d8) 
[0x7f04c670b3d8]\n22      0x7f04c670aed8 PyVectorcall_Call + 168\n23      0x7f04c669f776 _PyEval_EvalFrameDefault + 19222\n24      0x7f04c67eb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f04c67eb3af]\n25      0x7f04c670b358 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe2358) [0x7f04c670b358]\n26      0x55e776ff0779 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x6c779) [0x55e776ff0779]\n27      0x55e776fda756 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x56756) [0x55e776fda756]\n28      0x55e776fe4211 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x60211) [0x55e776fe4211]\n29      0x55e776fac9f0 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x289f0) [0x55e776fac9f0]\n30      0x7f04c61fdd90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f04c61fdd90]\n31      0x7f04c61fde40 __libc_start_main + 128\n32      0x55e776fad8d5 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x298d5) [0x55e776fad8d5]\n\nAt:\n  /usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py(203): from_dir\n  /workspace/model_repo_whisper/whisper/1/model.py(56): initialize\n"
asr-1  | E0120 22:47:46.156839 2634 model_lifecycle.cc:642] "failed to load 'whisper' version 1: Internal: RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: maxTokensInPagedKvCache (1856) must be large enough to process at least 1 sequence to completion (i.e. must be larger than beam_width (1) * tokensPerBlock (64) * maxBlocksPerSeq (47)) (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:769)\n1       0x7f03651c6cd7 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 82\n2       0x7f036520780c /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x7a680c) [0x7f036520780c]\n3       0x7f03674100b1 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::createKvCacheManager(tensorrt_llm::batch_manager::kv_cache_manager::KvCacheConfig const&, tensorrt_llm::batch_manager::kv_cache_manager::CacheType) + 1377\n4       0x7f0367411295 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 3061\n5       0x7f03673bc6e4 tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 756\n6       0x7f036744296d tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 125\n7       0x7f0367442f2a tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::path> const&, 
std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::string, tensorrt_llm::executor::Tensor, std::less<std::string>, std::allocator<std::pair<std::string const, tensorrt_llm::executor::Tensor> > > > const&) + 954\n8       0x7f0367443f67 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional<std::filesystem::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2135\n9       0x7f036742f733 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 99\n10      0x7f03d5e76419 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xdd419) [0x7f03d5e76419]\n11      0x7f03d5df992f /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x6092f) [0x7f03d5df992f]\n12      0x7f04c6751023 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x128023) [0x7f04c6751023]\n13      0x7f04c6708adc _PyObject_MakeTpCall + 140\n14      0x7f04c670b43a /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe243a) [0x7f04c670b43a]\n15      0x7f04c6778172 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x14f172) [0x7f04c6778172]\n16      0x7f04c676c88e /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x14388e) [0x7f04c676c88e]\n17      0x7f03d5df375b /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x5a75b) [0x7f03d5df375b]\n18      0x7f04c6708adc _PyObject_MakeTpCall + 140\n19      0x7f04c66a4a1c _PyEval_EvalFrameDefault + 40380\n20      0x7f04c67eb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f04c67eb3af]\n21      0x7f04c670b3d8 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe23d8) 
[0x7f04c670b3d8]\n22      0x7f04c670aed8 PyVectorcall_Call + 168\n23      0x7f04c669f776 _PyEval_EvalFrameDefault + 19222\n24      0x7f04c67eb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f04c67eb3af]\n25      0x7f04c670b358 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe2358) [0x7f04c670b358]\n26      0x55e776ff0779 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x6c779) [0x55e776ff0779]\n27      0x55e776fda756 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x56756) [0x55e776fda756]\n28      0x55e776fe4211 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x60211) [0x55e776fe4211]\n29      0x55e776fac9f0 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x289f0) [0x55e776fac9f0]\n30      0x7f04c61fdd90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f04c61fdd90]\n31      0x7f04c61fde40 __libc_start_main + 128\n32      0x55e776fad8d5 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x298d5) [0x55e776fad8d5]\n\nAt:\n  /usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py(203): from_dir\n  /workspace/model_repo_whisper/whisper/1/model.py(56): initialize\n"
asr-1  | I0120 22:47:46.156889 2634 model_lifecycle.cc:777] "failed to load 'whisper'"


There seems to be something wrong with the `maxTokensInPagedKvCache` parameter.
I tried the large-v3 model and it does not work; only large-v3-turbo works.
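The assertion in the log above can be verified with simple arithmetic: the KV cache must be able to hold at least one full sequence, i.e. `maxTokensInPagedKvCache` must be larger than `beam_width * tokensPerBlock * maxBlocksPerSeq`. A minimal sketch, using the values from the error message (the variable names mirror the log, not an actual API):

```python
# Values taken from the assertion message in the log above.
beam_width = 1
tokens_per_block = 64
max_blocks_per_seq = 47

# What the runtime was actually able to allocate on the 4080:
max_tokens_in_paged_kv_cache = 1856

# Minimum KV-cache capacity needed to run one sequence to completion:
required = beam_width * tokens_per_block * max_blocks_per_seq
print(required)  # 1 * 64 * 47 = 3008

# 1856 < 3008, which is exactly why the assertion fires.
assert max_tokens_in_paged_kv_cache < required
```

So the check fails because only 1856 tokens' worth of paged KV cache fit in the remaining GPU memory, while a single sequence needs room for 3008.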

@yuekaizhang
Collaborator

@sh1man999 I believe your issue was caused by running out of GPU memory on the 4080. While you might get large-v3 to work by adjusting some build-time parameters, such as decreasing the maxBlocksPerSeq-related settings, I recommend using Whisper Large Turbo on GPUs like the 4080 or on NVIDIA Jetson. You may also try int8 weight-only quantization to reduce VRAM usage.
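A rough back-of-the-envelope sketch of why int8 weight-only quantization helps here, assuming Whisper large-v3 has roughly 1.55B parameters (actual engine sizes and activation/KV-cache overheads will differ):

```python
# Assumption: ~1.55e9 parameters for Whisper large-v3; this is an
# estimate for illustration, not a measured engine size.
params = 1.55e9

fp16_gb = params * 2 / 1024**3  # 2 bytes per weight in fp16
int8_gb = params * 1 / 1024**3  # 1 byte per weight in int8

print(f"fp16 weights ~{fp16_gb:.1f} GiB, int8 weights ~{int8_gb:.1f} GiB")
# fp16 weights ~2.9 GiB, int8 weights ~1.4 GiB
```

Halving the weight footprint leaves correspondingly more VRAM for the paged KV cache, which is what the `maxTokensInPagedKvCache` assertion is running out of.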


sh1man999 commented Jan 21, 2025

Where do I find the `maxBlocksPerSeq` settings? Are they in whisper/config.pbtxt?

[image attached]
