
Whisper Triton server inference on Jetson is failing with: Assertion failed: Encoder tokens are not given #688

Open
sasatte opened this issue Dec 29, 2024 · 4 comments

Comments


sasatte commented Dec 29, 2024

I am following the instructions at https://github.com/k2-fsa/sherpa/tree/master/triton/whisper on a Jetson Orin NX. Everything went well, but inference fails with the following exception.

[TensorRT-LLM][ERROR] Encountered an error when fetching new request: [TensorRT-LLM][ERROR] Assertion failed: Encoder tokens are not given (/home/nvidia/disk/workspace/jetson_llm/tekit/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:353)
1 0xfffe90dc11f8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 104
2 0xfffe92133920 tensorrt_llm::executor::Executor::Impl::fetchNewRequests[abi:cxx11](int, std::optional) + 4128
3 0xfffe921348bc tensorrt_llm::executor::Executor::Impl::executionLoop() + 652
4 0xffff96e231fc /lib/aarch64-linux-gnu/libstdc++.so.6(+0xd31fc) [0xffff96e231fc]
5 0xffff96bed5c8 /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8) [0xffff96bed5c8]
6 0xffff96c55edc /lib/aarch64-linux-gnu/libc.so.6(+0xe5edc) [0xffff96c55edc]
E1229 08:04:47.723718 1454721 pb_stub.cc:714] "Failed to process the request(s) for model 'whisper_0_0', message: RuntimeError: Encountered an error when fetching new request: [TensorRT-LLM][ERROR] Assertion failed: Encoder tokens are not given (/home/nvidia/disk/workspace/jetson_llm/tekit/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:353)\n1 0xfffe90dc11f8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 104\n2 0xfffe92133920 tensorrt_llm::executor::Executor::Impl::fetchNewRequests[abi:cxx11](int, std::optional) + 4128\n3 0xfffe921348bc tensorrt_llm::executor::Executor::Impl::executionLoop() + 652\n4 0xffff96e231fc /lib/aarch64-linux-gnu/libstdc++.so.6(+0xd31fc) [0xffff96e231fc]\n5 0xffff96bed5c8 /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8) [0xffff96bed5c8]\n6 0xffff96c55edc /lib/aarch64-linux-gnu/libc.so.6(+0xe5edc) [0xffff96c55edc]\n\nAt:\n /home/psasatte/.local/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py(578): _fill_output\n /home/psasatte/.local/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py(540): _initialize_and_fill_output\n /home/psasatte/.local/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py(485): generate\n /home/psasatte/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper/whisper/1/model.py(103): execute\n"
E1229 08:04:47.724023 1454721 pb_stub.cc:714] "Failed to process the request(s) for model 'infer_bls_0_1', message: TritonModelException: Failed to open the cudaIpcHandle. error: invalid argument\n\nAt:\n /home/psasatte/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper/infer_bls/1/model.py(75): process_batch\n /home/psasatte/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper/infer_bls/1/model.py(106): execute\n"

@sasatte sasatte changed the title Whisper triton server inference is failing with : Assertion failed: Encoder tokens are not given Whisper triton server inference on jetson is failing with : Assertion failed: Encoder tokens are not given Dec 29, 2024
@yuekaizhang
Collaborator

@sasatte We didn't test on Jetson. Please share more info about your setup, e.g. the Docker image and TRT-LLM version. Also:

  1. Does the Whisper Triton server work on normal GPUs with your setup?
  2. Can you get other TRT-LLM recipes, e.g. llama or qwen, to work on Jetson?


sh1man999 commented Jan 20, 2025

Me too.
I'm using a regular PC with an RTX 4080 video card.

asr-1  | E0120 22:47:46.156718 2634 backend_model.cc:692] "ERROR: Failed to create instance: RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: maxTokensInPagedKvCache (1856) must be large enough to process at least 1 sequence to completion (i.e. must be larger than beam_width (1) * tokensPerBlock (64) * maxBlocksPerSeq (47)) (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:769)\n1       0x7f03651c6cd7 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 82\n2       0x7f036520780c /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x7a680c) [0x7f036520780c]\n3       0x7f03674100b1 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::createKvCacheManager(tensorrt_llm::batch_manager::kv_cache_manager::KvCacheConfig const&, tensorrt_llm::batch_manager::kv_cache_manager::CacheType) + 1377\n4       0x7f0367411295 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 3061\n5       0x7f03673bc6e4 tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 756\n6       0x7f036744296d tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 125\n7       0x7f0367442f2a tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::path> const&, 
std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::string, tensorrt_llm::executor::Tensor, std::less<std::string>, std::allocator<std::pair<std::string const, tensorrt_llm::executor::Tensor> > > > const&) + 954\n8       0x7f0367443f67 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional<std::filesystem::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2135\n9       0x7f036742f733 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 99\n10      0x7f03d5e76419 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xdd419) [0x7f03d5e76419]\n11      0x7f03d5df992f /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x6092f) [0x7f03d5df992f]\n12      0x7f04c6751023 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x128023) [0x7f04c6751023]\n13      0x7f04c6708adc _PyObject_MakeTpCall + 140\n14      0x7f04c670b43a /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe243a) [0x7f04c670b43a]\n15      0x7f04c6778172 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x14f172) [0x7f04c6778172]\n16      0x7f04c676c88e /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x14388e) [0x7f04c676c88e]\n17      0x7f03d5df375b /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x5a75b) [0x7f03d5df375b]\n18      0x7f04c6708adc _PyObject_MakeTpCall + 140\n19      0x7f04c66a4a1c _PyEval_EvalFrameDefault + 40380\n20      0x7f04c67eb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f04c67eb3af]\n21      0x7f04c670b3d8 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe23d8) 
[0x7f04c670b3d8]\n22      0x7f04c670aed8 PyVectorcall_Call + 168\n23      0x7f04c669f776 _PyEval_EvalFrameDefault + 19222\n24      0x7f04c67eb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f04c67eb3af]\n25      0x7f04c670b358 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe2358) [0x7f04c670b358]\n26      0x55e776ff0779 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x6c779) [0x55e776ff0779]\n27      0x55e776fda756 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x56756) [0x55e776fda756]\n28      0x55e776fe4211 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x60211) [0x55e776fe4211]\n29      0x55e776fac9f0 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x289f0) [0x55e776fac9f0]\n30      0x7f04c61fdd90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f04c61fdd90]\n31      0x7f04c61fde40 __libc_start_main + 128\n32      0x55e776fad8d5 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x298d5) [0x55e776fad8d5]\n\nAt:\n  /usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py(203): from_dir\n  /workspace/model_repo_whisper/whisper/1/model.py(56): initialize\n"
asr-1  | E0120 22:47:46.156839 2634 model_lifecycle.cc:642] "failed to load 'whisper' version 1: Internal: RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: maxTokensInPagedKvCache (1856) must be large enough to process at least 1 sequence to completion (i.e. must be larger than beam_width (1) * tokensPerBlock (64) * maxBlocksPerSeq (47)) (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:769)\n1       0x7f03651c6cd7 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 82\n2       0x7f036520780c /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x7a680c) [0x7f036520780c]\n3       0x7f03674100b1 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::createKvCacheManager(tensorrt_llm::batch_manager::kv_cache_manager::KvCacheConfig const&, tensorrt_llm::batch_manager::kv_cache_manager::CacheType) + 1377\n4       0x7f0367411295 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 3061\n5       0x7f03673bc6e4 tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 756\n6       0x7f036744296d tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 125\n7       0x7f0367442f2a tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::path> const&, 
std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::string, tensorrt_llm::executor::Tensor, std::less<std::string>, std::allocator<std::pair<std::string const, tensorrt_llm::executor::Tensor> > > > const&) + 954\n8       0x7f0367443f67 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional<std::filesystem::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2135\n9       0x7f036742f733 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 99\n10      0x7f03d5e76419 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xdd419) [0x7f03d5e76419]\n11      0x7f03d5df992f /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x6092f) [0x7f03d5df992f]\n12      0x7f04c6751023 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x128023) [0x7f04c6751023]\n13      0x7f04c6708adc _PyObject_MakeTpCall + 140\n14      0x7f04c670b43a /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe243a) [0x7f04c670b43a]\n15      0x7f04c6778172 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x14f172) [0x7f04c6778172]\n16      0x7f04c676c88e /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x14388e) [0x7f04c676c88e]\n17      0x7f03d5df375b /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x5a75b) [0x7f03d5df375b]\n18      0x7f04c6708adc _PyObject_MakeTpCall + 140\n19      0x7f04c66a4a1c _PyEval_EvalFrameDefault + 40380\n20      0x7f04c67eb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f04c67eb3af]\n21      0x7f04c670b3d8 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe23d8) 
[0x7f04c670b3d8]\n22      0x7f04c670aed8 PyVectorcall_Call + 168\n23      0x7f04c669f776 _PyEval_EvalFrameDefault + 19222\n24      0x7f04c67eb3af /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0x1c23af) [0x7f04c67eb3af]\n25      0x7f04c670b358 /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0(+0xe2358) [0x7f04c670b358]\n26      0x55e776ff0779 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x6c779) [0x55e776ff0779]\n27      0x55e776fda756 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x56756) [0x55e776fda756]\n28      0x55e776fe4211 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x60211) [0x55e776fe4211]\n29      0x55e776fac9f0 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x289f0) [0x55e776fac9f0]\n30      0x7f04c61fdd90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f04c61fdd90]\n31      0x7f04c61fde40 __libc_start_main + 128\n32      0x55e776fad8d5 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x298d5) [0x55e776fad8d5]\n\nAt:\n  /usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py(203): from_dir\n  /workspace/model_repo_whisper/whisper/1/model.py(56): initialize\n"
asr-1  | I0120 22:47:46.156889 2634 model_lifecycle.cc:777] "failed to load 'whisper'"


There seems to be something wrong with the `maxTokensInPagedKvCache` parameter.
I tried the large-v3 model and it does not work; only large-v3-turbo works.
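The assertion in the log above can be verified with simple arithmetic: the KV cache must be able to hold at least one full sequence, i.e. `maxTokensInPagedKvCache` must be larger than `beam_width * tokensPerBlock * maxBlocksPerSeq`. A minimal sketch, using the values from the error message (the variable names mirror the log, not an actual API):

```python
# Values taken from the assertion message in the log above.
beam_width = 1
tokens_per_block = 64
max_blocks_per_seq = 47

# What the runtime was actually able to allocate on the 4080:
max_tokens_in_paged_kv_cache = 1856

# Minimum KV-cache capacity needed to run one sequence to completion:
required = beam_width * tokens_per_block * max_blocks_per_seq
print(required)  # 1 * 64 * 47 = 3008

# 1856 < 3008, which is exactly why the assertion fires.
assert max_tokens_in_paged_kv_cache < required
```

So the check fails because only 1856 tokens' worth of paged KV cache fit in the remaining GPU memory, while a single sequence needs room for 3008.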

@yuekaizhang
Collaborator

@sh1man999 I believe your issue was caused by running out of GPU memory on the 4080. While you might get large-v3 to work by adjusting some build-time parameters, such as decreasing the maxBlocksPerSeq-related settings, I recommend using Whisper Large Turbo on GPUs like the 4080 or on NVIDIA Jetson. You may also try int8 weight-only quantization to reduce VRAM usage.
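A rough back-of-the-envelope sketch of why int8 weight-only quantization helps here, assuming Whisper large-v3 has roughly 1.55B parameters (actual engine sizes and activation/KV-cache overheads will differ):

```python
# Assumption: ~1.55e9 parameters for Whisper large-v3; this is an
# estimate for illustration, not a measured engine size.
params = 1.55e9

fp16_gb = params * 2 / 1024**3  # 2 bytes per weight in fp16
int8_gb = params * 1 / 1024**3  # 1 byte per weight in int8

print(f"fp16 weights ~{fp16_gb:.1f} GiB, int8 weights ~{int8_gb:.1f} GiB")
# fp16 weights ~2.9 GiB, int8 weights ~1.4 GiB
```

Halving the weight footprint leaves correspondingly more VRAM for the paged KV cache, which is what the `maxTokensInPagedKvCache` assertion is running out of.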


sh1man999 commented Jan 21, 2025

Where do I find the `maxBlocksPerSeq` settings? Are they in whisper/config.pbtxt?

[image attached]
