When I try to capture a Tracy profile for Llama 3.1 405B on an MI300X server (HIP), the capture hits the following assertion:
iree/third_party/tracy/server/TracyWorker.cpp:5804: void tracy::Worker::ProcessGpuZoneEnd(const tracy::QueueGpuZoneEnd &, bool): Assertion `!ctx->query[ev.queryId]' failed.
I built IREE from source with Tracy tracing enabled as follows:
cmake -G Ninja -B ~/iree-build-trace -S . -DCMAKE_BUILD_TYPE=RelWithDebInfo -DIREE_ENABLE_ASSERTIONS=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DIREE_ENABLE_RUNTIME_TRACING=ON -DIREE_BUILD_TRACY=ON -DIREE_ENABLE_LLD=ON -DIREE_BUILD_PYTHON_BINDINGS=ON -DPython3_EXECUTABLE="$(which python3)" -DIREE_TARGET_BACKEND_CUDA=OFF -DIREE_HAL_DRIVER_HIP=ON -DIREE_TARGET_BACKEND_ROCM=ON .
cmake --build ~/iree-build-trace
Then I compiled the tensor-parallel 8 (TP8) sharded IR for Llama 3.1 405B as follows:
~/iree-build-trace/tools/iree-compile --compile-to=input artifacts/405b_f16_prefill_tp8_nondecomposed.mlir -o artifacts/405b_f16_prefill_tp8_nondecomposed.iree.mlir
~/iree-build-trace/tools/iree-compile artifacts/405b_f16_prefill_tp8_nondecomposed.iree.mlir --iree-hip-target=gfx942 --iree-hal-target-device=hip[0] --iree-hal-target-device=hip[1] --iree-hal-target-device=hip[2] --iree-hal-target-device=hip[3] --iree-hal-target-device=hip[4] --iree-hal-target-device=hip[5] --iree-hal-target-device=hip[6] --iree-hal-target-device=hip[7] --iree-dispatch-creation-enable-aggressive-fusion=true --iree-global-opt-propagate-transposes=true --iree-opt-aggressively-propagate-transposes=true --iree-opt-data-tiling=false --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' --iree-hal-indirect-command-buffers=true --iree-stream-resource-memory-model=discrete --iree-hip-legacy-sync=false --iree-hal-memoization=true --iree-opt-strip-assertions --iree-hal-executable-debug-level=3 --iree-hal-dump-executable-sources-to=dump --mlir-print-debuginfo -o=artifacts/prefill_405b_tp8_tracy.vmfb
Then I collected the Tracy profile as follows.
In the first terminal, run:
~/iree-build-trace/tracy/iree-tracy-capture -f -o llama3.1_405b_tp8_fp16_prefill.tracy
In a second terminal on the same server, run:
TRACY_PORT=8086 TRACY_NO_EXIT=1 ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ~/iree-build-trace/tools/iree-run-module -run-module --hip_use_streams=true --module=artifacts/prefill_405b_tp8_tracy.vmfb --parameters=model=llama3.1_405b_fp16_tp8_parameters.irpa --parameters=model=llama3.1_405b_fp16_tp8_parameters.rank0.irpa --parameters=model=llama3.1_405b_fp16_tp8_parameters.rank1.irpa --parameters=model=llama3.1_405b_fp16_tp8_parameters.rank2.irpa --parameters=model=llama3.1_405b_fp16_tp8_parameters.rank3.irpa --parameters=model=llama3.1_405b_fp16_tp8_parameters.rank4.irpa --parameters=model=llama3.1_405b_fp16_tp8_parameters.rank5.irpa --parameters=model=llama3.1_405b_fp16_tp8_parameters.rank6.irpa --parameters=model=llama3.1_405b_fp16_tp8_parameters.rank7.irpa --device=hip://0 --device=hip://1 --device=hip://2 --device=hip://3 --device=hip://4 --device=hip://5 --device=hip://6 --device=hip://7 --function=prefill_bs4 --input=@weights/405b/prefill_args_bs4_128/random_tokens.npy --input=@prefill_args_bs4_128/seq_lens.npy --input=@prefill_args_bs4_128/seq_block_ids.npy --input=@prefill_args_bs4_128/cs_f16_shard_0.npy --input=@prefill_args_bs4_128/cs_f16_shard_1.npy --input=@prefill_args_bs4_128/cs_f16_shard_2.npy --input=@prefill_args_bs4_128/cs_f16_shard_3.npy --input=@prefill_args_bs4_128/cs_f16_shard_4.npy --input=@prefill_args_bs4_128/cs_f16_shard_5.npy --input=@prefill_args_bs4_128/cs_f16_shard_6.npy --input=@prefill_args_bs4_128/cs_f16_shard_7.npy
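Because the run command repeats the same flags once per device, it may be easier to reproduce by generating the argument list. The following is only a convenience sketch: the helper name `run_module_args` and its parameters are hypothetical, and it only emits flags that already appear in the command above.

```python
import shlex


def run_module_args(num_devices, module, params_prefix, function, inputs):
    """Build the repeated iree-run-module flags for a sharded model.

    Hypothetical helper: it only formats flags copied from the command
    above and does not validate them against iree-run-module.
    """
    args = [f"--module={module}", f"--parameters=model={params_prefix}.irpa"]
    # One parameter archive per tensor-parallel rank.
    args += [
        f"--parameters=model={params_prefix}.rank{i}.irpa"
        for i in range(num_devices)
    ]
    # One HIP device per shard.
    args += [f"--device=hip://{i}" for i in range(num_devices)]
    args.append(f"--function={function}")
    args += [f"--input=@{path}" for path in inputs]
    return args


inputs = [
    "weights/405b/prefill_args_bs4_128/random_tokens.npy",
    "prefill_args_bs4_128/seq_lens.npy",
    "prefill_args_bs4_128/seq_block_ids.npy",
] + [f"prefill_args_bs4_128/cs_f16_shard_{i}.npy" for i in range(8)]

args = run_module_args(
    num_devices=8,
    module="artifacts/prefill_405b_tp8_tracy.vmfb",
    params_prefix="llama3.1_405b_fp16_tp8_parameters",
    function="prefill_bs4",
    inputs=inputs,
)
# Print a shell-quoted flag string to append after the iree-run-module binary.
print(shlex.join(args))
```

The environment variables and the remaining flags from the original command still need to be added by hand.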
prefill_args_bs4_128.zip
Component(s) this issue relates to: Compiler, Runtime
Version information: No response
iree/third_party/tracy/server/TracyWorker.cpp:5804: void tracy::Worker::ProcessGpuZoneEnd(const tracy::QueueGpuZoneEnd &, bool): Assertion `!ctx->query[ev.queryId]' failed.
That looks like a query being re-used before being reported properly to Tracy.
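To illustrate the kind of reuse described here, below is a toy model. It assumes the server keeps one slot per query ID, expects that slot to be empty when a zone end claims it, and frees it once the GPU timestamp for that ID is reported. The names `QuerySlots` and `run` are hypothetical; this is not Tracy's or IREE's actual code, and the capacities are made-up small numbers.

```python
class QuerySlots:
    """Toy per-context query table (hypothetical, not Tracy's code)."""

    def __init__(self, capacity):
        self.slots = [None] * capacity

    def zone_end(self, query_id, zone):
        # Analogue of the failing check: the slot must be empty when a
        # zone end claims the query ID.
        if self.slots[query_id] is not None:
            raise AssertionError(
                f"query {query_id} reused before its timestamp was reported"
            )
        self.slots[query_id] = zone

    def gpu_time(self, query_id):
        # Reporting the timestamp frees the slot for reuse.
        zone = self.slots[query_id]
        assert zone is not None
        self.slots[query_id] = None
        return zone


def run(capacity, zones, report_every):
    """Issue `zones` zone ends with query IDs wrapping at `capacity`.

    Timestamps are reported in batches of `report_every`. Returns the index
    of the first zone that collides with an unreported query, or None.
    """
    table = QuerySlots(capacity)
    pending = []
    for i in range(zones):
        query_id = i % capacity
        try:
            table.zone_end(query_id, f"zone{i}")
        except AssertionError:
            return i
        pending.append(query_id)
        if len(pending) >= report_every:
            for q in pending:
                table.gpu_time(q)
            pending.clear()
    return None


print(run(capacity=4, zones=10, report_every=2))  # timestamps keep up
print(run(capacity=4, zones=10, report_every=8))  # IDs wrap around first
```

In this model, the collision appears only when more zones are issued than there are slots before any timestamps are reported. That would be consistent with a very large workload triggering the assertion, but it is a guess about the cause, not a confirmed diagnosis.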