Tracing llama3.1 405b using tracy profiler for HIP/MI300X hits an assertion #19571

Open
kumardeepakamd opened this issue Dec 30, 2024 · 1 comment
Labels
bug 🐞 Something isn't working

Comments

@kumardeepakamd
Contributor

What happened?

When I try to capture a Tracy profile for llama3.1 405B on an MI300X server (HIP), it hits the following assertion:

iree/third_party/tracy/server/TracyWorker.cpp:5804: void tracy::Worker::ProcessGpuZoneEnd(const tracy::QueueGpuZoneEnd &, bool): Assertion `!ctx->query[ev.queryId]' failed.

Steps to reproduce your issue

I built IREE from source with Tracy tracing enabled:

cmake -G Ninja -B ~/iree-build-trace -S . -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DIREE_ENABLE_ASSERTIONS=ON -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ -DIREE_ENABLE_RUNTIME_TRACING=ON \
  -DIREE_BUILD_TRACY=ON -DIREE_ENABLE_LLD=ON \
  -DIREE_BUILD_PYTHON_BINDINGS=ON \
  -DPython3_EXECUTABLE="$(which python3)" \
  -DIREE_TARGET_BACKEND_CUDA=OFF -DIREE_HAL_DRIVER_HIP=ON \
  -DIREE_TARGET_BACKEND_ROCM=ON .

cmake --build ~/iree-build-trace

Then I compiled the tensor-parallel 8 (TP8) sharded IR for llama3.1 405B as follows:

~/iree-build-trace/tools/iree-compile --compile-to=input \
  artifacts/405b_f16_prefill_tp8_nondecomposed.mlir \
  -o artifacts/405b_f16_prefill_tp8_nondecomposed.iree.mlir

~/iree-build-trace/tools/iree-compile \
  artifacts/405b_f16_prefill_tp8_nondecomposed.iree.mlir \
  --iree-hip-target=gfx942 \
  --iree-hal-target-device=hip[0] \
  --iree-hal-target-device=hip[1] \
  --iree-hal-target-device=hip[2] \
  --iree-hal-target-device=hip[3] \
  --iree-hal-target-device=hip[4] \
  --iree-hal-target-device=hip[5] \
  --iree-hal-target-device=hip[6] \
  --iree-hal-target-device=hip[7] \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-data-tiling=false \
  --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
  --iree-hal-indirect-command-buffers=true \
  --iree-stream-resource-memory-model=discrete \
  --iree-hip-legacy-sync=false \
  --iree-hal-memoization=true \
  --iree-opt-strip-assertions \
  --iree-hal-executable-debug-level=3 \
  --iree-hal-dump-executable-sources-to=dump \
  --mlir-print-debuginfo \
  -o=artifacts/prefill_405b_tp8_tracy.vmfb

Then I collected the Tracy profile as follows.

In the first terminal, run:
~/iree-build-trace/tracy/iree-tracy-capture -f -o llama3.1_405b_tp8_fp16_prefill.tracy

In another terminal on the same server, run:

TRACY_PORT=8086 TRACY_NO_EXIT=1 ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  ~/iree-build-trace/tools/iree-run-module -run-module --hip_use_streams=true \
  --module=artifacts/prefill_405b_tp8_tracy.vmfb \
  --parameters=model=llama3.1_405b_fp16_tp8_parameters.irpa \
  --parameters=model=llama3.1_405b_fp16_tp8_parameters.rank0.irpa \
  --parameters=model=llama3.1_405b_fp16_tp8_parameters.rank1.irpa \
  --parameters=model=llama3.1_405b_fp16_tp8_parameters.rank2.irpa \
  --parameters=model=llama3.1_405b_fp16_tp8_parameters.rank3.irpa \
  --parameters=model=llama3.1_405b_fp16_tp8_parameters.rank4.irpa \
  --parameters=model=llama3.1_405b_fp16_tp8_parameters.rank5.irpa \
  --parameters=model=llama3.1_405b_fp16_tp8_parameters.rank6.irpa \
  --parameters=model=llama3.1_405b_fp16_tp8_parameters.rank7.irpa \
  --device=hip://0 --device=hip://1 --device=hip://2 --device=hip://3 \
  --device=hip://4 --device=hip://5 --device=hip://6 --device=hip://7 \
  --function=prefill_bs4 --input=@weights/405b/prefill_args_bs4_128/random_tokens.npy \
  --input=@prefill_args_bs4_128/seq_lens.npy \
  --input=@prefill_args_bs4_128/seq_block_ids.npy \
  --input=@prefill_args_bs4_128/cs_f16_shard_0.npy \
  --input=@prefill_args_bs4_128/cs_f16_shard_1.npy \
  --input=@prefill_args_bs4_128/cs_f16_shard_2.npy \
  --input=@prefill_args_bs4_128/cs_f16_shard_3.npy \
  --input=@prefill_args_bs4_128/cs_f16_shard_4.npy \
  --input=@prefill_args_bs4_128/cs_f16_shard_5.npy \
  --input=@prefill_args_bs4_128/cs_f16_shard_6.npy \
  --input=@prefill_args_bs4_128/cs_f16_shard_7.npy

prefill_args_bs4_128.zip

What component(s) does this issue relate to?

Compiler, Runtime

Version information

No response

Additional context

No response

@kumardeepakamd kumardeepakamd added the bug 🐞 Something isn't working label Dec 30, 2024
@AWoloszyn
Contributor

iree/third_party/tracy/server/TracyWorker.cpp:5804: void tracy::Worker::ProcessGpuZoneEnd(const tracy::QueueGpuZoneEnd &, bool): Assertion `!ctx->query[ev.queryId]' failed.

That looks like a query being re-used before being reported properly to Tracy.
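
To make that concrete, here is a toy Python model of the failure mode. It is not Tracy's code. It assumes the server keeps one slot per query ID, that a zone event may only claim a free slot (which is what the `!ctx->query[ev.queryId]` check suggests), and that the slot is freed once the GPU timestamp for that query is reported. The names `QuerySlots`, `zone_event`, `gpu_time`, and `replay` are all hypothetical.

```python
class QuerySlots:
    """Toy per-context table: query ID -> pending zone event (assumed model)."""

    def __init__(self, size=8):
        self.slots = [None] * size

    def zone_event(self, query_id, name):
        # Stand-in for the failing check: the slot must be free when a
        # zone begin/end event claims this query ID.
        if self.slots[query_id] is not None:
            raise RuntimeError(
                f"query {query_id} reused before its time was reported")
        self.slots[query_id] = name

    def gpu_time(self, query_id, timestamp):
        # Reporting the timestamp resolves the pending event and frees the slot.
        if self.slots[query_id] is None:
            raise RuntimeError(f"no pending event for query {query_id}")
        name = self.slots[query_id]
        self.slots[query_id] = None
        return name, timestamp


def replay(events, size=8):
    """Replay (kind, query_id, payload) tuples; return an error string or None."""
    table = QuerySlots(size)
    try:
        for kind, query_id, payload in events:
            if kind == "zone":
                table.zone_event(query_id, payload)
            else:
                table.gpu_time(query_id, payload)
    except RuntimeError as err:
        return str(err)
    return None


# Each query is resolved before its ID is claimed again: no error.
ok = replay([("zone", 0, "begin"), ("time", 0, 100),
             ("zone", 0, "end"), ("time", 0, 200)])

# Query 0 is claimed a second time while still pending: the check trips.
bad = replay([("zone", 0, "begin"), ("zone", 0, "end")])
```

In this model the second sequence is the analogue of the reported assertion: the client side hands out a query ID again before the timestamp for its previous use has reached the server.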
