
405B prefill tp8 sharded: iree/runtime/src/iree/hal/drivers/hip/hip_buffer.c:33: iree_hal_hip_buffer_t *iree_hal_hip_buffer_cast(iree_hal_buffer_t *): Assertion `!!(iree_hal_resource_is(base_value, &iree_hal_hip_buffer_vtable))' failed #19573

Open
pdhirajkumarprasad opened this issue Dec 31, 2024 · 3 comments
Labels
bug 🐞 Something isn't working

Comments

@pdhirajkumarprasad

pdhirajkumarprasad commented Dec 31, 2024

What happened?

Getting the following error during iree-run-module/iree-benchmark-module for 405B-FP16-prefill-tp8-sharded:

iree-run-module: iree/runtime/src/iree/hal/drivers/hip/hip_buffer.c:33: iree_hal_hip_buffer_t *iree_hal_hip_buffer_cast(iree_hal_buffer_t *): Assertion `!!(iree_hal_resource_is(base_value, &iree_hal_hip_buffer_vtable))' failed.
Abort (core dumped)

build: a43d893

commands:

python3 -m sharktank.examples.export_paged_llm_v1 \
  --bs=4 \
  --irpa-file=/data/llama3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.irpa \
  --output-mlir=405b_f16_prefill_tp8_nondecomposed.mlir \
  --output-config=405b_f16_prefill_tp8_nondecomposed.json \
  --skip-decode

iree-compile 405b_f16_prefill_tp8_nondecomposed.mlir \
  --iree-hip-target=gfx942 \
  -o=405b_prefill_sharded.vmfb \
  --iree-hal-target-device="hip[0]" --iree-hal-target-device="hip[1]" \
  --iree-hal-target-device="hip[2]" --iree-hal-target-device="hip[3]" \
  --iree-hal-target-device="hip[4]" --iree-hal-target-device="hip[5]" \
  --iree-hal-target-device="hip[6]" --iree-hal-target-device="hip[7]" \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-data-tiling=false \
  --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
  --iree-hal-indirect-command-buffers=true \
  --iree-stream-resource-memory-model=discrete \
  --iree-hip-legacy-sync=false \
  --iree-hal-memoization=true \
  --iree-opt-strip-assertions

iree-benchmark-module \
  --hip_use_streams=true \
  --module=405b_prefill_sharded.vmfb \
  --parameters=model=/home/sai/temp_dir_by_dhiraj/gitRepo/shark-ai/rerun_31st_dec/instruct/405b_fp16_tp8.irpa \
  --parameters=model=/home/sai/temp_dir_by_dhiraj/gitRepo/shark-ai/rerun_31st_dec/instruct/405b_fp16_tp8.rank0.irpa \
  --parameters=model=/home/sai/temp_dir_by_dhiraj/gitRepo/shark-ai/rerun_31st_dec/instruct/405b_fp16_tp8.rank1.irpa \
  --parameters=model=/home/sai/temp_dir_by_dhiraj/gitRepo/shark-ai/rerun_31st_dec/instruct/405b_fp16_tp8.rank2.irpa \
  --parameters=model=/home/sai/temp_dir_by_dhiraj/gitRepo/shark-ai/rerun_31st_dec/instruct/405b_fp16_tp8.rank3.irpa \
  --parameters=model=/home/sai/temp_dir_by_dhiraj/gitRepo/shark-ai/rerun_31st_dec/instruct/405b_fp16_tp8.rank4.irpa \
  --parameters=model=/home/sai/temp_dir_by_dhiraj/gitRepo/shark-ai/rerun_31st_dec/instruct/405b_fp16_tp8.rank5.irpa \
  --parameters=model=/home/sai/temp_dir_by_dhiraj/gitRepo/shark-ai/rerun_31st_dec/instruct/405b_fp16_tp8.rank6.irpa \
  --parameters=model=/home/sai/temp_dir_by_dhiraj/gitRepo/shark-ai/rerun_31st_dec/instruct/405b_fp16_tp8.rank7.irpa \
  --device=hip://0 --device=hip://1 --device=hip://2 --device=hip://3 \
  --device=hip://4 --device=hip://5 --device=hip://6 --device=hip://7 \
  --function=prefill_bs4 \
--input=@/shark-dev/405b/prefill_args_bs4_2048_stride_32_tp8/tokens.npy \
  --input=@/shark-dev/405b/prefill_args_bs4_2048_stride_32_tp8/seq_lens.npy \
  --input=@/shark-dev/405b/prefill_args_bs4_2048_stride_32_tp8/seq_block_ids.npy \
  --input=@/shark-dev/405b/prefill_args_bs4_2048_stride_32_tp8/cs_f16_shard_0.npy \
  --input=@/shark-dev/405b/prefill_args_bs4_2048_stride_32_tp8/cs_f16_shard_1.npy \
  --input=@/shark-dev/405b/prefill_args_bs4_2048_stride_32_tp8/cs_f16_shard_2.npy \
  --input=@/shark-dev/405b/prefill_args_bs4_2048_stride_32_tp8/cs_f16_shard_3.npy \
  --input=@/shark-dev/405b/prefill_args_bs4_2048_stride_32_tp8/cs_f16_shard_4.npy \
  --input=@/shark-dev/405b/prefill_args_bs4_2048_stride_32_tp8/cs_f16_shard_5.npy \
  --input=@/shark-dev/405b/prefill_args_bs4_2048_stride_32_tp8/cs_f16_shard_6.npy \
  --input=@/shark-dev/405b/prefill_args_bs4_2048_stride_32_tp8/cs_f16_shard_7.npy --benchmark_repetitions=8

Try the above commands on Shark MI300X.

Steps to reproduce your issue

See the export, compile, and benchmark commands listed above.

What component(s) does this issue relate to?

Runtime

Version information

No response

Additional context

No response

pdhirajkumarprasad added the bug 🐞 Something isn't working label on Dec 31, 2024
@aviator19941
Contributor

Seems like 405B instruct prefill tp8 works with --benchmark_repetitions=3. Trying with 8 benchmark repetitions now to see if I can reproduce the error.

@aviator19941
Contributor

aviator19941 commented Jan 2, 2025

OK, after increasing benchmark_repetitions to 8, I see a segfault (core dumped). Will try to debug with ASan and the --trace_execution flag on iree-run-module to get a console dump of all the instructions that get executed on the host side.
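
A rough sketch of that debug run (untested; the build options and directory names are illustrative, and the placeholder at the end stands for the real flags from the repro above):

# Build the runtime tools with ASan enabled, from an IREE source checkout.
cmake -G Ninja -S . -B ../iree-build-asan \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DIREE_ENABLE_ASAN=ON
cmake --build ../iree-build-asan

# Re-run the failing prefill with VM execution tracing on the host.
../iree-build-asan/tools/iree-run-module \
  --trace_execution \
  --device=hip://0 --device=hip://1 --device=hip://2 --device=hip://3 \
  --device=hip://4 --device=hip://5 --device=hip://6 --device=hip://7 \
  --module=405b_prefill_sharded.vmfb \
  --function=prefill_bs4 \
  <same --parameters=... and --input=... flags as the benchmark command above>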

@aviator19941
Contributor

aviator19941 commented Jan 3, 2025

The GDB backtrace suggests the crash might be happening in another thread that isn't visible here, so it might not be too useful:

(gdb) bt
#0  0x00007f5c902e5630 in ?? ()
#1  0x00005555555b54ee in iree_allocator_free ()
#2  0x00005555555b5f40 in iree_status_ignore ()
#3  0x00005555555dd280 in iree_hal_hip_dispatch_thread_main ()
#4  0x000055555560f75b in iree_thread_start_routine ()
#5  0x00007ffff7894ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#6  0x00007ffff7926850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
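
One way to get more visibility (untested sketch) is to rerun under gdb, ideally on a build with debug info so the iree_hal_hip_* frames resolve to source lines, and dump backtraces from every thread rather than just the one gdb stops in:

gdb -q --args iree-run-module <same flags as the repro above>
(gdb) run
(gdb) thread apply all bt

thread apply all bt walks all threads in the process, which should show what the HIP dispatch thread and the main VM thread were doing at the time of the abort.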
