
405B-FP16-Decode-tp8 : Segmentation fault #19574

Open
pdhirajkumarprasad opened this issue Dec 31, 2024 · 6 comments
Labels
bug 🐞 Something isn't working

Comments

@pdhirajkumarprasad

What happened?

405B-FP16-Decode-tp8 is hitting a segmentation fault in both iree-run-module and iree-benchmark-module when the token size is 2048. With a token size of 128, iree-run-module works, but iree-benchmark-module still segfaults.

Commands:

python3 -m sharktank.examples.export_paged_llm_v1 \
  --bs=4 \
  --irpa-file=/data/llama3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.irpa \
  --output-mlir=405b_f16_decode_tp8_nondecomposed.mlir \
  --output-config=405b_f16_decode_tp8_nondecomposed.json

iree-compile 405b_f16_decode_tp8_nondecomposed.mlir \
  --iree-hip-target=gfx942 \
  -o=405b_decode_sharded.vmfb \
  --iree-hal-target-device="hip[0]" --iree-hal-target-device="hip[1]" \
  --iree-hal-target-device="hip[2]" --iree-hal-target-device="hip[3]" \
  --iree-hal-target-device="hip[4]" --iree-hal-target-device="hip[5]" \
  --iree-hal-target-device="hip[6]" --iree-hal-target-device="hip[7]" \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-data-tiling=false \
  --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
  --iree-hal-indirect-command-buffers=true \
  --iree-stream-resource-memory-model=discrete \
  --iree-hip-legacy-sync=false \
  --iree-hal-memoization=true \
  --iree-opt-strip-assertions


iree-run-module \
  --hip_use_streams=true \
  --module=405b_decode_sharded.vmfb \
  --parameters=model=/data/llama3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.irpa \
  --parameters=model=/data/llama3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.rank0.irpa \
  --parameters=model=/data/llama3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.rank1.irpa \
  --parameters=model=/data/llama3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.rank2.irpa \
  --parameters=model=/data/llama3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.rank3.irpa \
  --parameters=model=/data/llama3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.rank4.irpa \
  --parameters=model=/data/llama3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.rank5.irpa \
  --parameters=model=/data/llama3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.rank6.irpa \
  --parameters=model=/data/llama3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters.rank7.irpa \
  --device=hip://0 --device=hip://1 --device=hip://2 --device=hip://3 \
  --device=hip://4 --device=hip://5 --device=hip://6 --device=hip://7 \
  --function=decode_bs4 \
  --input=@/shark-dev/405b/decode_args_bs4_2048_stride_32_tp8/next_tokens.npy \
  --input=@/shark-dev/405b/decode_args_bs4_2048_stride_32_tp8/seq_lens.npy \
  --input=@/shark-dev/405b/decode_args_bs4_2048_stride_32_tp8/start_positions.npy \
  --input=@/shark-dev/405b/decode_args_bs4_2048_stride_32_tp8/seq_block_ids.npy \
  --input=@/shark-dev/405b/decode_args_bs4_2048_stride_32_tp8/cs_f16_shard_0.npy \
  --input=@/shark-dev/405b/decode_args_bs4_2048_stride_32_tp8/cs_f16_shard_1.npy \
  --input=@/shark-dev/405b/decode_args_bs4_2048_stride_32_tp8/cs_f16_shard_2.npy \
  --input=@/shark-dev/405b/decode_args_bs4_2048_stride_32_tp8/cs_f16_shard_3.npy \
  --input=@/shark-dev/405b/decode_args_bs4_2048_stride_32_tp8/cs_f16_shard_4.npy \
  --input=@/shark-dev/405b/decode_args_bs4_2048_stride_32_tp8/cs_f16_shard_5.npy \
  --input=@/shark-dev/405b/decode_args_bs4_2048_stride_32_tp8/cs_f16_shard_6.npy \
  --input=@/shark-dev/405b/decode_args_bs4_2048_stride_32_tp8/cs_f16_shard_7.npy

build: a43d893

Run the above commands on the SharkMI300X machine.

Steps to reproduce your issue

See the commands listed under "What happened?" above.

What component(s) does this issue relate to?

Runtime

Version information

No response

Additional context

No response

@pdhirajkumarprasad pdhirajkumarprasad added the bug 🐞 Something isn't working label Dec 31, 2024
@AWoloszyn
Contributor

Are you running out of memory here? (I just fixed an issue w.r.t. propagating errors.) If it works in some configurations, you might just be running out of memory on the system.
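
(Not from the thread, but a quick way to check this on a ROCm system: rocm-smi is the standard AMD monitoring tool, so a sketch would be to watch per-GPU VRAM usage in a second terminal while the module runs.)

# Refresh per-GPU VRAM usage every second during the run.
watch -n 1 'rocm-smi --showmeminfo vram'

If any GPU climbs toward 100% VRAM right before the segfault, OOM is the likely culprit.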

@pdhirajkumarprasad
Author

Decode with token sizes 128/2048 shows these failures, and it doesn't look like an OOM issue, but I need to debug further to find the exact root cause.

@aviator19941
Contributor

aviator19941 commented Jan 3, 2025

Running module with --trace_execution:

wget https://sharkpublic.blob.core.windows.net/sharkpublic/halo-models/llm-dev/llama3_405b/405b_decode_trace_execution.txt
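
(For reference, --trace_execution is an iree-run-module flag that logs each VM instruction as it executes. The exact invocation isn't shown above; a sketch, assuming the same run command and capturing both output streams to a file:)

iree-run-module --trace_execution --hip_use_streams=true --module=405b_decode_sharded.vmfb \
  ...(remaining --parameters/--device/--function/--input flags as in the run command above) \
  > 405b_decode_trace_execution.txt 2>&1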

@AWoloszyn
Contributor

#19583

@aviator19941
Contributor

aviator19941 commented Jan 3, 2025

Testing on SharkMi300x-4, the module hits 99% memory usage on GPU-0 (the other 7 GPUs hit 95%) and then crashes. These are the GDB logs I am seeing for 405b decode with this patch right before the crash:

I set breakpoints in gdb at runtime/src/iree/hal/drivers/hip/hip_allocator.c:670 and runtime/src/iree/hal/drivers/hip/hip_allocator.c:679. It seems like the threads allocated in decode keep being used over and over again, because decode is still calling iree_hal_hip_allocator_alloc_async many more times than prefill does.
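
(A reconstruction of that GDB session, assuming iree-run-module was built with debug info; the exact invocation isn't shown in the thread:)

gdb --args iree-run-module --hip_use_streams=true --module=405b_decode_sharded.vmfb \
  ...(remaining flags as in the run command above)
(gdb) break runtime/src/iree/hal/drivers/hip/hip_allocator.c:670
(gdb) break runtime/src/iree/hal/drivers/hip/hip_allocator.c:679
(gdb) run

Hitting the first breakpoint far more often during decode than during prefill would confirm the repeated iree_hal_hip_allocator_alloc_async calls.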

https://gist.github.com/aviator19941/78b4c0e8afd72a55a9d7355a4e84dd7b

@AWoloszyn
Contributor

So the new async allocator has a higher memory high-water mark than the previous one (the previous implementation would essentially stall the program entirely until the free could happen). You can try the caching allocator to see whether that fixes your problem, as it more closely mirrors the previous allocation strategy.
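
(The thread doesn't show the invocation, but IREE's runtime tools expose allocator wrappers through the --device_allocator flag; a sketch of trying the caching allocator with the same run command:)

iree-run-module --device_allocator=caching --hip_use_streams=true --module=405b_decode_sharded.vmfb \
  ...(remaining flags as in the run command above)

The caching allocator retains freed allocations for reuse instead of issuing fresh async allocations, which should keep the high-water mark closer to the old behavior.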
