
[Bug]: Sporadic ROCm Memory Access errors with SHOC Sort #26669

Open
ShreyasKhandekar opened this issue Feb 7, 2025 · 4 comments
Comments

@ShreyasKhandekar
Contributor

ShreyasKhandekar commented Feb 7, 2025

Summary of Problem

Description:
test/gpu/native/studies/shoc/shoc-sort fails occasionally on ROCm 6 on EX when compiled with --fast with the following error:

Memory access fault by GPU node-4 (Agent handle: 0x5606205e4c20) on address 0x7fec94000000. Reason: Unknown.
srun: error: pinoak0004: task 0: Aborted

Other failure modes include a more helpful error: "Reason: Write access to a read-only page." instead of "Reason: Unknown."
Or sometimes the sort completes but produces an incorrect result:

Fail at [index] 377105: 15 > 0
Test Failed
Verification failed

Without --fast we don't see the failure.

Is this issue currently blocking your progress?
No, but it is causing nightly failures.

Steps to Reproduce

test/gpu/native/studies/shoc/shoc-sort on ROCm 6 with --fast

Compile command:
chpl --fast --output=false shoc-sort.chpl

Note that this file was refactored to not use results.chpl or the ResultDB record to isolate the problem a little bit.

Execution command:
We run it 10 times to allow the sporadic failures to manifest
for i in {1..10}; do ./shoc-sort --passes=1; done
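
A slightly more structured version of that loop (a sketch; it assumes ./shoc-sort was built as in the compile step above) counts how many runs fail, so a sporadic failure shows up as a nonzero count rather than scrolling past:

```shell
# Run the reproducer 10 times and count how many runs fail
# (nonzero exit status counts as a failure).
fails=0
for i in $(seq 1 10); do
  if ! ./shoc-sort --passes=1 >/dev/null 2>&1; then
    fails=$((fails + 1))
  fi
done
echo "failed runs: $fails/10"
```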

Associated Future Test(s):
none

Configuration Information

- Output of `chpl --version`:
warning: The prototype GPU support implies --no-checks. This may impact debuggability. To suppress this warning, compile with --no-checks explicitly
chpl version 2.4.0 pre-release (9af93c5437)
  built with LLVM version 19.1.3
  available LLVM targets: amdgcn, r600, nvptx64, nvptx, aarch64_32, aarch64_be, aarch64, arm64_32, arm64, x86-64, x86
Copyright 2020-2025 Hewlett Packard Enterprise Development LP
Copyright 2004-2019 Cray Inc.
(See LICENSE file for more details)
- Output of `$CHPL_HOME/util/printchplenv --anonymize`:
CHPL_TARGET_PLATFORM: hpe-cray-ex
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: x86-rome
CHPL_LOCALE_MODEL: gpu *
  CHPL_GPU: amd *
CHPL_COMM: none *
CHPL_TASKS: qthreads
CHPL_LAUNCHER: slurm-srun
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_TARGET_MEM: jemalloc
CHPL_ATOMICS: cstdlib
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_RE2: bundled
CHPL_LLVM: bundled *
CHPL_AUX_FILESYS: none
- Back-end compiler and version, e.g. `gcc --version` or `clang --version`:
clang version 19.1.3
Target: x86_64-unknown-linux-gnu
Thread model: posix
- (For Cray systems only) Output of `module list`:
Currently Loaded Modules:
  1) craype-x86-rome                          42) ncurses/6.5-gcc-13.2.1-kdr23qt
  2) libfabric/1.22.0                         43) gettext/0.22.5-gcc-13.2.1-p2f4deg
  3) craype-network-ofi                       44) libiconv/1.17-gcc-13.2.1-gdkuk7h
  4) perftools-base/25.03.0                   45) libxml2/2.13.4-gcc-13.2.1-apyobe2
  5) xpmem/2.10.6-1.2_gfaa90a94be64           46) xz/5.4.6-gcc-13.2.1-gqdlkvs
  6) craype/2.7.34                            47) tar/1.34-gcc-13.2.1-n3n6l4p
  7) cray-dsmml/0.3.0                         48) pigz/2.8-gcc-13.2.1-yd4x355
  8) PrgEnv-gnu/8.6.0                         49) zstd/1.5.6-gcc-13.2.1-gahi2ot
  9) gcc-native/12.3                          50) libffi/3.4.6-gcc-13.2.1-yik7uxd
 10) cray-libsci/25.03.0                      51) openssl/3.4.0-gcc-13.2.1-qcgcj4v
 11) cray-mpich/8.1.32                        52) sqlite/3.46.0-gcc-13.2.1-mutpb3w
 12) cray-pmi/6.1.15                          53) util-linux-uuid/2.40.2-gcc-13.2.1-iarekqh
 13) bison/3.8.2-gcc-13.2.1-zftzcma           54) python-venv/1.0-gcc-13.2.1-k2nnnlw
 14) libyaml/0.2.5-gcc-13.2.1-ksgkxco         55) binutils/2.38-gcc-13.2.1-qyvlf6a
 15) py-protobuf/4.21.9-gcc-13.2.1-pat37oz    56) hwloc/2.11.1-gcc-13.2.1-muzlioq
 16) llvm/19.1.3-gcc-13.2.1-h36hpqx           57) libpciaccess/0.17-gcc-13.2.1-xivclig
 17) doxygen/1.9.6-gcc-13.2.1-23f74tx         58) libedit/3.1-20240808-gcc-13.2.1-xoztj3y
 18) libevent/2.1.12-gcc-13.2.1-3yqtiri       59) libsodium/1.0.20-gcc-13.2.1-a477qyy
 19) libzmq/4.3.5-gcc-13.2.1-4ifxh4t          60) curl/8.10.1-gcc-13.2.1-mbwlsb2
 20) git/2.47.0-gcc-13.2.1-hqm7kh4            61) nghttp2/1.63.0-gcc-13.2.1-qpibv7e
 21) cmake/3.30.5-gcc-13.2.1-j2j3jif          62) libidn2/2.3.7-gcc-13.2.1-bgkp6qs
 22) tmux/3.4-gcc-13.2.1-n42eord              63) libunistring/1.2-gcc-13.2.1-kpzxvqz
 23) gdb/15.2-gcc-13.2.1-sdgp7qh              64) openssh/9.9p1-gcc-13.2.1-rouko7o
 24) fltk/1.3.7-gcc-13.2.1-w7hojxd            65) krb5/1.21.3-gcc-13.2.1-lomv5tx
 25) flex/2.6.3-gcc-13.2.1-fjpaig3            66) libxcrypt/4.4.35-gcc-13.2.1-uxhoowv
 26) valgrind/3.20.0-gcc-13.2.1-7lczz2f       67) pcre2/10.44-gcc-13.2.1-mz47bfc
 27) hdf5/1.14.5-gcc-13.2.1-fmt553m           68) perl/5.40.0-gcc-13.2.1-2y7eaux
 28) gcc-runtime/13.2.1-gcc-13.2.1-vahrhoh    69) berkeley-db/18.1.40-gcc-13.2.1-4qyv3el
 29) glibc/2.38-gcc-13.2.1-iqjwcft            70) gmake/4.4.1-gcc-13.2.1-w6qrrru
 30) m4/1.4.19-gcc-13.2.1-v3wcfui             71) gmp/6.3.0-gcc-13.2.1-ddqjju3
 31) libsigsegv/2.14-gcc-13.2.1-6jcz7kv       72) mpfr/4.2.1-gcc-13.2.1-xxywphw
 32) protobuf/3.21.7-gcc-13.2.1-crnb26l       73) libx11/1.8.10-gcc-13.2.1-dn4vehr
 33) zlib-ng/2.2.1-gcc-13.2.1-m5gitv5         74) kbproto/1.0.7-gcc-13.2.1-oofkfgd
 34) py-setuptools/69.2.0-gcc-13.2.1-5apvy5n  75) libxcb/1.17.0-gcc-13.2.1-k5gvvxd
 35) python/3.13.0-gcc-13.2.1-37bcfls         76) libpthread-stubs/0.5-gcc-13.2.1-iwugd45
 36) bzip2/1.0.8-gcc-13.2.1-ii4rnb2           77) libxau/1.0.11-gcc-13.2.1-3r675ao
 37) expat/2.6.4-gcc-13.2.1-bz7xipg           78) xproto/7.0.31-gcc-13.2.1-2ydbsyf
 38) libbsd/0.12.2-gcc-13.2.1-bh6keyh         79) libxdmcp/1.1.5-gcc-13.2.1-pqz77dj
 39) libmd/1.0.4-gcc-13.2.1-l2ajh5j           80) xtrans/1.5.2-gcc-13.2.1-rhvyaqj
 40) gdbm/1.23-gcc-13.2.1-7hnsmh5             81) rocm/6.2.1
 41) readline/8.2-gcc-13.2.1-72k2jhe
@ShreyasKhandekar
Contributor Author

This is probably related to https://github.com/Cray/chapel-private/issues/6739

@ShreyasKhandekar
Contributor Author

Things that we've tried to fix this so far:

  1. Compared the AST logs with and without --fast: no significant difference, which is good and expected.

  2. Dropped the optimization level we pass to the LLVM backend to -O2, then -O1, then -O0. The errors/failures remained at -O2 and -O1, and did not manifest at -O0 (which effectively turns --fast into a no-op).

This is not terrible, because it is at least consistent with the no-failure behavior we see without --fast.

Running with --debugGpu points to the error coming from chpl_gpu_kernel_shoc_HYPHEN_sort_line_510_, which is the main kernel in shoc-sort. This means the failure has nothing to do with the results.chpl file at all, which helps us isolate the problem a little bit.

Based on the investigation so far, the most likely culprits are:

  1. We generate incorrect LLVM IR that looks innocent but contains bugs that only manifest with --fast.
  2. LLVM has a bug for ROCm 6 and generates something wrong when optimizations are enabled.

To rule out (1), the next step would be to compare the LLVM IR we generate with --fast against what we generate without it.
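
One way to make that comparison (a sketch, assuming the --savec flag preserves the bitcode that chpl's LLVM backend produces and that llvm-dis is available to turn it back into readable IR; adjust paths to your build):

```shell
# Dump the LLVM IR generated with and without --fast, then diff it.
# Guarded so the script is a no-op on machines without chpl installed.
if command -v chpl >/dev/null 2>&1; then
  chpl --fast --savec=ir-fast   shoc-sort.chpl
  chpl        --savec=ir-nofast shoc-sort.chpl
  for bc in ir-fast/*.bc ir-nofast/*.bc; do
    llvm-dis "$bc"        # writes a matching .ll next to each .bc
  done
  diff -u ir-nofast/*.ll ir-fast/*.ll > fast-vs-nofast.diff || true
  echo "wrote fast-vs-nofast.diff"
else
  echo "chpl not found; skipping IR comparison"
fi
```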

@jabraham17
Member

I am finding that the error occurs after the kernel launch finishes, and it fails on this line:

wait_stream(cfg->stream);

This doesn't match my intuition for an LLVM miscompilation, but it also doesn't make sense why this would only cause issues with --fast.

I also tried turning off eager synchronization and was able to get farther through the test, but it still failed, this time with a different error:

finished kernel launch
Kernel launcher returning. (subloc 0)
	Kernel: chpl_gpu_kernel_shoc_HYPHEN_sort_line_427_
Deinitialized kernel config
Using stream: 0x7f43bc6e0b70 (subloc 0)
Copying 16 bytes from device to host on stream 0x7f43bc6e0b70
Memory access fault by GPU node-4 (Agent handle: 0x560b081774d0) on address 0x7f2261201000. Reason: Unknown.
srun: error: task 0: Aborted (core dumped)

@e-kayrakli
Contributor

This is an indication of memory corruption: the kernel does something that it shouldn't. Unless you use a GPU debugger, you see the kernel's failure at a synchronization point (i.e., wait_stream). Turning off eager sync just lets things go a while further, but then that stream still has to be synchronized when it is time to execute the next operation on the GPU (including a data transfer). You see the issue at that implicit synchronization point, and I believe that's what you are observing here.
