[Bug]: Sporadic ROCm Memory Access errors with SHOC Sort #26669
Comments
This is probably related to https://github.com/Cray/chapel-private/issues/6739
Things that we've tried to fix this so far:
This is not terrible, because it is at least consistent with what we see without Running with Based on the investigation so far, the most likely culprits are
To rule out (1) the next step would be to look into the generated LLVM code and compare what we generate with
I am finding that the error occurs after the kernel launch finishes and fails on this line: Line 915 in cb9e9f7
--fast
I also tried turning off eager synchronization and was able to get farther through the test, but it still failed. This time with a different error
This is an indication of memory corruption: the kernel does something that it shouldn't. Unless you use a GPU debugger, you only see the kernel's failures at synchronization (i.e.
Summary of Problem
Description:
test/gpu/native/studies/shoc/shoc-sort fails occasionally on ROCm 6 on EX when compiled with --fast with the following error:
Other failure modes include a more helpful error: `Reason: Write access to a read-only page.` instead of `Reason: Unknown.`
Or sometimes the sort completes but is incorrect:
Without --fast we don't see the failure.
Is this issue currently blocking your progress?
no, it's causing nightly failures
Steps to Reproduce
test/gpu/native/studies/shoc/shoc-sort on ROCm 6 with --fast
Compile command:
chpl --fast --output=false shoc-sort.chpl
Note that this file was refactored to not use results.chpl or the ResultDB record to isolate the problem a little bit.
Execution command:
We run it 10 times to allow the sporadic failures to manifest
for i in {1..10}; do ./shoc-sort --passes=1; done
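Because the failure is sporadic, it can help to count how many of the repeated runs actually fail rather than eyeballing the output. The following is a minimal sketch of such a wrapper, not part of the test suite: the function name `count_failures`, the `LOG_DIR` variable, and the `run-N.log` file names are all hypothetical. It only detects runs that exit with a non-zero status, so a run where the sort completes but is incorrect would go unnoticed unless the program itself reports that through its exit status.

```shell
# Run a command repeatedly and report how many runs exited with a non-zero status.
# Usage: count_failures RUNS CMD [ARGS...]
# Each run's output goes to $LOG_DIR/run-N.log (current directory by default).
count_failures() {
  runs=$1
  shift
  failures=0
  i=1
  while [ "$i" -le "$runs" ]; do
    log="${LOG_DIR:-.}/run-$i.log"
    status=0
    # Capture stdout and stderr so the error text of a failing run is preserved.
    "$@" > "$log" 2>&1 || status=$?
    if [ "$status" -ne 0 ]; then
      failures=$((failures + 1))
      echo "run $i failed with exit status $status (output in $log)"
    fi
    i=$((i + 1))
  done
  echo "$failures of $runs runs failed"
}
```

With the function defined, the loop above could be replaced by `count_failures 10 ./shoc-sort --passes=1`, and the per-run logs can then be inspected for the `Reason:` line of each failing run.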
Associated Future Test(s):
none
Configuration Information
- Output of `chpl --version`:
- Output of `$CHPL_HOME/util/printchplenv --anonymize`:
- Back-end compiler and version, e.g. `gcc --version` or `clang --version`:
- (For Cray systems only) Output of `module list`: