[NVPTX][InferAS] assume alloca instructions are in local AS #121710
Conversation
@llvm/pr-subscribers-llvm-selectiondag @llvm/pr-subscribers-backend-nvptx
Author: Alex MacLean (AlexMaclean)
Changes
Full diff: https://github.com/llvm/llvm-project/pull/121710.diff
3 Files Affected:
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
index 4ec2ec100ab08d..3fe52fb33c5bd0 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
@@ -20,6 +20,7 @@
#include "llvm/IR/Value.h"
#include "llvm/Support/Casting.h"
#include "llvm/Support/ErrorHandling.h"
+#include "llvm/Support/NVPTXAddrSpace.h"
#include "llvm/Transforms/InstCombine/InstCombiner.h"
#include <optional>
using namespace llvm;
@@ -562,4 +563,11 @@ Value *NVPTXTTIImpl::rewriteIntrinsicWithAddressSpace(IntrinsicInst *II,
}
}
return nullptr;
-}
\ No newline at end of file
+}
+
+unsigned NVPTXTTIImpl::getAssumedAddrSpace(const Value *V) const {
+ if (isa<AllocaInst>(V))
+ return ADDRESS_SPACE_LOCAL;
+
+ return -1;
+}
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
index 0f4fb280b2d996..a378b8a37f7043 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
@@ -129,6 +129,8 @@ class NVPTXTTIImpl : public BasicTTIImplBase<NVPTXTTIImpl> {
Value *rewriteIntrinsicWithAddressSpace(IntrinsicInst *II, Value *OldV,
Value *NewV) const;
+
+ unsigned getAssumedAddrSpace(const Value *V) const;
};
} // end namespace llvm
diff --git a/llvm/test/Transforms/InferAddressSpaces/NVPTX/alloca.ll b/llvm/test/Transforms/InferAddressSpaces/NVPTX/alloca.ll
new file mode 100644
index 00000000000000..fa063cdf8d8054
--- /dev/null
+++ b/llvm/test/Transforms/InferAddressSpaces/NVPTX/alloca.ll
@@ -0,0 +1,17 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -S -passes=infer-address-spaces %s | FileCheck %s
+
+target triple = "nvptx64-nvidia-cuda"
+
+
+define float @load_alloca() {
+; CHECK-LABEL: define float @load_alloca() {
+; CHECK-NEXT: [[ADDR:%.*]] = alloca float, align 4
+; CHECK-NEXT: [[TMP1:%.*]] = addrspacecast ptr [[ADDR]] to ptr addrspace(5)
+; CHECK-NEXT: [[VAL:%.*]] = load float, ptr addrspace(5) [[TMP1]], align 4
+; CHECK-NEXT: ret float [[VAL]]
+;
+ %addr = alloca float
+ %val = load float, ptr %addr
+ ret float %val
+}
LG
llvm/test/tools/UpdateTestChecks/update_llc_test_checks/Inputs/nvptx-basic.ll.expected
; CHECK-NEXT: ld.u64 %rd5, [%SP+8];
; CHECK-NEXT: ld.u64 %rd6, [%SP];
; CHECK-NEXT: ld.u64 %rd7, [%SP+24];
; CHECK-NEXT: ld.u64 %rd8, [%SP+16];
Why is this being introduced?
These loads are here because the kernel parameter is copied to local memory and then loaded back so that it may be passed to the callee. Previously SDAG was able to see that the store above was to the same address as the load, but now that the stores are to local memory this is not possible (the sketch below illustrates the pattern). To fix, we should:
- Look through addrspacecasts in the DAG combiner, though I'm not sure whether this is valid for all targets
- In InferAS, propagate new addrspaces into byval args of calls, since the pointer type doesn't actually matter to the callee.
Overall, while this change may interfere with this sort of load elimination for byval callees, this is a fairly minor edge case that can be addressed in a follow-up.
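For illustration, here is a minimal sketch of the pattern in question, assuming the nvptx64 triple; the struct layout and the names `%struct.S`, `@callee`, and `@caller` are made up for this example and are not taken from the actual test.

```llvm
target triple = "nvptx64-nvidia-cuda"

%struct.S = type { i64, i64, i64, i64 }

declare void @callee(ptr byval(%struct.S))

; A kernel receives an aggregate byval and re-passes it byval. The copy of the
; aggregate is not visible in the IR; it is materialized only during lowering,
; as stores to a stack slot followed by reloads of the same bytes. If the
; stores go through addrspace(5) while the reloads feed a generic byval
; pointer, SDAG can no longer match them up and eliminate the reloads.
define ptx_kernel void @caller(ptr byval(%struct.S) %in) {
  call void @callee(ptr byval(%struct.S) %in)
  ret void
}
```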
Assuming you mean @Artem-B
> Overall, while this change may interfere with this sort of load elimination for byval callees, this is a fairly minor edge case that can be addressed in a follow-up.
This may be more of a problem than it appears. While stores are posted and only waste some memory bandwidth without slowing down execution too much, loads will stall, and for the "local" memory that would likely mean a round-trip to the L2 cache or memory and back for a large enough grid.
I'm fairly confident that this change will upset some of our users -- we have been burned by local loads/stores in the past and actively monitor generated code for additional local memory accesses.
Considering that there's no way to mitigate the impact, I'd prefer either to wait for that follow-up improvement to land first (I think we may already have cases where unexpected reloads from local memory happen) or to combine it with this change and land them at the same time.
@Artem-B / @justinfargnoli I've added some additional logic to the backend to address this case. The core problem here is that the byval copy is hidden until we lower out of LLVM IR and thus must be optimized solely by SDAG, which isn't meant for this sort of thing. In any case, I've added some logic to improve handling of frame indices passed to byval args, and made some other updates and bugfixes as well.
Let me know what you think and if any of these changes should be spun out into separate PRs.
✅ With the latest revision this PR passed the C/C++ code formatter.
@@ -954,6 +954,12 @@ static void AddNodeIDCustom(FoldingSetNodeID &ID, const SDNode *N) {
    ID.AddInteger(M);
    break;
  }
  case ISD::ADDRSPACECAST: {
    const AddrSpaceCastSDNode *ASC = cast<AddrSpaceCastSDNode>(N);
Unrelated change
I've spun this out into #122912
return ADDRESS_SPACE_LOCAL;
This is worse than just changing the alloca addrspace
While I tend to agree that converting all allocas to the local addrspace would be a good idea, it's also a deep can of worms. As @Artem-B points out in #106127 (comment), optimizations may not handle allocas in a specific AS as well, and supporting local allocas would likely require changes in the backend and datalayout.
I do hope to pursue this in the future, but in the meantime this change is both simple and correct. Even if we were to address the above, it would still seem preferable to handle generic allocas here in case InferAS happens to encounter one in the IR.
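For context, the "just change the alloca addrspace" alternative would roughly amount to declaring an alloca address space in the datalayout (the `A<n>` component), the way AMDGPU does with `A5`. A hypothetical sketch, noting that the `-A5` suffix is not part of the current NVPTX datalayout and the rest of the string is only approximate:

```llvm
; Hypothetical: with "A5" in the datalayout, front ends and passes would place
; allocas directly in address space 5 (local), with no inference needed.
target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64-A5"
target triple = "nvptx64-nvidia-cuda"

define float @local_alloca() {
  %addr = alloca float, align 4, addrspace(5)
  store float 1.0, ptr addrspace(5) %addr, align 4
  %val = load float, ptr addrspace(5) %addr, align 4
  ret float %val
}
```

As noted above, though, making the whole stack live in AS 5 would need backend and datalayout changes, which is the "can of worms" part; the getAssumedAddrSpace hook only tells InferAS what to assume about generic allocas it happens to see.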
The only real optimization loss from using the non-0 addrspace is that it breaks folding compares to null, which rarely matters if you can see the original alloca. We should also make null handling for non-0 address spaces datalayout-configurable.
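To make the trade-off concrete, here is a small illustrative example (not taken from this PR) of the null-compare fold being referenced:

```llvm
; An alloca in address space 0 is known to be non-null, so this compare folds
; to false. For a non-0 address space, null is not necessarily an invalid
; address, so the fold has to be conservative unless the target says otherwise,
; hence the suggestion to make null handling datalayout-configurable.
define i1 @alloca_vs_null() {
  %p = alloca i32, align 4
  %cmp = icmp eq ptr %p, null
  ret i1 %cmp
}
```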
LGTM with a minor nit.
; PTX32: ld.param.u32 %r{{[0-9]+}}, [foo_param_0];
; PTX32: st.volatile.u32 [%SP], %r{{[0-9]+}};
; PTX32: add.u32 %r[[SP_REG:[0-9]+]], %SPL, 0;
It would be useful to autogenerate the checks here.
; PTX64: ld.param.u32 %r{{[0-9]+}}, [foo_param_0];
; PTX64: st.volatile.u32 [%SP], %r{{[0-9]+}};
; PTX64: add.u64 %rd[[SP_REG:[0-9]+]], %SPL, 0;
Nit: We don't really need to add "0" and could just use %SPL directly. This is likely cosmetic, and makes no difference in SASS.
; PTX-NEXT: add.u64 %rd1, %SP, 0;
; PTX-NEXT: add.u64 %rd2, %SPL, 0;
Ditto.