Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NVPTX][InferAS] assume alloca instructions are in local AS #121710

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

AlexMaclean
Copy link
Member

No description provided.

@llvmbot
Copy link
Member

llvmbot commented Jan 5, 2025

@llvm/pr-subscribers-llvm-selectiondag
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-nvptx

Author: Alex MacLean (AlexMaclean)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/121710.diff

3 Files Affected:

  • (modified) llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp (+9-1)
  • (modified) llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h (+2)
  • (added) llvm/test/Transforms/InferAddressSpaces/NVPTX/alloca.ll (+17)
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
index 4ec2ec100ab08d..3fe52fb33c5bd0 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
@@ -20,6 +20,7 @@
 #include "llvm/IR/Value.h"
 #include "llvm/Support/Casting.h"
 #include "llvm/Support/ErrorHandling.h"
+#include "llvm/Support/NVPTXAddrSpace.h"
 #include "llvm/Transforms/InstCombine/InstCombiner.h"
 #include <optional>
 using namespace llvm;
@@ -562,4 +563,11 @@ Value *NVPTXTTIImpl::rewriteIntrinsicWithAddressSpace(IntrinsicInst *II,
   }
   }
   return nullptr;
-}
\ No newline at end of file
+}
+
+unsigned NVPTXTTIImpl::getAssumedAddrSpace(const Value *V) const {
+  if (isa<AllocaInst>(V))
+    return ADDRESS_SPACE_LOCAL;
+
+  return -1;
+}
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
index 0f4fb280b2d996..a378b8a37f7043 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
@@ -129,6 +129,8 @@ class NVPTXTTIImpl : public BasicTTIImplBase<NVPTXTTIImpl> {
 
   Value *rewriteIntrinsicWithAddressSpace(IntrinsicInst *II, Value *OldV,
                                           Value *NewV) const;
+
+  unsigned getAssumedAddrSpace(const Value *V) const;
 };
 
 } // end namespace llvm
diff --git a/llvm/test/Transforms/InferAddressSpaces/NVPTX/alloca.ll b/llvm/test/Transforms/InferAddressSpaces/NVPTX/alloca.ll
new file mode 100644
index 00000000000000..fa063cdf8d8054
--- /dev/null
+++ b/llvm/test/Transforms/InferAddressSpaces/NVPTX/alloca.ll
@@ -0,0 +1,17 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -S -passes=infer-address-spaces %s | FileCheck %s
+
+target triple = "nvptx64-nvidia-cuda"
+
+
+define float @load_alloca() {
+; CHECK-LABEL: define float @load_alloca() {
+; CHECK-NEXT:    [[ADDR:%.*]] = alloca float, align 4
+; CHECK-NEXT:    [[TMP1:%.*]] = addrspacecast ptr [[ADDR]] to ptr addrspace(5)
+; CHECK-NEXT:    [[VAL:%.*]] = load float, ptr addrspace(5) [[TMP1]], align 4
+; CHECK-NEXT:    ret float [[VAL]]
+;
+  %addr = alloca float
+  %val = load float, ptr %addr
+  ret float %val
+}

Copy link
Member

@jdoerfert jdoerfert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG

Comment on lines 28 to 31
; CHECK-NEXT: ld.u64 %rd5, [%SP+8];
; CHECK-NEXT: ld.u64 %rd6, [%SP];
; CHECK-NEXT: ld.u64 %rd7, [%SP+24];
; CHECK-NEXT: ld.u64 %rd8, [%SP+16];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this being introduced?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These loads are here because the kernel parameter is copied to local memory and then loaded back so that it may be passed to the callee. Previously SDAG was able to see that the store above was to the same address as the load, but now that the stores are to local mem this is not possible. To fix we should:

  • Look through addrspacecasts in DAG Combiner, though I'm not sure if this is valid for all targets
  • In InferAS, propagate new addrspaces into byval args of calls, since the pointer type doesn't actually matter to the callee.

Overall, while this change may interfere with this sort of load elimination for byval callees, this is a fairly minor edge case that can be addressed in a follow up.

Copy link
Contributor

@justinfargnoli justinfargnoli Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change may interfere with this sort of load elimination for byval callees, this is a fairly minor edge case that can be addressed in a follow up.

@artem do you agree with this judgment call? If so, then this PR LGTM 👍

Edit: @Artem-B

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming you mean @Artem-B

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, while this change may interfere with this sort of load elimination for byval callees, this is a fairly minor edge case that can be addressed in a follow up.

This may be more of a problem than it appears. While stores are posted and only waste some memory bandwidth without slowing down execution too much, loads will stall, and for the "local" memory that would likely mean a round-trip to the L2 cache or memory and back for a large enough grid.

I'm fairly confident that this change will upset some of our users -- we have been burned by local loads/stores in the past and actively monitor generated code for additional local memory accesses.

Considering that there's no way to mitigate the impact, I'd prefer to wait for that follow-up improvement to either land first (I think we already may have cases where unexpected reloads from local memory happen) or combine them with this change and land them at the same time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Artem-B / @justinfargnoli I've added some additional logic to the backend to address this case. The core problem here is that the byval copy is hidden until we lower out of LLVM IR and thus must be optimized solely by SDAG, which isn't meant for this sort of thing. In any case, I've added some logic to improve handling of frame indices passed to byval args, and made some other updates and bugfixes as well.

Let me know what you think and if any of these changes should be spun out into separate PRs.

@AlexMaclean AlexMaclean force-pushed the dev/amaclean/upstream-ias-alloca branch from 5c69260 to 18aaec5 Compare January 13, 2025 23:35
@llvmbot llvmbot added the llvm:SelectionDAG SelectionDAGISel as well label Jan 13, 2025
@AlexMaclean AlexMaclean requested a review from arsenm January 14, 2025 00:55
@AlexMaclean AlexMaclean force-pushed the dev/amaclean/upstream-ias-alloca branch from 18aaec5 to 2bd05a9 Compare January 14, 2025 01:02
Copy link

github-actions bot commented Jan 14, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@AlexMaclean AlexMaclean force-pushed the dev/amaclean/upstream-ias-alloca branch from 2bd05a9 to c6db77f Compare January 14, 2025 01:26
@@ -954,6 +954,12 @@ static void AddNodeIDCustom(FoldingSetNodeID &ID, const SDNode *N) {
ID.AddInteger(M);
break;
}
case ISD::ADDRSPACECAST: {
const AddrSpaceCastSDNode *ASC = cast<AddrSpaceCastSDNode>(N);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated change

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've spun this out into #122912

Comment on lines +569 to +571
return ADDRESS_SPACE_LOCAL;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is worse than just changing the alloca addrspace

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I tend to agree that converting all allocas to local addrspace would be a good idea, it's also a deep can of worms. As @Artem-B points out here #106127 (comment), optimizations may not handle allocas in specific AS as well, and supporting local allocas would likely require changes in the backend and datalayout.

I do hope to pursue this in the future, but in the meantime this change is both simple and correct. Even if we were to address the above it still seems preferable to handle generic allocas here in case InferAS happens to encounter one in the IR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only optimization using the non-0 addrspace really breaks folding compares to null out, which rarely matters if you can see the original alloca. We should also make the non-0 address space null handling datalayout configurable

Copy link
Member

@Artem-B Artem-B left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a minor nit.

; PTX32: ld.param.u32 %r{{[0-9]+}}, [foo_param_0];
; PTX32: st.volatile.u32 [%SP], %r{{[0-9]+}};
; PTX32: add.u32 %r[[SP_REG:[0-9]+]], %SPL, 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be useful to autogenerate the checks here.

; PTX64: ld.param.u32 %r{{[0-9]+}}, [foo_param_0];
; PTX64: st.volatile.u32 [%SP], %r{{[0-9]+}};
; PTX64: add.u64 %rd[[SP_REG:[0-9]+]], %SPL, 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: We don't really need to add "0" and could just use %SPL directly. This is likely cosmetic, and makes no difference in SASS.

Comment on lines +40 to +41
; PTX-NEXT: add.u64 %rd1, %SP, 0;
; PTX-NEXT: add.u64 %rd2, %SPL, 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

@AlexMaclean AlexMaclean force-pushed the dev/amaclean/upstream-ias-alloca branch from c6db77f to 369dbeb Compare January 21, 2025 21:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants