-
Notifications
You must be signed in to change notification settings - Fork 645
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for executable duplication in encoding specialization pass. #19527
base: main
Are you sure you want to change the base?
Conversation
The PR depends on #19502. It is ready for review, which compares the diff to the parent mirror branch. |
6e605a4
to
9e70643
Compare
cb19fce
to
86dd9b0
Compare
931032c
to
4ca9b91
Compare
4ca9b91
to
a150f11
Compare
I spent some time on the bazel failure, but I can't reproduce it on cmake build even with ASAN enabled. I'll try to set up the bazel one my env, the PR is ready to review. Please take a look, thanks! |
Duplicates stream.executables based on the affinity analysis of stream.async.dispatch ops. Some executables can be launched by different devices. It can produce wrong codegen artifacts when bindings types are encoded (i.e., the tensor type has an encoding attribute). Because they can result in different layouts, especially when multi-device is involved. E.g., say that device_a and device_b interpret a tensor type with encodings in different layouts, and there is an executable that can be launch with resources from either device_a or device_b. It is confusing what the input layouts for the executable because there are two possibilities. In this case, we have to duplicate the executable with updated encoding, and modify the dispatch to launch proper executable based on device analysis. Signed-off-by: hanhanW <[email protected]>
989dc8c
to
8ba01c3
Compare
I think I'd be able to reproduce it if I build it in release mode. I put a statement in an assertion because I wanted to check if the return value is true or not. However, the statement is dropped in release build.. |
Signed-off-by: hanhanW <[email protected]>
8ba01c3
to
c2756a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to duplicate based on affinity - in your test we can see why: we don't want to double compile executables that are going to have the same exact compilation flow as we're just doubling our compile time (or on an 8 way config x8!).
IREE::Stream::AffinityAttr
has a areCompatible
/ canExecuteTogether
which indicates when two operations can execute in the same place (which is not what you want - that's for execution scheduling). Here you need something like areTranslationCompatible
which indicates when two affinities share the same execution configuration and only duplicate when that's not the case (you want the smallest set of unique executable targets). We could probably rename areCompatible
to areExecutionCompatible
to differentiate (at some point we'll add an areAllocationCompatible
for unified memory systems and stuff where execution and translation may differ but buffers can be shared).
Note to make the compatibility checks scale better you can gather all (like you do), use llvm::EquivalenceClasses
+ areTranslationCompatible
to build the sets of compatible devices, then run down your executables and duplicate for each non-intersecting set (this avoids doing areTranslationCompatible
n^2).
You'll want to have a test like you have with two devices that share targets ensuring they don't get duplicated and two devices with different targets ensuring they do.
MLIRContext *ctx = moduleOp.getContext(); | ||
IRRewriter rewriter(ctx); | ||
|
||
// 1. Gather per-export [execution affinity -> [resource affinities]] map. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: prefer not using numbers - they always end up wrong when code is moved around or when code is inserted/removed. You can call out preconditions/postconditions if blocks of code do have to be performed in order (e.g. "build cache of valid ops after removing dead ops" etc)
if the algorithm has a flow then that can be documented in the function comment where someone landing on the function wants to know how it works. By keeping them inline a reader has to read the whole function (like I just did, wondering where 2 was for a page :)
iree/compiler/src/iree/compiler/Dialect/Stream/IR/StreamInterfaces.td Lines 108 to 116 in 36c2353
I found the above comment in In the
IIUC, the
Do I understand it correctly? Meanwhile, I'm going to study the compatibility things in Stream dialect. It's a new thing to me! |
I just found that the current implementation is simple. It does not consider the case that I listed above. I think I can start with a similar version. iree/compiler/src/iree/compiler/Dialect/Stream/IR/StreamTypes.cpp Lines 375 to 384 in 36c2353
|
the executable targets are what we care about matching, not the affinities - we don't want to cluster by affinities, but by those sets of targets, and then generate based on required sets of targets. CUDA GPU 0 and CUDA GPU 1 (two different affinities) have the same executable targets and we don't want to duplicate, CUDA GPU 0, CUDA GPU 1, and CPU (three different affinities) have CUDA GPU 0 and CUDA GPU 1 sharing the same targets but the CPU is different and we'd want one copy for the CUDA GPUs and one copy for the CPU (because we're talking about targets, not affinities). |
(that may be what you're saying, I just realized I didn't word what I said very well :) |
Signed-off-by: hanhanW <[email protected]>
I see what you mean. I also need to implement |
(I'll rebase after I land #19726) |
Duplicates stream.executables based on the affinity analysis of stream.async.dispatch ops. Some executables can be launched by different devices. It can produce wrong codegen artifacts when bindings types are encoded (i.e., the tensor type has an encoding attribute). Because they can result in different layouts, especially when multi-device is involved. E.g., say that device_a and device_b interpret a tensor type with encodings in different layouts, and there is an executable that can be launch with resources from either device_a or device_b. It is confusing what the input layouts for the executable because there are two possibilities. In this case, we have to duplicate the executable with updated encoding, and modify the dispatch to launch proper executable based on device analysis.