
Merge OpenAI Triton commit a637eb2 #3330

Merged 34 commits into main on Feb 2, 2025
Conversation

whitneywhtsang
Contributor

@whitneywhtsang whitneywhtsang commented Feb 1, 2025

This PR changes the Triton base from f47cc3e to a637eb2 (Feb 1).
Pass rate: 98.19%

Please do not squash and merge this PR.

ThomasRaoux and others added 30 commits January 29, 2025 21:24
recommit triton-lang/triton#5671

Don't fold allow_reorder reshape ops. This preserves the reshape and,
importantly, the allow_reorder bit, so that OptimizeThreadLocality knows
it may choose a layout for the reduction in the case when the src and dst
shapes of the reshape are the same.

Change CanonicalizeConvertFromReshape to run on efficient_layout. This
allows removal of the convert layout op between the reshape and
reduction.


# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [x] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [ ] I have not added any `lit` tests.
  - [x] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
    including the "tests should be minimal" section. (Usually running Python code
    and using the instructions it generates is not minimal.)


Co-authored-by: aeng-openai <[email protected]>
This is a re-land of #5610 with a fix to DotOperandEncodingAttr to
resolve a crash exposed in internal tests.

At the same time, this slightly alters the implementation of the
algorithm to ensure that layouts don't get pushed back out of
conditionals by the forward propagation pass. Originally, the goal of
hoisting across conditionals was to ensure that cvts in fused inner
loops are placed inside the prologue before pipelining peels the
prologue and introduces a chain of dependencies.

My guess is that @lezcano's changes to strengthen layout propagation
enabled the forward propagation pass to push the conversion back out of
the prologue even after pipelining (which is good), but I explicitly
disabled hoisting into chains of conditionals, so these didn't balance
out. This PR alters the algorithm to consider hoisting cvts across
chains of conditionals only for subslices inside for loops.
If wgmma uses operands in registers, it can't be pipelined
(overlapping with the mma from the previous iteration), as we would end up
re-using the same registers while the mma is still in flight. In such cases
we don't need to increase the number of buffers used by the operands of
that mma.
This change picks a low-hanging fruit by not increasing the number of
buffers of the register-based operand. We should go a step further and
not increase numBuffers for any operand of such an mma.
Now `make test` runs the exact commands used in CI, so there should be
no spurious local failures due to the specific `pytest` command used.
The `getAllowReorder` check was added in #2676, but the
canonicalizations are value-preserving so this is not required.

Specifically:
- `reshape(splat) -> splat`: order is irrelevant for a splat
- `reshape(reshape) -> reshape`: reshape essentially treats the input as
1-D, so the inner reshape has no effect.
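The two canonicalizations above can be illustrated in plain Python (a minimal sketch with hypothetical `splat`/`reshape` helpers, not Triton's actual implementation), modeling a tensor as its flat element list plus a shape, which is exactly how reshape sees it:

```python
def splat(value, shape):
    # A splat is the same value in every element; shape only sets the count.
    n = 1
    for d in shape:
        n *= d
    return {"shape": shape, "flat": [value] * n}

def reshape(t, shape):
    # reshape treats the input as 1-D: the flat data is untouched.
    return {"shape": shape, "flat": t["flat"]}

# reshape(splat) -> splat: element order is irrelevant for a splat.
assert reshape(splat(7, (2, 3)), (3, 2)) == splat(7, (3, 2))

# reshape(reshape) -> reshape: the inner reshape leaves the flat data alone.
t = {"shape": (4,), "flat": [1, 2, 3, 4]}
assert reshape(reshape(t, (2, 2)), (1, 4)) == reshape(t, (1, 4))
```

Since both rewrites leave the flat data intact, they are value-preserving regardless of any `allow_reorder` flag.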
We show the difference between two profiles on the metric we specify.
… independence (#5762)

In our case, this allows us to successfully run `lit` tests on Windows.

Signed-off-by: Anatoly Myachev <[email protected]>
The interpreter doesn't support the `_generator` special argument at the
moment.
This was a temporary workaround during Blackwell development; LLVM now
understands sm_100a and ptx86.
… size (#5658)

Theoretical memory allocation size refers to the maximum total memory
size of non-overlapping memory objects within the live range of a
function.
Otherwise, sometimes on macOS we may generate unnecessary files with a
`_` prefix after unpacking a tar file.
Do not simply append the dot op next to the memory ops, as doing so can
reorder dependencies of the dot op.
Fixes a test failure in test_matmul.
…outs and chains of ops (#5673)

We generalise `HoistLayoutConversion` to lift a given `convert_layout
dot_operand`
above any chain of operations that do not require data movement. We
could totally generalise this in the future to lift it over other ops.
We do
this as a first step to keep the code somewhat similar to the previous
one.

Regarding the previous limitations of `canHoistDotOpEncV2`, I did a bit
of archeology:
- The "don't hoist past select" restriction was added in issue
triton-lang/triton#2857. I ran the repro, and
with the recent layout fixes it now passes.
- The TruncOps being skipped comes from
triton-lang/triton#2181. I think this is
related to the hack that was removed in
triton-lang/triton#5044, so it should now work.
- The same goes for `UIToFpOp`; this is now supported after #5044.
- The mixed-dtype hack is not necessary either, as now everything works as
expected with the `convert_layout` rework.

We also add proper support for `isPure` for `elementwise_inline_asm` ops.

On the location of the code, we just leave it in
`RemoveLayoutConversion.cpp` to
take advantage of the rather generic implementation of `rewriteSlice`.
We could totally
move this pass outside of `remove-layout-conversion`, as it's probably
enough to run
it once. This code will go through further changes in the near future,
so we'll assess this
then.
This PR fixes a typo in the Windows implementation of `__builtin_clz`
that was introduced in #5621.

According to [this in-code
comment](https://github.com/triton-lang/triton/blob/b3dcc32f387d1d54ccd6cbbbc087296c0539e703/lib/Conversion/TritonGPUToLLVM/Utility.cpp#L12)
these Windows implementations should have been copied from [this gist
snippet](https://gist.github.com/pps83/3210a2f980fd02bb2ba2e5a1fc4a2ef0).
In the snippet, however, the `clz` implementation additionally [XORs the
result of
`_BitScanReverse`](https://gist.github.com/pps83/3210a2f980fd02bb2ba2e5a1fc4a2ef0#file-ctz_clz-cpp-L51-L53)
in order to convert the result from the *most significant bit index*
produced by `_BitScanReverse` to the expected *number of leading
zeros*. I believe the implementation was copied to Triton without
the finalizing XOR by accident.
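The off-by-XOR can be sketched in plain Python (hypothetical function names; `int.bit_length` stands in for `_BitScanReverse`): for a nonzero 32-bit value with its most significant set bit at index `i`, the leading-zero count is `31 - i`, which equals `31 ^ i` since `i` fits in five bits.

```python
def bit_scan_reverse(x):
    # Models MSVC's _BitScanReverse: index of the most significant set bit.
    assert x != 0
    return x.bit_length() - 1

def clz32_buggy(x):
    # The buggy Windows shim: returns the MSB index directly.
    return bit_scan_reverse(x)

def clz32_fixed(x):
    # The gist's version: XOR with 31 converts MSB index -> leading zeros.
    return bit_scan_reverse(x) ^ 31

assert clz32_fixed(1) == 31           # MSB index 0 -> 31 leading zeros
assert clz32_fixed(0x80000000) == 0   # MSB index 31 -> 0 leading zeros
assert clz32_buggy(1) == 0            # wrong: should be 31
```

For values that are not exactly at the boundary the buggy and fixed results diverge everywhere except `x` with MSB index 15 or 16 regions, which is why the miscompiled `pext_i32`/`delinearize` chain only surfaced in some reductions.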

**What is affected by this error?**
This implementation of CLZ is used in
[`pext_i32`](https://github.com/intel/intel-xpu-backend-for-triton/blob/4a9967137548f8fe9b1a93383e4fd12646352231/lib/Conversion/TritonGPUToLLVM/Utility.cpp#L635)
that is used in
[`delinearize`](https://github.com/intel/intel-xpu-backend-for-triton/blob/4a9967137548f8fe9b1a93383e4fd12646352231/lib/Conversion/TritonGPUToLLVM/Utility.cpp#L662)
that is used by
[`ReduceOpToLLVM`](https://github.com/intel/intel-xpu-backend-for-triton/blob/4a9967137548f8fe9b1a93383e4fd12646352231/lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp#L243-L247)
pattern. This bug caused `tt.reduce()` ops to be incorrectly lowered on
Windows in cases where shared memory is needed to store temporary
reduced results.

Signed-off-by: dchigarev <[email protected]>
The bug was causing the generated code to have CUBIN array that is twice
as large as the actual byte count.
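A hedged sketch of the kind of off-by-2x described above (the names are illustrative, not the actual Triton code): when a binary is embedded as a hex string, using the string length instead of the byte count doubles the reported array size, since each byte takes two hex characters.

```python
# Embed a binary blob the way generated code often does: as hex text.
cubin = bytes([0xDE, 0xAD, 0xBE, 0xEF])
hex_str = cubin.hex()           # "deadbeef" - two chars per byte

buggy_size = len(hex_str)       # 8: twice the actual byte count
fixed_size = len(hex_str) // 2  # 4: the real number of bytes

assert fixed_size == len(cubin)
assert buggy_size == 2 * len(cubin)
```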
…… (#5776)

This reverts PR #5673

This broke the tests on A100, even though CI was green. The CI issue
will be resolved by #5775
The flag `-Wreserved-macro-identifier` is not defined in `clang-10`,
which fails with `error: unknown warning group '-Wreserved-macro-identifier'`.
### TL;DR (too long, didn't review)

This PR re-enables the `tritonamdgpu-canonicalize-pointers` pass[^1].
It is effectively a complete rewrite, using the new [`1:N` dialect
conversion framework](llvm/llvm-project#116470), of the original pass,
which walked the AST and mutated the IR in place.
Recall a "fat pointer" is a tuple-like `(%baseptr, %offsetptr)` - the
current (original) pass keeps this tuple in a global data structure
while the new/rewritten pass emits this tuple into the IR as an
`unrealized_cast(%baseptr, %offsetptr)`[^2].

Note, this PR also rewrites the existing lit test (see [this comment
below](triton-lang/triton#5329 (comment))).

### Pass outline

The pass structure/action is roughly:

1. Perform an approximate sparse dataflow analysis to find all
transitive uses for `tt.func` args that are `tt.ptr`s; legalize only
these ops;
2. Rewrite all operations' `use`s and `result`s to be `(%baseptr,
%offsetptr)` using `ConversionPattern`s that take the new
`OneToNOpAdaptor`, which automatically forwards both `%baseptr` and
`%offsetptr` through `adaptor.getOperands()`[^3];
3. Clean up remaining `unrealized_casts` (currently only handling one
category of such remaining casts but can be extended to handle all; see
bullet 1 in TODOs).
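The approximate dataflow analysis in step 1 amounts to a forward slice: a worklist sketch in plain Python (toy value/user maps, not Triton's actual IR) of marking every transitive user of pointer-typed function args for rewriting:

```python
from collections import deque

def forward_slice(ptr_args, users):
    # users: maps each value to the values defined by ops that consume it.
    to_rewrite, work = set(), deque(ptr_args)
    while work:
        v = work.popleft()
        for u in users.get(v, []):   # heuristic "transfer function":
            if u not in to_rewrite:  # any user of a ptr-carrying value
                to_rewrite.add(u)    # must itself be rewritten
                work.append(u)
    return to_rewrite

# %arg -> %off = addptr %arg -> %val = load %off -> store (no results)
users = {"%arg": ["%off"], "%off": ["%val"], "%val": ["store"]}
assert forward_slice(["%arg"], users) == {"%off", "%val", "store"}
```

The slice terminates because it starts at `tt.ptr` args on a `tt.func` and ends at `tt.store`, which produces no results, matching the description in point 1 below.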

### Some pre-emptive call-outs

Right up front I'll say this took a long time to figure out because
**a)** the conversion framework is hugely complex and **b)** it's
currently being rewritten to be more robust/stable. As a consequence, the
implementation is complex, but I've tried hard to **a)** simplify as much
as possible, **b)** comment/note subtleties, and **c)** put in ample `assert`s
and checks to clarify intent and fail gracefully. So some things to call
out:

1. I called the dataflow analysis approximate because it does not
actually use
[DataFlow/SparseAnalysis](https://github.com/llvm/llvm-project/blob/main/mlir/lib/Analysis/DataFlow/SparseAnalysis.cpp)
and instead computes a forward slice using the heuristic "transfer
function" that users of an op with a `tt.ptr` operand should be
rewritten. This heuristic works because the forward slice starts from
`tt.ptr` args on a `tt.func` and ends at `tt.store`, which has no
results. Note, there's no reason why this component of the pass can't be
a true `SparseAnalysis` implementation, it's just that this rewrite has
already taken way longer than I expected (so I leave that for a possible
follow-up).
2. The pass uses no global `TypeConverter` but uses local
`TypeConverter`s, in `BranchInterface`/`RegionInterface` patterns. This
is because **a)** we are not actually converting operand/result types
(we are converting number of operands/results) **b)** the conversion
framework expects/handles this lack of a `TypeConverter` exactly [the
way we
want](https://github.com/llvm/llvm-project/blob/399c3a78a2577c6fc68bba7f301901a0e66e87ed/mlir/lib/Transforms/Utils/DialectConversion.cpp#L1179-L1185).
The local type converters are used for ops that couple to basic blocks
(`^bb`s) that need to have their signatures rewritten (i.e., the ops for
which we need to do `rewriter.applySignatureConversion(block,
*conversion, &localTypeConverter)`). That's `scf.for`, `scf.while`,
`cf.br` and `cf.cond_br` (not needed for `scf.if` which has no `bb`
args).
3. `tt.func` is handled differently from all of the other ops - it is
not rewritten at all. Instead, for every `%arg: tt.ptr` arg, we insert
into the new body `%c0 = arith.constant 0 : i32` and `%newarg =
unrealized_cast(%arg, %c0) : tt.ptr` (manually, not done by the
conversion framework) and replace all uses of `%arg` by `%newarg`. These
are then unpacked to `(%arg, %c0)` using `replaceOpWithMultiple` so that
they "magically" appear in `adaptor.getOperands()`. Then at the end,
currently, these are the only unreconciled casts (because they are the
only ones **not** inserted by the conversion framework) and we
materialize them by just replacing uses of `%newarg` with `%arg`.
4. `scf.if` needs to be handled specially; since it has no operands but
can `yield` results, we need to rewrite it only after its `yield`s have
been rewritten. This is not straightforward because the dialect
conversion [does a preorder
walk](https://github.com/llvm/llvm-project/blob/6ab8401f53deddbd79f930ba2bec2f824c9567e3/mlir/lib/Transforms/Utils/DialectConversion.cpp#L2705).
To work around this we define legality for `scf.if` to be dependent on
whether its `yield`s have been rewritten (using two `UnitAttr`s on those
`yield`s). Thus, `scf.if` is "legal" and not rewritten until after the
results of the `yield`s are known.
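The fat-pointer representation in point 3 can be sketched as a toy in plain Python (illustrative names only): each `tt.ptr` argument becomes a `(base, offset)` pair with the offset seeded to 0, mirroring the inserted `%c0 = arith.constant 0 : i32`, and address arithmetic updates only the offset.

```python
def make_fat_ptr(base):
    # Mirrors pairing each ptr arg with a zero offset constant.
    return (base, 0)

def addptr(fat_ptr, delta):
    # Address arithmetic touches only the offset half of the pair.
    base, offset = fat_ptr
    return (base, offset + delta)

p = make_fat_ptr("%arg0")
p = addptr(addptr(p, 4), 8)
assert p == ("%arg0", 12)  # base untouched, offsets accumulate
```

In the real pass the pair lives in the IR as an `unrealized_cast(%baseptr, %offsetptr)` that the conversion framework unpacks through `adaptor.getOperands()`.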


[^1]: I haven't actually moved it out of the flag but it's now usable
with the `AMDGCN_USE_BUFFER_OPS` flag whereas it wasn't prior.
[^2]: In reality it's the conversion framework that materializes this
tuple as `unrealized_cast(%baseptr, %offsetptr)` and then
reconciles/DCEs all the casts automatically.
[^3]: The `unrealized_cast`s are completely "transparent" to the
patterns, see
[`ConversionPatternRewriterImpl::remapValues`](https://github.com/llvm/llvm-project/blob/399c3a78a2577c6fc68bba7f301901a0e66e87ed/mlir/lib/Transforms/Utils/DialectConversion.cpp#L1161).
This PR:

- Makes the AccelerateAMDMatmul pass emit FMA for the i8xi8->i32 and
fp16xfp16->fp32 cases
- Extends AMD FMA dot code generation with new v_dot instructions for
the fp16xfp16 and int8 dtypes
I had used `.ONESHELL` to allow `cd` to affect the other commands, but
it seems this also prevents the error status from propagating from
anything but the last command in a rule.
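The pitfall can be demonstrated outside Make (a small sketch using `subprocess`, assuming a POSIX `sh`): when several commands run in one shell invocation, as `.ONESHELL` recipes do, only the last command's exit status is reported unless `set -e` is used.

```python
import subprocess

# Without set -e, the earlier `false` failure is silently swallowed.
swallowed = subprocess.run(["sh", "-c", "false; true"])

# With set -e, the shell aborts at the first failing command.
strict = subprocess.run(["sh", "-c", "set -e; false; true"])

assert swallowed.returncode == 0  # failure lost: exit status of `true`
assert strict.returncode == 1     # set -e propagates the failure
```

Reverting `.ONESHELL` (or adding `set -e` to the recipe shell) restores failure propagation.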

e.g. see
triton-lang/triton#5673 (comment)
…n dotOp (#5686)

If the result of `convert_layout(trans(x), #dot_operand)` is not used by
`tt.dot`, skip the pattern match that generates `memdesc_trans`. Without
explicitly going through shared memory, it will be easier to pipeline
such cases for mxfp.
This allows remat to reuse existing values even if they're not dominated
by the remat'd value, if we know it's going to be a no-op conversion and
its users are dominated by the remat'd value. However, this is quite
sketchy and was not implemented correctly.

Disable it for now, although it causes some layouts to not be removed.
Enables the fp32 -> bf16 packed conversions for gfx950
The scheduler doesn't pipeline scalar loads, but if loads end up in
different stages for various other reasons, the pipeliner tries to
pipeline them and crashes. The pipeliner should look at the first use
of the load based on stages, not just whether any use is in a different
stage.
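The intended check can be sketched in a few lines of Python (toy stage lists, not the real pipeliner): the decision should hinge on the minimum stage among a load's users, i.e. its first use, rather than on whether any single use lands in a later stage.

```python
def first_use_stage(user_stages):
    # user_stages: pipeline stage of each op consuming the load's result.
    return min(user_stages)

def needs_pipelining(load_stage, user_stages):
    # Pipeline the load only if even its FIRST use is in a later stage.
    return first_use_stage(user_stages) > load_stage

# A use in the same stage as the load means it must not be pipelined,
# even though another use sits in a later stage.
assert not needs_pipelining(0, [0, 2])
assert needs_pipelining(0, [1, 2])
```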
`tl.sum` can have surprising behaviour when invoked on lower-precision
floating-point or low-bitwidth integer types; for example, `tl.sum(i1)`
actually computes an xor reduce.

To reduce this footgun, this PR adds a `dtype` argument to `tl.sum` (and
`tl.reduce`) which optionally casts the input to that dtype before
computing the operation. For `tl.sum`, the default dtype is set to
`tl.int32` for integers smaller than that and `tl.float32` for floats
smaller than that.
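The i1 footgun follows from modular arithmetic, as this plain-Python illustration shows: summing 1-bit values with a 1-bit accumulator wraps modulo 2, which is exactly an xor-reduce, while a widened accumulator gives the expected count.

```python
from functools import reduce
from operator import xor

bits = [1, 1, 1, 0]

one_bit_sum = sum(bits) % 2     # what an i1 accumulator computes
xor_reduce = reduce(xor, bits)  # 1 ^ 1 ^ 1 ^ 0
widened_sum = sum(bits)         # what a dtype=tl.int32 cast gives you

assert one_bit_sum == xor_reduce == 1
assert widened_sum == 3
```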
Corrected the descriptor kernel to subtile the epilogue optimally.

To get optimal subtiling in a persistent kernel, the tile index needs
to be recalculated in the epilogue to enable pipelining. This PR adds
this to the descriptor kernel, so that epilogue subtiling is chosen more
often after autotuning.

---------

Co-authored-by: dePaul Miller <[email protected]>
This change is aimed at improving platform independence.

---------

Signed-off-by: Anatoly Myachev <[email protected]>
@whitneywhtsang whitneywhtsang self-assigned this Feb 1, 2025
@whitneywhtsang whitneywhtsang marked this pull request as ready for review February 2, 2025 00:33
@whitneywhtsang whitneywhtsang merged commit b9ba137 into main Feb 2, 2025
5 checks passed
@whitneywhtsang whitneywhtsang deleted the whitneywhtsang/merge branch February 2, 2025 06:32
@whitneywhtsang whitneywhtsang changed the title Merge OpenAI Triton commit e2c09d7 Merge OpenAI Triton commit a637eb2 Feb 2, 2025