Merge OpenAI Triton commit a637eb2
#3330
Merged
Conversation
recommit triton-lang/triton#5671

Don't fold allow_reorder reshape ops. This preserves the reshape and, importantly, the allow_reorder bit, so that OptimizeThreadLocality knows it may choose a layout for the reduction when the src and dst shapes of the reshape are the same (illustrated in the sketch after this commit message).

Change CanonicalizeConvertFromReshape to run on efficient_layout. This allows removal of the convert_layout op between the reshape and the reduction.

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [x] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.
- Select one of the following.
  - [ ] I have not added any `lit` tests.
  - [x] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

Co-authored-by: aeng-openai <[email protected]>
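The allow_reorder case described above can be pictured with a minimal Triton sketch. This is my own illustration, not code from the PR, and it assumes the `can_reorder` keyword on `tl.reshape`:

```python
import triton
import triton.language as tl

@triton.jit
def row_sum_kernel(x_ptr, out_ptr, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    x = tl.load(x_ptr + offs_m[:, None] * BLOCK_N + offs_n[None, :])
    # Same src and dst shape, but can_reorder=True tells the compiler the
    # element order may change; keeping the reshape (rather than folding it
    # away) preserves that bit so OptimizeThreadLocality can pick a layout
    # that suits the reduction below.
    x = tl.reshape(x, (BLOCK_M, BLOCK_N), can_reorder=True)
    row_sum = tl.sum(x, axis=1)
    tl.store(out_ptr + offs_m, row_sum)
```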
This is a re-land of #5610 with a fix to DotOperandEncodingAttr to resolve a crash exposed in internal tests. At the same time, this slightly alters the implementation of the algorithm to ensure that layouts don't get pushed back out of conditionals by the forward propagation pass. Originally, the goal of hoisting across conditionals was to ensure that cvts in fused inner loops are placed inside the prologue before pipelining peels the prologue and introduces a chain of dependencies. My guess is that @lezcano's changes to strengthen layout propagation enabled the forward propagation pass to push the conversion back out of the prologue even after pipelining (which is good), but I explicitly disabled hoisting into chains of conditionals, so these didn't balance out. This PR alters the algorithm to consider hoisting cvts across chains of conditionals only for subslices inside for loops.
If wgmma uses operands in registers, it can't be pipelined (overlapping with the mma from the previous iteration), as we would end up reusing the same registers while the mma is still in flight. For such cases we don't need to increase the number of buffers used by the operands of such an mma. This change picks low-hanging fruit by not increasing the number of buffers of the register-based operand. We should go a step further and not increase numBuffers for any operand of such an mma.
Now `make test` runs the exact commands used in CI, so there should be no spurious local failures due to the specific `pytest` command used.
The `getAllowReorder` check was added in #2676, but the canonicalizations are value-preserving so this is not required. Specifically: - `reshape(splat) -> splat`, order is irrelevant for splat - `reshape(reshape) -> reshape`, reshape essentially treats the input as 1d, so the input reshape has no effect.
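A small NumPy analogy (my illustration, not from the PR) of why these folds are value-preserving regardless of element order:

```python
import numpy as np

# reshape(splat) -> splat: every element of a splat is identical, so any
# reordering of elements yields the same tensor.
splat = np.full((4, 8), 3.0)
assert np.array_equal(splat.reshape(32), np.full(32, 3.0))

# reshape(reshape(x)) -> reshape(x): reshape treats its input as 1-D, so the
# intermediate shape has no effect on the final result.
x = np.arange(24).reshape(2, 3, 4)
assert np.array_equal(x.reshape(6, 4).reshape(3, 8), x.reshape(3, 8))
```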
We show the difference between two profiles on the metric we specify
… independence (#5762) In our case, this allows us to successfully run `lit` tests on Windows. Signed-off-by: Anatoly Myachev <[email protected]>
The interpreter doesn't support the `_generator` special argument at the moment.
This was a temporary workaround during Blackwell development; LLVM now understands sm_100a and ptx86.
… size (#5658) Theoretical memory allocation size refers to the maximum total memory size of non-overlapping memory objects within the live range of a function.
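A toy reading of that definition (my illustration, not the pass's actual implementation): treat each memory object as a live range plus a size, and take the peak sum of sizes over all simultaneously live objects.

```python
def theoretical_alloc_size(buffers):
    # buffers: list of (start, end, size) with half-open live ranges [start, end).
    # Objects whose live ranges overlap cannot share memory, so the theoretical
    # allocation size is the maximum total size live at any program point.
    points = sorted({p for s, e, _ in buffers for p in (s, e)})
    return max(
        sum(size for s, e, size in buffers if s <= p < e)
        for p in points
    )

# The first two buffers overlap and must coexist (96 + 32), while the third
# starts after both die and can reuse their space.
assert theoretical_alloc_size([(0, 4, 96), (2, 6, 32), (6, 9, 64)]) == 128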
Otherwise, sometimes on macOS we may generate unnecessary files with a `_` prefix after unpacking a tar file.
Do not simply append the dot op next to the memory ops; doing so can reorder dependencies of the dot op. Fixes a test failure in test_matmul.
…outs and chains of ops (#5673) We generalise `HoistLayoutConversion` to lift a given `convert_layout dot_operand` above any chain of operations that do not require data movement. We could generalise this further in the future to lift it over other ops; we do this as a first step to keep the code somewhat similar to the previous one. Regarding the previous limitations of `canHoistDotOpEncV2`, I did a bit of archeology:
- The "don't hoist past select" restriction was added in triton-lang/triton#2857. I ran the repro, and with the recent layout fixes it now passes.
- The TruncOps being skipped comes from triton-lang/triton#2181. I think this is related to the hack that was removed in triton-lang/triton#5044, so it should work now.
- Same for `UIToFpOp`; this is now supported after #5044.
- The mixed-dtype hack is not necessary either, as everything now works as expected with the `convert_layout` rework.
We also add proper support for `isPure` for `elementwise_inline_asm` ops. On the location of the code, we just leave it in `RemoveLayoutConversion.cpp` to take advantage of the rather generic implementation of `rewriteSlice`. We could move this pass outside of `remove-layout-conversion`, as it's probably enough to run it once; this code will go through further changes in the near future, so we'll assess that then.
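For intuition, the pattern this targets at the source level is a chain of element-wise ops (casts, scaling, etc.) feeding a dot operand; the pass lifts the `#dot_operand` layout conversion above that whole chain. A rough Triton sketch of such a kernel (my illustration, not from the PR; block sizes must meet `tl.dot`'s minimum-size requirements):

```python
import triton
import triton.language as tl

@triton.jit
def scaled_dot_kernel(a_ptr, b_ptr, c_ptr, scale,
                      M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
    offs_m = tl.arange(0, M)
    offs_n = tl.arange(0, N)
    offs_k = tl.arange(0, K)
    a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])
    b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])
    # Element-wise chain between the load and the dot: scale, then downcast.
    # These ops require no data movement, so the dot-operand layout
    # conversion can be hoisted above the whole chain.
    a = (a * scale).to(tl.float16)
    acc = tl.dot(a, b.to(tl.float16))
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], acc)
```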
This PR fixes a typo in the Windows implementation of `__builtin_clz` that was introduced in #5621. According to [this in-code comment](https://github.com/triton-lang/triton/blob/b3dcc32f387d1d54ccd6cbbbc087296c0539e703/lib/Conversion/TritonGPUToLLVM/Utility.cpp#L12), these Windows implementations should have been copied from [this gist snippet](https://gist.github.com/pps83/3210a2f980fd02bb2ba2e5a1fc4a2ef0). In the snippet, however, the `clz` implementation additionally [XORs the result of `_BitScanReverse`](https://gist.github.com/pps83/3210a2f980fd02bb2ba2e5a1fc4a2ef0#file-ctz_clz-cpp-L51-L53) in order to convert the *most significant bit index* produced by `_BitScanReverse` into the expected *number of leading zeros*. I believe the implementation was copied to Triton without the finalizing XOR by accident. **What is affected by this error?** This implementation of CLZ is used in [`pext_i32`](https://github.com/intel/intel-xpu-backend-for-triton/blob/4a9967137548f8fe9b1a93383e4fd12646352231/lib/Conversion/TritonGPUToLLVM/Utility.cpp#L635), which is used in [`delinearize`](https://github.com/intel/intel-xpu-backend-for-triton/blob/4a9967137548f8fe9b1a93383e4fd12646352231/lib/Conversion/TritonGPUToLLVM/Utility.cpp#L662), which is used by the [`ReduceOpToLLVM`](https://github.com/intel/intel-xpu-backend-for-triton/blob/4a9967137548f8fe9b1a93383e4fd12646352231/lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp#L243-L247) pattern. This bug caused `tt.reduce()` ops to be lowered incorrectly on Windows in cases where shared memory is needed to store temporary reduced results. Signed-off-by: dchigarev <[email protected]>
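The identity the missing XOR relies on can be checked in plain Python (illustration only; the actual fix is in the C++ Windows shim):

```python
# For a 32-bit x, _BitScanReverse yields the index of the most significant set
# bit; the number of leading zeros is 31 - index, which equals 31 ^ index for
# any index in [0, 31] (no borrows occur when subtracting from 0b11111).
def clz32(x: int) -> int:
    assert 0 < x < 2**32
    msb_index = x.bit_length() - 1   # what _BitScanReverse reports
    return 31 ^ msb_index            # the finalizing XOR missing from the port

assert clz32(1) == 31
assert clz32(0x80000000) == 0
assert all(clz32(1 << i) == 31 - i for i in range(32))
```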
The bug caused the generated code to have a CUBIN array twice as large as the actual byte count.
…… (#5776) This reverts PR #5673. It broke the tests on A100, even though CI was green. The CI issue will be resolved by #5775.
Fixes `error: unknown warning group '-Wreserved-macro-identifier'`: this warning group is not defined in `clang-10`.
### TL;DR (too long, didn't review)

This PR re-enables the `tritonamdgpu-canonicalize-pointers` pass[^1]. The PR is effectively a complete rewrite, using the new [`1:N` dialect conversion framework](llvm/llvm-project#116470), of the original pass, which walked the AST and mutated IR in place. Recall that a "fat pointer" is a tuple-like `(%baseptr, %offsetptr)`; the current (original) pass keeps this tuple in a global data structure, while the new/rewritten pass emits the tuple into the IR as an `unrealized_cast(%baseptr, %offsetptr)`[^2]. Note that this PR also rewrites the existing lit test (see [this comment below](triton-lang/triton#5329 (comment))).

### Pass outline

The pass structure/action is roughly:

1. Perform an approximate sparse dataflow analysis to find all transitive uses of `tt.func` args that are `tt.ptr`s; legalize only these ops.
2. Rewrite all operations' `use`s and `result`s to be `(%baseptr, %offsetptr)` using `ConversionPattern`s that take the new `OneToNOpAdaptor`, which automatically forwards both `%baseptr` and `%offsetptr` through `adaptor.getOperands()`[^3].
3. Clean up remaining `unrealized_cast`s (currently only handling one category of such remaining casts, but this can be extended to handle all; see bullet 1 in TODOs).

### Some pre-emptive call-outs

Right up front I'll say this took a long time to figure out because **a)** the conversion framework is hugely complex and **b)** it is currently being rewritten to be more robust/stable. As a consequence, the implementation is complex, but I've tried hard to **a)** simplify as much as possible, **b)** comment/note subtleties, and **c)** put in ample `assert`s and checks to clarify intent and fail gracefully. Some things to call out:

1. I called the dataflow analysis approximate because it does not actually use [DataFlow/SparseAnalysis](https://github.com/llvm/llvm-project/blob/main/mlir/lib/Analysis/DataFlow/SparseAnalysis.cpp) and instead computes a forward slice using the heuristic "transfer function" that users of an op with a `tt.ptr` operand should be rewritten. This heuristic works because the forward slice starts from `tt.ptr` args on a `tt.func` and ends at `tt.store`, which has no results. Note, there's no reason why this component of the pass can't be a true `SparseAnalysis` implementation; it's just that this rewrite has already taken far longer than I expected (so I leave that for a possible follow-up).
2. The pass uses no global `TypeConverter` but uses local `TypeConverter`s in `BranchInterface`/`RegionInterface` patterns. This is because **a)** we are not actually converting operand/result types (we are converting the number of operands/results) and **b)** the conversion framework expects/handles this lack of a `TypeConverter` exactly [the way we want](https://github.com/llvm/llvm-project/blob/399c3a78a2577c6fc68bba7f301901a0e66e87ed/mlir/lib/Transforms/Utils/DialectConversion.cpp#L1179-L1185). The local type converters are used for ops that couple to basic blocks (`^bb`s) that need their signatures rewritten (i.e., the ops for which we need to do `rewriter.applySignatureConversion(block, *conversion, &localTypeConverter)`). That's `scf.for`, `scf.while`, `cf.br`, and `cf.cond_br` (not needed for `scf.if`, which has no `bb` args).
3. `tt.func` is handled differently from all of the other ops: it is not rewritten at all. Instead, for every `%arg: tt.ptr` arg, we insert into the new body `%c0 = arith.constant 0 : i32` and `%newarg = unrealized_cast(%arg, %c0) : tt.ptr` (manually, not via the conversion framework) and replace all uses of `%arg` with `%newarg`. These are then unpacked to `(%arg, %c0)` using `replaceOpWithMultiple` so that they "magically" appear in `adaptor.getOperands()`. At the end, these are currently the only unreconciled casts (because they are the only ones **not** inserted by the conversion framework), and we materialize them by simply replacing uses of `%newarg` with `%arg`.
4. `scf.if` needs to be handled specially; since it has no operands but can `yield` results, we need to rewrite it only after its `yield`s have been rewritten. This is not straightforward because the dialect conversion [does a preorder walk](https://github.com/llvm/llvm-project/blob/6ab8401f53deddbd79f930ba2bec2f824c9567e3/mlir/lib/Transforms/Utils/DialectConversion.cpp#L2705). To work around this, we define legality for `scf.if` to depend on whether its `yield`s have been rewritten (using two `UnitAttr`s on those `yield`s). Thus, `scf.if` is "legal" and not rewritten until after the results of the `yield`s are known.

[^1]: I haven't actually moved it out of the flag, but it's now usable with the `AMDGCN_USE_BUFFER_OPS` flag, whereas it wasn't before.
[^2]: In reality it's the conversion framework that materializes this tuple as `unrealized_cast(%baseptr, %offsetptr)` and then reconciles/DCEs all the casts automatically.
[^3]: The `unrealized_cast`s are completely "transparent" to the patterns; see [`ConversionPatternRewriterImpl::remapValues`](https://github.com/llvm/llvm-project/blob/399c3a78a2577c6fc68bba7f301901a0e66e87ed/mlir/lib/Transforms/Utils/DialectConversion.cpp#L1161).
This PR:
- Makes the AccelerateAMDMatmul pass emit FMA for the i8xi8->i32 and fp16xfp16->fp32 cases
- Extends AMD FMA dot code generation with new v_dot instructions for fp16xfp16 and int8 dtypes
I had used `.ONESHELL` to allow `cd` to affect the other commands, but it seems this also prevents the error status from propagating from anything but the last command in a rule; e.g. see triton-lang/triton#5673 (comment).
…n dotOp (#5686) If the result of `convert_layout(trans(x), #dot_operand)` is not used by `tt.dot`, skip pattern match that generates `memdesc_trans`. Without explicitly going through shared memory, it will be easier to pipeline such cases for mxfp.
This allows remat to reuse existing values even if they're not dominated by the remat'd value, if we know it's going to be a no-op conversion and its users are dominated by the remat'd value. However, this is quite sketchy and was not implemented correctly. Disable it for now, although it causes some layouts to not be removed.
Enables the fp32 -> bf16 packed conversions for gfx950
The scheduler doesn't pipeline scalar loads, but if loads end up in different stages for various other reasons, the pipeliner tries to pipeline them and crashes. The pipeliner should look at the first use of the load based on stages, not just whether any use is in a different stage.
`tl.sum` can have surprising behaviour when invoked on lower-precision floating-point or low-bitwidth integer types; for example, `tl.sum(i1)` actually computes an xor reduction. To reduce this footgun, this PR adds a `dtype` argument to `tl.sum` (and `tl.reduce`) which optionally casts the input to that dtype before computing the operation. For `tl.sum`, the default dtype is `tl.int32` for integers smaller than that and `tl.float32` for floats smaller than that.
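A usage sketch (mine, not from the PR; per the description the new keyword is `dtype`):

```python
import triton
import triton.language as tl

@triton.jit
def count_nonzero_kernel(x_ptr, out_ptr, N, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    x = tl.load(x_ptr + offs, mask=offs < N, other=0)
    nonzero = x != 0  # an i1 tensor
    # Previously, summing the i1 tensor directly reduced in 1-bit arithmetic
    # (an xor reduction). Requesting an accumulation dtype gives the intended
    # count; with the new tl.int32 default for small integers, the explicit
    # dtype below matches what you'd get anyway.
    count = tl.sum(nonzero, axis=0, dtype=tl.int32)
    tl.store(out_ptr, count)
```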
Corrected the descriptor kernel to subtile the epilogue optimally. To get optimal subtiling in a persistent kernel, the tile index needs to be recalculated in the epilogue to enable pipelining. This PR adds this to the descriptor kernel, so that epilogue subtiling is chosen more often after autotuning. --------- Co-authored-by: dePaul Miller <[email protected]>
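As a rough sketch of the idea (plain Python with illustrative names, mirroring the grouped tile ordering commonly used by persistent matmul kernels, not the PR's exact code): the epilogue recomputes the same tile-to-(pid_m, pid_n) mapping the prologue used, instead of carrying the indices across the pipelined main loop.

```python
def tile_index(tile_id, num_pid_m, num_pid_n, GROUP_M):
    # Grouped ordering: tiles are visited in GROUP_M-row super-groups to
    # improve L2 reuse. Both the prologue (loads) and the epilogue (stores)
    # can derive pid_m / pid_n from tile_id alone.
    num_pid_in_group = GROUP_M * num_pid_n
    group_id = tile_id // num_pid_in_group
    first_pid_m = group_id * GROUP_M
    group_size_m = min(num_pid_m - first_pid_m, GROUP_M)
    pid_m = first_pid_m + (tile_id % group_size_m)
    pid_n = (tile_id % num_pid_in_group) // group_size_m
    return pid_m, pid_n

# Recomputing tile_index(...) in the epilogue keeps the store-side address
# math independent of loop-carried values, which is what lets the compiler
# pipeline the loop and subtile the epilogue.
assert tile_index(0, num_pid_m=4, num_pid_n=4, GROUP_M=2) == (0, 0)
```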
This change is aimed at improving platform independence. --------- Signed-off-by: Anatoly Myachev <[email protected]>
…0)" (#5725)" This reverts commit d083ad3.
pbchekin approved these changes on Feb 2, 2025
whitneywhtsang changed the title from Merge OpenAI Triton commit e2c09d7 to Merge OpenAI Triton commit a637eb2 on Feb 2, 2025
This PR changes the Triton base from f47cc3e to a637eb2 (Feb 1).
Pass rate: 98.19%
Please do not squash and merge this PR.