
Merge OpenAI Triton commit a637eb2 #3330

Merged 34 commits into main on Feb 2, 2025
Conversation

whitneywhtsang
Contributor

@whitneywhtsang whitneywhtsang commented Feb 1, 2025

This PR changes the Triton base from f47cc3e to a637eb2 (Feb 1).
Pass rate: 98.19%

Please do not squash and merge this PR.

ThomasRaoux and others added 30 commits January 29, 2025 21:24
recommit triton-lang/triton#5671

Don't fold allow_reorder reshape ops. This preserves the reshape and,
importantly, the allow_reorder bit, so that OptimizeThreadLocality knows
it may choose a layout for the reduction in the case when the src and dst
shapes of the reshape are the same.

Change CanonicalizeConvertFromReshape to run on efficient_layout. This
allows removal of the convert layout op between the reshape and
reduction.


# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [x] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [ ] I have not added any `lit` tests.
  - [x] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
    including the "tests should be minimal" section. (Usually running Python code
    and using the instructions it generates is not minimal.)


Co-authored-by: aeng-openai <[email protected]>
This is a re-land of #5610 with a fix to DotOperandEncodingAttr to
resolve a crash exposed in internal tests.

At the same time, this slightly alters the implementation of the
algorithm to ensure that layouts don't get pushed back out of
conditionals by the forward propagation pass. Originally, the goal of
hoisting across conditionals was to ensure that cvts in fused inner
loops are placed inside the prologue before pipelining peels the
prologue and introduces a chain of dependencies.

My guess is that @lezcano's changes to strengthen layout propagation
enabled the forward propagation pass to push the conversion back out of
the prologue even after pipelining (which is good), but I explicitly
disabled hoisting into chains of conditionals, so these didn't balance
out. This PR alters the algorithm to consider hoisting cvts across
chains of conditionals only for subslices inside for loops.
If wgmma uses operands in registers, it can't be pipelined
(overlapping with the mma from the previous iteration), as we would end up
re-using the same registers while the mma is still in flight. In such cases
we don't need to increase the number of buffers used by the operands of
that mma.
This change picks a low-hanging fruit by not increasing the number of
buffers of the register-based operand. We should go a step further and
not increase numBuffers for any operand of such an mma.
Now `make test` runs the exact commands used in CI, so there should be
no spurious local failures due to the specific `pytest` command used.
The `getAllowReorder` check was added in #2676, but the
canonicalizations are value-preserving so this is not required.

Specifically:
- `reshape(splat) -> splat`: order is irrelevant for a splat
- `reshape(reshape) -> reshape`: reshape essentially treats the input as
1-D, so the inner reshape has no effect.
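The two canonicalizations above can be illustrated in plain Python (a minimal sketch with hypothetical `splat`/`reshape` helpers, not Triton's actual implementation), modeling a tensor as its flat element list plus a shape, which is exactly how reshape sees it:

```python
def splat(value, shape):
    # A splat is the same value in every element; shape only sets the count.
    n = 1
    for d in shape:
        n *= d
    return {"shape": shape, "flat": [value] * n}

def reshape(t, shape):
    # reshape treats the input as 1-D: the flat data is untouched.
    return {"shape": shape, "flat": t["flat"]}

# reshape(splat) -> splat: element order is irrelevant for a splat.
assert reshape(splat(7, (2, 3)), (3, 2)) == splat(7, (3, 2))

# reshape(reshape) -> reshape: the inner reshape leaves the flat data alone.
t = {"shape": (4,), "flat": [1, 2, 3, 4]}
assert reshape(reshape(t, (2, 2)), (1, 4)) == reshape(t, (1, 4))
```

Since both rewrites leave the flat data intact, they are value-preserving regardless of any `allow_reorder` flag.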
We show the difference between two profiles on the metric we specify.
… independence (#5762)

In our case, this allows us to successfully run `lit` tests on Windows.

Signed-off-by: Anatoly Myachev <[email protected]>
The interpreter doesn't support the `_generator` special argument at the
moment.
This was a temporary workaround during Blackwell development; LLVM now
understands sm_100a and ptx86.
… size (#5658)

Theoretical memory allocation size refers to the maximum total memory
size of non-overlapping memory objects within the live range of a
function.
Otherwise, sometimes on macOS we may generate unnecessary files with a
`_` prefix after unpacking a tar file.
Do not simply append the dot op next to the memory ops, as doing so can
reorder dependencies of the dot op.
Fixes a test failure in test_matmul.
…outs and chains of ops (#5673)

We generalise `HoistLayoutConversion` to lift a given `convert_layout
dot_operand`
above any chain of operations that do not require data movement. We
could totally generalise this in the future to lift it over other ops.
We do
this as a first step to keep the code somewhat similar to the previous
one.

Regarding the previous limitations of `canHoistDotOpEncV2`, I did a bit
of archeology:
- The "don't hoist past select" restriction was added in issue
triton-lang/triton#2857. I ran the repro, and
with the recent layout fixes it now passes.
- The TruncOps being skipped comes from
triton-lang/triton#2181. I think this is
related to the hack that was removed in
triton-lang/triton#5044, so it should now work.
- The same goes for `UIToFpOp`; this is now supported after #5044.
- The mixed-dtype hack is not necessary either, as now everything works as
expected with the `convert_layout` rework.

We also add proper support for `isPure` for `elementwise_inline_asm` ops.

On the location of the code, we just leave it in
`RemoveLayoutConversion.cpp` to
take advantage of the rather generic implementation of `rewriteSlice`.
We could totally
move this pass outside of `remove-layout-conversion`, as it's probably
enough to run
it once. This code will go through further changes in the near future,
so we'll assess this
then.
This PR fixes a typo in the Windows implementation of `__builtin_clz`
that was introduced in #5621.

According to [this in-code
comment](https://github.com/triton-lang/triton/blob/b3dcc32f387d1d54ccd6cbbbc087296c0539e703/lib/Conversion/TritonGPUToLLVM/Utility.cpp#L12)
these Windows implementations should have been copied from [this gist
snippet](https://gist.github.com/pps83/3210a2f980fd02bb2ba2e5a1fc4a2ef0).
In the snippet, however, the `clz` implementation additionally [XORs the
result of
`_BitScanReverse`](https://gist.github.com/pps83/3210a2f980fd02bb2ba2e5a1fc4a2ef0#file-ctz_clz-cpp-L51-L53)
in order to convert the result from the *most significant bit index*
produced by `_BitScanReverse` to the expected *number of leading
zeros*. I believe the implementation was copied to Triton without
the finalizing XOR by accident.
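The off-by-XOR can be sketched in plain Python (hypothetical function names; `int.bit_length` stands in for `_BitScanReverse`): for a nonzero 32-bit value with its most significant set bit at index `i`, the leading-zero count is `31 - i`, which equals `31 ^ i` since `i` fits in five bits.

```python
def bit_scan_reverse(x):
    # Models MSVC's _BitScanReverse: index of the most significant set bit.
    assert x != 0
    return x.bit_length() - 1

def clz32_buggy(x):
    # The buggy Windows shim: returns the MSB index directly.
    return bit_scan_reverse(x)

def clz32_fixed(x):
    # The gist's version: XOR with 31 converts MSB index -> leading zeros.
    return bit_scan_reverse(x) ^ 31

assert clz32_fixed(1) == 31           # MSB index 0 -> 31 leading zeros
assert clz32_fixed(0x80000000) == 0   # MSB index 31 -> 0 leading zeros
assert clz32_buggy(1) == 0            # wrong: should be 31
```

For values that are not exactly at the boundary the buggy and fixed results diverge everywhere except `x` with MSB index 15 or 16 regions, which is why the miscompiled `pext_i32`/`delinearize` chain only surfaced in some reductions.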

**What is affected by this error?**
This implementation of CLZ is used in
[`pext_i32`](https://github.com/intel/intel-xpu-backend-for-triton/blob/4a9967137548f8fe9b1a93383e4fd12646352231/lib/Conversion/TritonGPUToLLVM/Utility.cpp#L635)
that is used in
[`delinearize`](https://github.com/intel/intel-xpu-backend-for-triton/blob/4a9967137548f8fe9b1a93383e4fd12646352231/lib/Conversion/TritonGPUToLLVM/Utility.cpp#L662)
that is used by
[`ReduceOpToLLVM`](https://github.com/intel/intel-xpu-backend-for-triton/blob/4a9967137548f8fe9b1a93383e4fd12646352231/lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp#L243-L247)
pattern. This bug caused `tt.reduce()` ops to be incorrectly lowered on
Windows in cases where shared memory is needed to store temporary
reduced results.

Signed-off-by: dchigarev <[email protected]>
The bug was causing the generated code to have CUBIN array that is twice
as large as the actual byte count.
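A hedged sketch of the kind of off-by-2x described above (the names are illustrative, not the actual Triton code): when a binary is embedded as a hex string, using the string length instead of the byte count doubles the reported array size, since each byte takes two hex characters.

```python
# Embed a binary blob the way generated code often does: as hex text.
cubin = bytes([0xDE, 0xAD, 0xBE, 0xEF])
hex_str = cubin.hex()           # "deadbeef" - two chars per byte

buggy_size = len(hex_str)       # 8: twice the actual byte count
fixed_size = len(hex_str) // 2  # 4: the real number of bytes

assert fixed_size == len(cubin)
assert buggy_size == 2 * len(cubin)
```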
…… (#5776)

This reverts PR #5673

This broke the tests on A100, even though CI was green. The CI issue
will be resolved by #5775
The flag `-Wreserved-macro-identifier` is not defined in `clang-10`,
which fails with `error: unknown warning group '-Wreserved-macro-identifier'`.
### TL;DR (too long, didn't review)

This PR re-enables the `tritonamdgpu-canonicalize-pointers` pass[^1].
It is effectively a complete rewrite, using the new [`1:N` dialect
conversion framework](llvm/llvm-project#116470), of the original pass,
which walked the AST and mutated the IR in place.
Recall a "fat pointer" is a tuple-like `(%baseptr, %offsetptr)` - the
current (original) pass keeps this tuple in a global data structure
while the new/rewritten pass emits this tuple into the IR as an
`unrealized_cast(%baseptr, %offsetptr)`[^2].

Note, this PR also rewrites the existing lit test (see [this comment
below](triton-lang/triton#5329 (comment))).

### Pass outline

The pass structure/action is roughly:

1. Perform an approximate sparse dataflow analysis to find all
transitive uses for `tt.func` args that are `tt.ptr`s; legalize only
these ops;
2. Rewrite all operations' `use`s and `result`s to be `(%baseptr,
%offsetptr)` using `ConversionPattern`s that take the new
`OneToNOpAdaptor`, which automatically forwards both `%baseptr` and
`%offsetptr` through `adaptor.getOperands()`[^3];
3. Clean up remaining `unrealized_casts` (currently only handling one
category of such remaining casts but can be extended to handle all; see
bullet 1 in TODOs).
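The approximate dataflow analysis in step 1 amounts to a forward slice: a worklist sketch in plain Python (toy value/user maps, not Triton's actual IR) of marking every transitive user of pointer-typed function args for rewriting:

```python
from collections import deque

def forward_slice(ptr_args, users):
    # users: maps each value to the values defined by ops that consume it.
    to_rewrite, work = set(), deque(ptr_args)
    while work:
        v = work.popleft()
        for u in users.get(v, []):   # heuristic "transfer function":
            if u not in to_rewrite:  # any user of a ptr-carrying value
                to_rewrite.add(u)    # must itself be rewritten
                work.append(u)
    return to_rewrite

# %arg -> %off = addptr %arg -> %val = load %off -> store (no results)
users = {"%arg": ["%off"], "%off": ["%val"], "%val": ["store"]}
assert forward_slice(["%arg"], users) == {"%off", "%val", "store"}
```

The slice terminates because it starts at `tt.ptr` args on a `tt.func` and ends at `tt.store`, which produces no results, matching the description in point 1 below.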

### Some pre-emptive call-outs

Right up front I'll say this took a long time to figure out because
**a)** the conversion framework is hugely complex and **b)** it's
currently being rewritten to be more robust/stable. As a consequence, the
implementation is complex, but I've tried hard to **a)** simplify as much
as possible, **b)** comment/note subtleties, and **c)** put in ample `assert`s
and checks to clarify intent and fail gracefully. So some things to call
out:

1. I called the dataflow analysis approximate because it does not
actually use
[DataFlow/SparseAnalysis](https://github.com/llvm/llvm-project/blob/main/mlir/lib/Analysis/DataFlow/SparseAnalysis.cpp)
and instead computes a forward slice using the heuristic "transfer
function" that users of an op with a `tt.ptr` operand should be
rewritten. This heuristic works because the forward slice starts from
`tt.ptr` args on a `tt.func` and ends at `tt.store`, which has no
results. Note, there's no reason why this component of the pass can't be
a true `SparseAnalysis` implementation, it's just that this rewrite has
already taken way longer than I expected (so I leave that for a possible
follow-up).
2. The pass uses no global `TypeConverter` but uses local
`TypeConverter`s, in `BranchInterface`/`RegionInterface` patterns. This
is because **a)** we are not actually converting operand/result types
(we are converting number of operands/results) **b)** the conversion
framework expects/handles this lack of a `TypeConverter` exactly [the
way we
want](https://github.com/llvm/llvm-project/blob/399c3a78a2577c6fc68bba7f301901a0e66e87ed/mlir/lib/Transforms/Utils/DialectConversion.cpp#L1179-L1185).
The local type converters are used for ops that couple to basic blocks
(`^bb`s) that need to have their signatures rewritten (i.e., the ops for
which we need to do `rewriter.applySignatureConversion(block,
*conversion, &localTypeConverter)`). That's `scf.for`, `scf.while`,
`cf.br` and `cf.cond_br` (not needed for `scf.if` which has no `bb`
args).
3. `tt.func` is handled differently from all of the other ops - it is
not rewritten at all. Instead, for every `%arg: tt.ptr` arg, we insert
into the new body `%c0 = arith.constant 0 : i32` and `%newarg =
unrealized_cast(%arg, %c0) : tt.ptr` (manually, not done by the
conversion framework) and replace all uses of `%arg` by `%newarg`. These
are then unpacked to `(%arg, %c0)` using `replaceOpWithMultiple` so that
they "magically" appear in `adaptor.getOperands()`. Then at the end,
currently, these are the only unreconciled casts (because they are the
only ones **not** inserted by the conversion framework) and we
materialize them by just replacing uses of `%newarg` with `%arg`.
4. `scf.if` needs to be handled specially; since it has no operands but
can `yield` results, we need to rewrite it only after its `yield`s have
been rewritten. This is not straightforward because the dialect
conversion [does a preorder
walk](https://github.com/llvm/llvm-project/blob/6ab8401f53deddbd79f930ba2bec2f824c9567e3/mlir/lib/Transforms/Utils/DialectConversion.cpp#L2705).
To work around this we define legality for `scf.if` to be dependent on
whether its `yield`s have been rewritten (using two `UnitAttr`s on those
`yield`s). Thus, `scf.if` is "legal" and not rewritten until after the
results of the `yield`s are known.
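The fat-pointer representation in point 3 can be sketched as a toy in plain Python (illustrative names only): each `tt.ptr` argument becomes a `(base, offset)` pair with the offset seeded to 0, mirroring the inserted `%c0 = arith.constant 0 : i32`, and address arithmetic updates only the offset.

```python
def make_fat_ptr(base):
    # Mirrors pairing each ptr arg with a zero offset constant.
    return (base, 0)

def addptr(fat_ptr, delta):
    # Address arithmetic touches only the offset half of the pair.
    base, offset = fat_ptr
    return (base, offset + delta)

p = make_fat_ptr("%arg0")
p = addptr(addptr(p, 4), 8)
assert p == ("%arg0", 12)  # base untouched, offsets accumulate
```

In the real pass the pair lives in the IR as an `unrealized_cast(%baseptr, %offsetptr)` that the conversion framework unpacks through `adaptor.getOperands()`.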


[^1]: I haven't actually moved it out of the flag but it's now usable
with the `AMDGCN_USE_BUFFER_OPS` flag whereas it wasn't prior.
[^2]: In reality it's the conversion framework that materializes this
tuple as `unrealized_cast(%baseptr, %offsetptr)` and then
reconciles/DCEs all the casts automatically.
[^3]: The `unrealized_cast`s are completely "transparent" to the
patterns, see
[`ConversionPatternRewriterImpl::remapValues`](https://github.com/llvm/llvm-project/blob/399c3a78a2577c6fc68bba7f301901a0e66e87ed/mlir/lib/Transforms/Utils/DialectConversion.cpp#L1161).
This PR:

- Makes the AccelerateAMDMatmul pass emit FMA for the i8xi8->i32 and
fp16xfp16->fp32 cases
- Extends AMD FMA dot code generation with new v_dot instructions for
the fp16xfp16 and int8 dtypes
I had used `.ONESHELL` to allow `cd` to affect the other commands, but
it seems this also prevents the error status from propagating from
anything but the last command in a rule.
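The pitfall can be demonstrated outside Make (a small sketch using `subprocess`, assuming a POSIX `sh`): when several commands run in one shell invocation, as `.ONESHELL` recipes do, only the last command's exit status is reported unless `set -e` is used.

```python
import subprocess

# Without set -e, the earlier `false` failure is silently swallowed.
swallowed = subprocess.run(["sh", "-c", "false; true"])

# With set -e, the shell aborts at the first failing command.
strict = subprocess.run(["sh", "-c", "set -e; false; true"])

assert swallowed.returncode == 0  # failure lost: exit status of `true`
assert strict.returncode == 1     # set -e propagates the failure
```

Reverting `.ONESHELL` (or adding `set -e` to the recipe shell) restores failure propagation.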

e.g. see
triton-lang/triton#5673 (comment)
…n dotOp (#5686)

If the result of `convert_layout(trans(x), #dot_operand)` is not used by
`tt.dot`, skip the pattern match that generates `memdesc_trans`. Without
explicitly going through shared memory, it will be easier to pipeline
such cases for mxfp.
This allows remat to reuse existing values even if they're not dominated
by the remat'd value, if we know it's going to be a no-op conversion and
its users are dominated by the remat'd value. However, this is quite
sketchy and was not implemented correctly.

Disable it for now, although it causes some layouts to not be removed.
Enables the fp32 -> bf16 packed conversions for gfx950
The scheduler doesn't pipeline scalar loads, but if loads end up in
different stages for various other reasons, the pipeliner tries to
pipeline them and crashes. The pipeliner should look at the first use
of the load based on stages, not just whether any use is in a different
stage.
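The intended check can be sketched in a few lines of Python (toy stage lists, not the real pipeliner): the decision should hinge on the minimum stage among a load's users, i.e. its first use, rather than on whether any single use lands in a later stage.

```python
def first_use_stage(user_stages):
    # user_stages: pipeline stage of each op consuming the load's result.
    return min(user_stages)

def needs_pipelining(load_stage, user_stages):
    # Pipeline the load only if even its FIRST use is in a later stage.
    return first_use_stage(user_stages) > load_stage

# A use in the same stage as the load means it must not be pipelined,
# even though another use sits in a later stage.
assert not needs_pipelining(0, [0, 2])
assert needs_pipelining(0, [1, 2])
```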
`tl.sum` can have surprising behaviour when invoked on lower-precision
floating-point or low-bitwidth integer types; for example, `tl.sum(i1)`
actually computes an xor reduce.

To reduce this footgun, this PR adds a `dtype` argument to `tl.sum` (and
`tl.reduce`) which optionally casts the input to that dtype before
computing the operation. For `tl.sum`, the default dtype is set to
`tl.int32` for integers smaller than that and `tl.float32` for floats
smaller than that.
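The i1 footgun follows from modular arithmetic, as this plain-Python illustration shows: summing 1-bit values with a 1-bit accumulator wraps modulo 2, which is exactly an xor-reduce, while a widened accumulator gives the expected count.

```python
from functools import reduce
from operator import xor

bits = [1, 1, 1, 0]

one_bit_sum = sum(bits) % 2     # what an i1 accumulator computes
xor_reduce = reduce(xor, bits)  # 1 ^ 1 ^ 1 ^ 0
widened_sum = sum(bits)         # what a dtype=tl.int32 cast gives you

assert one_bit_sum == xor_reduce == 1
assert widened_sum == 3
```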
Corrected the descriptor kernel to subtile the epilogue optimally.

To get optimal subtiling in a persistent kernel, the tile index needs
to be recalculated in the epilogue to enable pipelining. This PR adds
this to the descriptor kernel, so that epilogue subtiling is chosen more
often after autotuning.

---------

Co-authored-by: dePaul Miller <[email protected]>
This change is aimed at improving platform independence.

---------

Signed-off-by: Anatoly Myachev <[email protected]>
@whitneywhtsang whitneywhtsang self-assigned this Feb 1, 2025
@whitneywhtsang whitneywhtsang marked this pull request as ready for review February 2, 2025 00:33
@whitneywhtsang whitneywhtsang merged commit b9ba137 into main Feb 2, 2025
5 checks passed
@whitneywhtsang whitneywhtsang deleted the whitneywhtsang/merge branch February 2, 2025 06:32
@whitneywhtsang whitneywhtsang changed the title Merge OpenAI Triton commit e2c09d7 Merge OpenAI Triton commit a637eb2 Feb 2, 2025