[Blackwell] Enable MMA pipelining for scaled dot when TMEM copy is used (#5812)

This PR enables MMA pipelining for scaled dot.

The main difficulty this PR overcomes is the dependency cycle between
TMEM copy rewriting and SWP: currently, TMEM copy rewriting relies on
SWP to put the loading of scales into SMEM, while applying MMA pipelining
during SWP requires TMEM copy rewriting to have happened beforehand. I
propose to break the cycle by having the loading of scales go through
`local_alloc` and `local_load` in `AccelerateMatmul`. This way, TMEM
copy rewriting happens during [the first call to
OptimizeDotOperands](https://github.com/triton-lang/triton/blob/1e0e51c4aeb3e1beea000da5d0e494f8b9ac40dd/third_party/nvidia/backend/compiler.py#L260),
before SWP, and the local alloc and load added in `AccelerateMatmul` are
eliminated during SWP. Adding a local alloc for scales there is a bit ad
hoc, since scales do not need to be in SMEM, but other solutions, such as
decoupling MMA pipelining from SWP, are more difficult.

The other changes in this PR make SWP recognize the loading of scales
when there is a TMEM copy between the scale load and the MMA.
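For context, here is a minimal sketch of the kind of scaled-dot loop this pipelining targets. The kernel, the pointer arithmetic, and the exact `tl.dot_scaled` argument order are illustrative assumptions, not code from this PR:

```python
import triton
import triton.language as tl

@triton.jit
def scaled_dot_kernel(a_ptr, a_scale_ptr, b_ptr, b_scale_ptr, c_ptr,
                      M, N, K,
                      BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                      BLOCK_K: tl.constexpr):
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    offs_ks = tl.arange(0, BLOCK_K // 32)      # one scale per 32 elements of K
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK_K)):
        a = tl.load(a_ptr + offs_m[:, None] * K + (k * BLOCK_K + offs_k)[None, :])
        b = tl.load(b_ptr + (k * BLOCK_K + offs_k)[:, None] * N + offs_n[None, :])
        # These scale loads are what SWP must now recognize and pipeline, even
        # though a TMEM copy sits between them and the MMA.
        a_scale = tl.load(a_scale_ptr + offs_m[:, None] * (K // 32)
                          + (k * (BLOCK_K // 32) + offs_ks)[None, :])
        b_scale = tl.load(b_scale_ptr + offs_n[:, None] * (K // 32)
                          + (k * (BLOCK_K // 32) + offs_ks)[None, :])
        acc = tl.dot_scaled(a, a_scale, "e4m3", b, b_scale, "e4m3", acc)
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], acc.to(tl.float16))
```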

@ThomasRaoux @pawelszczerbuk @csullivan @mbrookhart @binarybana

---------

Co-authored-by: Masahiro Masuda <[email protected]>
Co-authored-by: Jason Knight <[email protected]>

Fix default num_stages values mismatch between Python frontend and MLIR. (#5804)

Address the discrepancy between the default value of num_stages in the
Python frontend and the MLIR implementation. Previously, the values were
not aligned, which could cause inconsistencies in behavior across Python
Frontend and triton-opt.

This PR updates the default value of `num_stages` in the Python frontend
to match the value used by the MLIR passes:
- [TritonGPUPipeline](https://github.com/triton-lang/triton/blob/032fa41a45847cdc00119ed3bdd5bc0adab9c938/include/triton/Dialect/TritonGPU/Transforms/Passes.td#L21)
- [TritonGPUTestPipelineAssignLatencies](https://github.com/triton-lang/triton/blob/032fa41a45847cdc00119ed3bdd5bc0adab9c938/include/triton/Dialect/TritonGPU/Transforms/Passes.td#L58)
- [TritonGPULoopScheduling](https://github.com/triton-lang/triton/blob/032fa41a45847cdc00119ed3bdd5bc0adab9c938/include/triton/Dialect/TritonGPU/Transforms/Passes.td#L266)
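For illustration, a minimal sketch of how the default interacts with an explicit override at launch time (the kernel and launch parameters here are hypothetical):

```python
import triton
import triton.language as tl

@triton.jit
def copy_kernel(src_ptr, dst_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(dst_ptr + offs, tl.load(src_ptr + offs, mask=mask), mask=mask)

# When num_stages is not given, the frontend default is forwarded to the
# software pipeliner; this PR aligns that default with the MLIR passes above.
# copy_kernel[(grid,)](src, dst, n, BLOCK=1024)                 # frontend default
# copy_kernel[(grid,)](src, dst, n, BLOCK=1024, num_stages=3)   # explicit value
```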


Co-authored-by: pawelszczerbuk <[email protected]>

Bump actions/checkout from 3 to 4 (#5714)

Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to
4.

[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=actions/checkout&package-manager=github_actions&previous-version=3&new-version=4)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.


---


Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[mlir][dialect] Refactor DotLike trait into a DotOpInterface + Enable verification of scaled_dot (#5796)

The Triton MLIR dialect presently has a `DotLike` trait which dot ops
carry.

The problem with this trait is that, as currently implemented, it prevents
`scaled_dot` from being verified properly (dimensions are not checked at
the moment: "TODO: enable back with an interface to support scaled dot.").

This PR refactors the "DotLike" trait into an interface which implements
a "verifyDims" function that checks if the dims for the A and B operands
are compatible (e.g., something like MxK1 and K2xN; k1==k2; in the
simple case).

The initial implementation of `DotOpInterface` is similar to the prior
`DotLike` trait, except that it includes the `verifyDims` function which
all dot ops must implement; this function just checks whether the
dimensions of the A and B inputs match.
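As a rough sketch, the dimension check behaves roughly like the following Python pseudocode (this mirrors the intent of the C++ verifier, not its exact implementation):

```python
def verify_dot_dims(a_shape, b_shape, c_shape) -> bool:
    """Check A (MxK or BxMxK) against B (KxN or BxKxN) and the result C."""
    rank = len(a_shape)
    if rank not in (2, 3) or len(b_shape) != rank or len(c_shape) != rank:
        return False
    if rank == 3 and not (a_shape[0] == b_shape[0] == c_shape[0]):
        return False                     # batch dimensions must agree
    m, k1 = a_shape[-2], a_shape[-1]
    k2, n = b_shape[-2], b_shape[-1]
    # tt.dot requires K1 == K2; tt.dot_scaled can adjust the comparison for
    # packed low-precision formats (e.g. fp4 packs two values per element).
    return k1 == k2 and c_shape[-2] == m and c_shape[-1] == n
```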

In the future this interface can be extended to include more
functionality.

Since this enables the verifier for `scaled_dot`, I think the existing
scaled-dot tests should cover the changes made in this PR; if that turns
out to be wrong, I will add additional tests.


[ANALYSIS] Fixing overflow in AxisInfo.cpp. (#5821)

UBSan detected that the shift is performed on an int32 variable, while the
shift amount can be larger than the width of int32.
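A rough illustration of the failure mode, as a Python model of a fixed-width shift (the actual code is C++; the names here are illustrative):

```python
def shl(value: int, amount: int, width: int = 32) -> int:
    """Model a fixed-width left shift. In C++, shifting an int32 by 32 or
    more bits is undefined behaviour, which is what UBSan flagged."""
    assert 0 <= amount < width, f"shift amount {amount} exceeds operand width {width}"
    return (value << amount) & ((1 << width) - 1)

# shl(1, 40)             -> assertion error: invalid on a 32-bit operand
# shl(1, 40, width=64)   -> 1099511627776: widening the operand makes it valid
```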

I believe the change is too trivial to require a test, but please advise
if you think there is an appropriate way to capture this.

[Blackwell][Clean up] Remove use of SharedMemoryObject on TMEM (#5817)

Using `getSharedMemoryObjectFromStruct` and related helpers on TMEM is very
confusing for new readers, so I'm introducing simpler TMEM-specific
alternatives. In practice, we only need the base address of TMEM.

@ThomasRaoux

---------

Co-authored-by: Masahiro Masuda <[email protected]>

[AMD] Support lowering GPU async copy/commit/wait ops (#5729)

Support lowering of `ttg.async_copy_global_to_local`, together with the
associated `ttg.async_commit_group` and `ttg.async_wait`, for `gfx9` GPUs.
The lowering checks that the resulting writes are coalesced, which the
hardware requires.

Note that we are currently not emitting `AsyncCopyGlobalToLocal` for
AMD targets; this will come in a follow-up PR.

[AMD] Remove duplicate definition of use_buffer_ops (#5828)

Removes a duplicate definition of `use_buffer_ops` when loading the
`AMDGCN_USE_BUFFER_OPS` environment variable, likely introduced by a merge
conflict.

Co-authored-by: Nick Riasanovsky <[email protected]>

[PIPELINE] Relax requirements for wgmma operand register pipelining (#5810)

Always pipeline wgmma operands in registers if they are not suitable for
shmem pipelining.
This introduces a performance regression on attention workloads, but
fixes a crash introduced by
#5798.

[Blackwell][TUTORIALS] Add tutorial 10-block-scaled-matmul.py (#5813)

This tutorial demonstrates Triton support for block-scaled matrix multiply
on Blackwell's 5th-generation tensor core with low-precision FP4 and FP8
datatypes. Planned follow-ups include optimized TMA loads for block scale
factors, and mixed-precision support.

Additional changes:
* Moves the MX dtype helper classes to `triton/tools/mxfp.py` for use in
tutorials as well as test code.

@ThomasRaoux @pawelszczerbuk @masahi @mbrookhart @binarybana

Revert "[AMD] Use more efficient fp32 to bf16 type conversion (#5633)" (#5829)

This reverts commit 1c28e08 because the inline assembly was causing issues
for the LLVM backend.

[TensorDescriptor] Support ndim > 2 loads and stores (#5830)

This adds support for 3-, 4-, and 5-dimensional tensor descriptors, with
tests that loads and stores work correctly. It was surprisingly easy, as
most of the code was already written generically; only a few changes were
needed in the TMA creation lowering, plus a bug fix for the LL conversion
of the NVMMA encoding with more than 2 dims.
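As a sketch of what this enables, a hypothetical 3D block copy through descriptors (the `tl._experimental_make_tensor_descriptor` name and its exact signature are assumptions based on the experimental descriptor API, not taken from this PR):

```python
import triton
import triton.language as tl

@triton.jit
def copy3d_kernel(src_ptr, dst_ptr, D0, D1, D2,
                  B0: tl.constexpr, B1: tl.constexpr, B2: tl.constexpr):
    # Descriptors may now carry more than two dimensions.
    src = tl._experimental_make_tensor_descriptor(
        src_ptr, shape=[D0, D1, D2], strides=[D1 * D2, D2, 1],
        block_shape=[B0, B1, B2])
    dst = tl._experimental_make_tensor_descriptor(
        dst_ptr, shape=[D0, D1, D2], strides=[D1 * D2, D2, 1],
        block_shape=[B0, B1, B2])
    i = tl.program_id(0) * B0
    block = src.load([i, 0, 0])    # 3D block load
    dst.store([i, 0, 0], block)    # 3D block store
```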

[AMD] Avoid unneeded instruction schedule pass runs (#5832)

We only want to run them if an active variant is requested.

[PIPELINE] Including mmav5 dot in shmem pipelining (#5835)

This fixes a bug introduced in "[PIPELINE] Relax requirements for wgmma
operand register pipelining" (#5810): mmav5 dot support was missing.
Thanks to @peterbell10 for catching that and to @masahi for proposing a
fix.

Disable dot transpose optimization due to perf regressions (#5834)

There are cases where transposing the dot op may cause performance
problems. For example, if the accumulator is being used in something
like:
```
 %206 = ttg.convert_layout %205 : tensor<128x64xf16, #mma> -> tensor<128x64xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> loc(#loc216)
```
transposing it makes the conversion expensive. This is likely to happen
in attention kernels.

[AMD] Attach variant to the scheduling hint op (#5808)

This PR refactors the implementation of the instruction scheduling
infrastructure. A particular "sched" variant becomes part of the
instruction and gets added during the insertion pass. The instruction
carries this meta-information across many passes, allowing the same
mechanism to be reused in other places.

doc: update readme with dev container info (#5819)

Add a reference to https://github.com/redhat-et/triton-dev-containers,
which contains a set of development containers for Triton. Development
containers have the following key benefits:

- **Consistency**: All developers can work with the same development
environment, ensuring uniform behavior across different systems.
- **Isolation**: The container prevents potential conflicts with
software installed on your local machine.
- **Portability**: Easily share the development environment with team
members, minimizing onboarding time and setup issues.


Signed-off-by: Maryam Tahhan <[email protected]>

[NFC] Finish clean-up of #5834 (#5837)

That partial revert left a couple of auxiliary functions that are not
currently needed.

[AMD] Avoiding undefined behaviour due to llvm_unreachable (#5822)

`llvm_unreachable` is intended to always abort, but sometimes does
something else. For example, this test sometimes deadlocks on ARM,
resulting in increased flakiness.

The same pattern is already used elsewhere in the file, so this change
makes them consistent.

[AMD] Enable block pingpong for smaller tiles (#5820)

Recent experiments found that block pingpong also helps a few more
configs, especially smaller tiles. Enable one-cluster pingpong for tiles
that are 4x smaller.

[Pipeliner] Fix crash in rewriting TMA descriptor updates (#5843)

Lots of our code assumes that `scf.if` has a non-empty else region, but
sometimes it can be empty, which typically happens due to one of the
`scf.if` canonicalizers. Just make sure to create `scf.if` with
non-empty regions.

This was split off from #5726
since others were hitting the crash.

[Pipeliner] Fix mmav3 pipelining (#5844)

Make sure we allocate the right number of slices when doing mmav3
pipelining.

[BACKEND] bump to llvm/llvm-project@ffe3129e9bdc (#5814)

Pulls in llvm/llvm-project#125268 which is
necessary for #5563.

[AMD][BACKEND] Enable cache ctrl bits for gfx950 (#5838)

Enables cache control bits based on Triton's `cacheModifier` on gfx950.
They are identical to those on `gfx942`.

[DOC] Fix generating docs for tutorials (#5850)

Do not `exit` when generating tutorials, which would cause the whole
documentation pipeline to exit.

[Tests] Re-enable tests and clean up test_matmul (#5853)

Reduce unnecessary configs, re-enable a test, and simplify `simple_matmul`.

[Blackwell][Clean up] Introduce interface for MMAv5 ops (#5848)

The goal is to let `TCGen5MMAOp` and `TCGen5MMAScaledOp` share an
interface so that the rest of the code can work with them generically.
The MMA pipelining pass gets cleaned up a lot, and the accum init flag
optimization is now automatically enabled for `TCGen5MMAScaledOp` as
well.

---------

Co-authored-by: Masahiro Masuda <[email protected]>

[NFC] Remove custom backend callback and move it to TargetInfo (#5854)

[LAYOUTS] Implement generically getElemsPerThread (#5841)

While doing so, we remove the SliceEncodingAttr hack!

[Pipeliner] Enable automatic loop fusion (#5726)

This PR turns on automatic loop fusion in the pass pipelines for CUDA
compute capability >= 8.0. Automatic loop fusion is only enabled for
simple loop nests (1 outer loop, 1 inner loop), and only when the user
requests fusion with `tl.range(..., fuse=True)` in the frontend, as in
the sketch below.
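A minimal sketch of such a loop nest (the kernel body is simplified and masking is omitted; only the `fuse=True` request itself comes from this PR):

```python
import triton
import triton.language as tl

@triton.jit
def persistent_matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                             BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                             BLOCK_K: tl.constexpr, NUM_SMS: tl.constexpr):
    start = tl.program_id(0)
    num_tiles = tl.cdiv(M, BLOCK_M) * tl.cdiv(N, BLOCK_N)
    # One outer loop over output tiles; fuse=True asks the compiler to fuse
    # the nest so that the inner K-loop can be software-pipelined.
    for tile in tl.range(start, num_tiles, NUM_SMS, fuse=True):
        pid_m = tile // tl.cdiv(N, BLOCK_N)
        pid_n = tile % tl.cdiv(N, BLOCK_N)
        offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
        offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
        acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
        for k in range(0, K, BLOCK_K):            # one inner loop
            offs_k = k + tl.arange(0, BLOCK_K)
            a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])
            b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])
            acc = tl.dot(a, b, acc)
        c = acc.to(tl.float16)
        tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], c)
```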

This PR also rewrites the persistent matmul examples to use loop nests.
This is cleaner, but will also enable more powerful and flexible
optimizations of loop nests in the future. Primarily, it hides the
brittleness of the pipeliner behind a single layer inside the compiler,
so ideally the brittleness needs to be dealt with only once and stays
hidden from users.

To achieve this, several things have been added to loop fusion:

1. To avoid generating the inner loop inside a conditional, loop nest
fusion will "speculate" the length of the inner loop, essentially
generating a branch where the inner loop is missing and one where the
inner loop is always known to execute at least once.
2. Codegen of the loop induction variables has been slightly altered to
better match the expectations of the scheduler, pipeliner(s), and
`optimize-accumulator-init`.
3. Codegen of loop iter args has been altered to generate fewer SSA
dependencies between the prologue, inner loop, and epilogue, making it
more likely for pipelining to be successful. E.g., inner loop iter args
that can be initialized outside the loop and reset in the epilogue are
handled that way, rather than in the prologue.

Some other things in this PR:

* Fixed a bug in the pipeline expander
* Added AxisInfo implementation for `ub::PoisonOp`

I verified the performance of the rewritten persistent matmul kernels on
H100 and Blackwell.

Performance of `09-persistent-matmul.py` on H100.

Before (2 runs)

```
root@dev-0:~/code/triton$ python python/tutorials/09-persistent-matmul.py
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
273.146 4025.362 ROOT
├─ nan 0.031 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.027 _ZN2at6native54_GLOBAL__N__a236ace4_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfLm4EPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SO_SH_EEvSJ_SK_RKSL_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SK_
├─ 283.506 2666.310 cublas [M=8192, N=8192, K=512]
│  └─ nan 2666.310 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
├─ 223.326 307.709 matmul_kernel [M=8192, N=8192, K=512]
├─ 259.293 265.027 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 238.500 288.133 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 258.738 265.594 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 295.529 232.531 torch [M=8192, N=8192, K=512]
   └─ nan 232.531 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas

Legend (Metric: tflop16/s (inc) Min: 223.33 Max: 295.53)
█ 288.31 - 295.53
█ 273.87 - 288.31
█ 259.43 - 273.87
█ 244.99 - 259.43
█ 230.55 - 244.99
█ 223.33 - 230.55

name User code    ◀  Only in left graph    ▶  Only in right graph

root@dev-0:~/code/triton$ python python/tutorials/09-persistent-matmul.py
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
273.367 4022.105 ROOT
├─ nan 0.031 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.027 _ZN2at6native54_GLOBAL__N__a236ace4_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfLm4EPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SO_SH_EEvSJ_SK_RKSL_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SK_
├─ 284.284 2659.011 cublas [M=8192, N=8192, K=512]
│  └─ nan 2659.011 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
├─ 221.823 309.795 matmul_kernel [M=8192, N=8192, K=512]
├─ 254.755 269.748 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 240.774 285.411 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 259.109 265.214 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 295.100 232.868 torch [M=8192, N=8192, K=512]
   └─ nan 232.868 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas

Legend (Metric: tflop16/s (inc) Min: 221.82 Max: 295.10)
█ 287.77 - 295.10
█ 273.12 - 287.77
█ 258.46 - 273.12
█ 243.81 - 258.46
█ 229.15 - 243.81
█ 221.82 - 229.15

name User code    ◀  Only in left graph    ▶  Only in right graph

```

After:

```
root@dev-0:~/code/triton$ python python/tutorials/09-persistent-matmul.py
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
274.040 4012.227 ROOT
├─ nan 0.031 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.027 _ZN2at6native54_GLOBAL__N__a236ace4_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfLm4EPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SO_SH_EEvSJ_SK_RKSL_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SK_
├─ 285.369 2648.904 cublas [M=8192, N=8192, K=512]
│  └─ nan 2648.904 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
├─ 217.548 315.881 matmul_kernel [M=8192, N=8192, K=512]
├─ 262.312 261.976 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 244.740 280.785 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 255.113 269.368 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 292.108 235.253 torch [M=8192, N=8192, K=512]
   └─ nan 235.253 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas

Legend (Metric: tflop16/s (inc) Min: 217.55 Max: 292.11)
█ 284.65 - 292.11
█ 269.74 - 284.65
█ 254.83 - 269.74
█ 239.92 - 254.83
█ 225.00 - 239.92
█ 217.55 - 225.00

name User code    ◀  Only in left graph    ▶  Only in right graph

root@dev-0:~/code/triton$ python python/tutorials/09-persistent-matmul.py
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
274.997 3998.267 ROOT
├─ nan 0.031 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.027 _ZN2at6native54_GLOBAL__N__a236ace4_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfLm4EPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SO_SH_EEvSJ_SK_RKSL_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SK_
├─ 285.498 2647.706 cublas [M=8192, N=8192, K=512]
│  └─ nan 2647.706 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
├─ 217.884 315.394 matmul_kernel [M=8192, N=8192, K=512]
├─ 262.534 261.755 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 246.617 278.649 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 262.525 261.764 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 295.007 232.942 torch [M=8192, N=8192, K=512]
   └─ nan 232.942 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas

Legend (Metric: tflop16/s (inc) Min: 217.88 Max: 295.01)
█ 287.29 - 295.01
█ 271.87 - 287.29
█ 256.45 - 271.87
█ 241.02 - 256.45
█ 225.60 - 241.02
█ 217.88 - 225.60

name User code    ◀  Only in left graph    ▶  Only in right graph

```

[Blackwell] Hoist constant TMem allocation out of the loop (#5857)

For cases where the TMem contents are constant, hoisting the allocation
out of the loop avoids having to store multiple times.

[AMD] Use LLVM ops for fp16<->fp32 casts (#5859)

Inline assembly can block optimizations in the LLVM backend.

[BW][PIPELINE] Add an option to tl.range to disallow accumulator multi-buffering (#5858)

Rework mmav5 pipelining to allow pipelining the MMA even when
multi-buffering of the accumulator is impossible, by putting its uses in
the same stage as the MMA and blocking on a wait until the current MMA
finishes. Based on this support, introduce a new flag on `tl.range` that
controls whether multi-buffering of the accumulator of the dots in the
loop is allowed. Without the mentioned rework of mmav5 pipelining, we
would simply not pipeline cases where multi-buffering is disallowed.
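A sketch of how the flag might be used from the frontend (the keyword name is inferred from the `tt.disallow_acc_multi_buffer` attribute added in this PR; the exact frontend spelling is an assumption):

```python
import triton
import triton.language as tl

@triton.jit
def matmul_single_buffer_acc(a_ptr, b_ptr, c_ptr, M, N, K,
                             BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                             BLOCK_K: tl.constexpr):
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # Assumed keyword: the MMA is still pipelined, but the accumulator stays
    # in a single buffer and its uses share a stage with the MMA.
    for k in tl.range(0, K, BLOCK_K, disallow_acc_multi_buffer=True):
        offs_k = k + tl.arange(0, BLOCK_K)
        a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])
        b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])
        acc = tl.dot(a, b, acc)
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], acc.to(tl.float16))
```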

[AMD] NFC: Refactor DotOpMFMAConversionHelper (#5862)

This PR refactors `DotOpMFMAConversionHelper` by extracting utility
functions from `convertDot` to make it easier to extend in
#5845.

[AMD][CI] Switch to rocm hosted docker image (#5855)

The CI docker image has been moved from `rocmshared/pytorch` to
`rocm/pytorch`, so we need to adjust our workflows.

[triton][NFC] Refactor build function of ScanOp (#5861)

It's better to use ScanOp's own build function rather than reusing ReduceOp's.

Jokeren committed Feb 10, 2025
1 parent 61e4b07 commit c9a21b4
Showing 132 changed files with 4,071 additions and 1,839 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/integration-tests.yml
@@ -288,7 +288,7 @@ jobs:
runner: ${{fromJson(needs.Runner-Preparation.outputs.matrix-HIP)}}
name: Integration-Tests (${{matrix.runner[1] == 'gfx90a' && 'mi210' || 'mi300x'}})
container:
image: rocmshared/pytorch:rocm6.2.2_ubuntu22.04_py3.10_pytorch_2.5.1_asan
image: rocm/pytorch:rocm6.2.2_ubuntu22.04_py3.10_pytorch_2.5.1_asan
options: --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video --user root
steps:
- name: Checkout
2 changes: 1 addition & 1 deletion .github/workflows/integration-tests.yml.in
@@ -334,7 +334,7 @@ jobs:
name: Integration-Tests (${{matrix.runner[1] == 'gfx90a' && 'mi210' || 'mi300x'}})

container:
image: rocmshared/pytorch:rocm6.2.2_ubuntu22.04_py3.10_pytorch_2.5.1_asan
image: rocm/pytorch:rocm6.2.2_ubuntu22.04_py3.10_pytorch_2.5.1_asan
options: --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video --user root

steps:
2 changes: 1 addition & 1 deletion .github/workflows/wheels_v2.yml
@@ -22,7 +22,7 @@ jobs:
docker container prune -f
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4

# The LATEST_DATE here should be kept in sync with the one in Patch setup.py
- id: check-version
18 changes: 18 additions & 0 deletions README.md
@@ -284,3 +284,21 @@ Supported Hardware:
- NVIDIA GPUs (Compute Capability 8.0+)
- AMD GPUs (ROCm 6.2+)
- Under development: CPUs

# Development Container (Dev Container)

**Dev Containers** for the Triton project are available from
the [triton-dev-containers repository](https://github.com/redhat-et/triton-dev-containers)

### Key Benefits:
- **Consistency**: All developers can work with the same development
environment, ensuring uniform behavior across different systems.
- **Isolation**: The container prevents potential conflicts with software
installed on your local machine.
- **Portability**: Easily share the development environment with team members,
minimizing onboarding time and setup issues.

### How to Use the Dev Container:

For detailed instructions on how to use the dev containers please see
the [dev container user guide](https://github.com/redhat-et/triton-dev-containers/blob/main/.devcontainer/devcontainer.md)
2 changes: 1 addition & 1 deletion cmake/llvm-hash.txt
@@ -1 +1 @@
c118864223c6309378cd704f3406533474c2759f
ffe3129e9bdc146ee4d91e849173d1c64b1ae974
@@ -28,20 +28,6 @@ constexpr int patternBenefitClampOptimizedPattern = 20;
constexpr int patternBenefitConvertLayoutOptimizedPattern = 20;
constexpr int patternBenefitNvidiaTensorCoreSubviewPattern = 20;

struct BackendCallbacks {
/**
* A backend-specific callback for appending auxiliary data during
* `LocalStoreOp` conversion.
*
* @param[in] op The reference to the re-written `LocalStoreOp`.
* @param[in] count The number of issued LLVM instructions.
* @param[in] type The input type of issued LLVM instructions.
*/
std::function<void(triton::gpu::LocalStoreOp op, size_t llvmOpCount,
Type llvmOpType)>
localStoreOpConversion = nullptr;
};

void populateElementwiseOpToLLVMPatterns(
LLVMTypeConverter &typeConverter, RewritePatternSet &patterns,
ModuleAxisInfoAnalysis &axisInfoAnalysis, const TargetInfoBase &targetInfo,
@@ -51,10 +37,10 @@ void populateElementwiseOpToLLVMPatterns(
// callback receives 1) the current source op, 2) the number of issued LLVM
// instructions and 3) their input types. Each MLIR backend can provide a
// callback and, thus, handle backend-specific behaviors.
void populateMemoryOpToLLVMPatterns(
LLVMTypeConverter &typeConverter, const TargetInfoBase &targetInfo,
RewritePatternSet &patterns, PatternBenefit benefit,
std::optional<BackendCallbacks> backendCallbacks = std::nullopt);
void populateMemoryOpToLLVMPatterns(LLVMTypeConverter &typeConverter,
const TargetInfoBase &targetInfo,
RewritePatternSet &patterns,
PatternBenefit benefit);

void populateAssertOpToLLVMPattern(LLVMTypeConverter &typeConverter,
RewritePatternSet &patterns,
5 changes: 5 additions & 0 deletions include/triton/Conversion/TritonGPUToLLVM/TargetInfoBase.h
@@ -91,6 +91,11 @@ class TargetInfoBase {

virtual bool supportVectorizedAtomics() const = 0;

// Helper used by targets to annotate store operations during lowering to
// llvm.
virtual void storeOpAnnotation(triton::gpu::LocalStoreOp op,
size_t localStoreOpCount, Type type) const {}

virtual ~TargetInfoBase() {}
};
} // namespace mlir::triton
2 changes: 2 additions & 0 deletions include/triton/Dialect/Triton/IR/OpInterfaces.h
@@ -11,6 +11,8 @@ namespace impl {

LogicalResult verifyTransposeOpInterface(Operation *op);

LogicalResult verifyDotOpInterface(Operation *op);

} // namespace impl

} // namespace triton
47 changes: 0 additions & 47 deletions include/triton/Dialect/Triton/IR/Traits.h
@@ -58,53 +58,6 @@ class VerifyTensorLayoutsTrait
}
};

// Verify if the op is a dot-like operation.
// A dot-like operation should have three operands.
// The first two operands should share a common dimension, and the result
// should have the dimensions of the two operands that are not shared.
// A dot-like operation can be either 2d or 3d.
// In the 3d case, the first dimension of operands is the batch dimension.
template <class ConcreteType>
class DotLike : public TraitBase<ConcreteType, DotLike> {
public:
static LogicalResult verifyTrait(Operation *op) {
if (op->getNumOperands() < 3)
return op->emitOpError("expected at least 3 operands");
auto aTy = cast<ShapedType>(op->getOperand(0).getType());
auto bTy = cast<ShapedType>(op->getOperand(1).getType());
auto cTy = cast<ShapedType>(op->getOperand(2).getType());
auto aShape = aTy.getShape();
auto bShape = bTy.getShape();
auto cShape = cTy.getShape();
// Check if all 3d or all 2d
if (aShape.size() != 2 && aShape.size() != 3)
return op->emitOpError("expected operands to be 2d or 3d");
if (aShape.size() != bShape.size() || aShape.size() != cShape.size())
return op->emitOpError("expected all operands to have the same rank");
// Check if the first two operands share a common dimension
// TODO: enable back with an interface to support scaled dot.
// if (aShape[aShape.size() - 1] != bShape[aShape.size() - 2])
// return op->emitOpError("expected the last dimension of the first
// operand "
// "to be equal to the second-to-last dimension of
// " "the second operand");
// Check the batch dimension
if (aShape.size() == 3 &&
(aShape[0] != cShape[0] || bShape[0] != cShape[0]))
return op->emitOpError("expected the first dimension of the first "
"operand to be equal to the first dimension of "
"the result");
// Check the output shape
if (cShape[cShape.size() - 2] != aShape[aShape.size() - 2] ||
cShape[cShape.size() - 1] != bShape[aShape.size() - 1])
return op->emitOpError(
"expected the output shape to be the concatenation of the last "
"dimension of the first operand and the last dimension of the "
"second ");
return success();
}
};

template <typename ConcreteType>
class SameOperandsAndResultEncoding
: public TraitBase<ConcreteType, SameOperandsAndResultEncoding> {
1 change: 0 additions & 1 deletion include/triton/Dialect/Triton/IR/TritonInterfaces.td
@@ -6,7 +6,6 @@ include "mlir/Interfaces/InferTypeOpInterface.td"

def TensorSizeTrait : NativeOpTrait<"TensorSizeTrait">;
def VerifyTensorLayoutsTrait : NativeOpTrait<"VerifyTensorLayoutsTrait">;
def DotLike : NativeOpTrait<"DotLike">;
def SameOperandsEncoding : NativeOpTrait<"SameOperandsEncoding">;
def SameOperandsAndResultEncoding : NativeOpTrait<"SameOperandsAndResultEncoding">;
def SameLoadStoreOperandsShape : NativeOpTrait<"SameLoadStoreOperandsShape">;
22 changes: 21 additions & 1 deletion include/triton/Dialect/Triton/IR/TritonOpInterfaces.td
@@ -29,7 +29,27 @@ def TransposeOpInterface : OpInterface<"TransposeOpInterface"> {
/*args=*/(ins)>
];

let verify = [{ return ::mlir::triton::impl::verifyTransposeOpInterface($_op); }];
let verify = [{ return ::mlir::triton::impl::verifyTransposeOpInterface($_op); }];
}

def DotOpInterface : OpInterface<"DotOpInterface"> {
let description = [{
This interface is implemented by operations that perform a dot product.
}];

let cppNamespace = "::mlir::triton";

let methods = [
InterfaceMethod<
/*desc=*/[{
Verifies the dimensions of the A and B DotOp operands.
}],
/*retType=*/"bool",
/*methodName=*/"verifyDims",
/*args=*/(ins)>
];

let verify = [{ return ::mlir::triton::impl::verifyDotOpInterface($_op); }];
}


4 changes: 2 additions & 2 deletions include/triton/Dialect/Triton/IR/TritonOps.td
@@ -631,7 +631,7 @@ def TT_GetNumProgramsOp : TT_Op<"get_num_programs", [Pure]> {
//
def TT_DotOp : TT_Op<"dot", [Pure,
DeclareOpInterfaceMethods<InferTypeOpInterface>,
DotLike,
DeclareOpInterfaceMethods<DotOpInterface>,
TypesMatchWith<"result's type matches accumulator's type",
"d", "c", "$_self">]> {
let summary = "dot";
@@ -671,7 +671,7 @@ def TT_DotOp : TT_Op<"dot", [Pure,
//
def TT_DotScaledOp : TT_Op<"dot_scaled", [Pure,
AttrSizedOperandSegments,
DotLike,
DeclareOpInterfaceMethods<DotOpInterface>,
TypesMatchWith<"result's type matches accumulator's type",
"d", "c", "$_self">]> {
let summary = "dot_scaled";
6 changes: 4 additions & 2 deletions include/triton/Dialect/TritonGPU/IR/Dialect.h
@@ -70,10 +70,12 @@ struct SharedMemory : public SideEffects::Resource::Base<SharedMemory> {
StringRef getName() final { return "<SharedMemory>"; }
};

// Convert a distributed layout to a linear encoding
LinearEncodingAttr toLinearEncoding(Attribute layout, ArrayRef<int64_t> shape);

unsigned getTotalElemsPerThread(Type type);

unsigned getTotalElemsPerThread(Attribute layout, ArrayRef<int64_t> shape,
Type eltTy);
unsigned getTotalElemsPerThread(Attribute layout, ArrayRef<int64_t> shape);

SmallVector<unsigned> getElemsPerThread(Type type);

21 changes: 13 additions & 8 deletions include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
@@ -501,13 +501,17 @@ We call each individual tile "rep".
InterfaceMethod<"Return total element size per thread.",
"unsigned",
"getTotalElemsPerThread",
(ins "ArrayRef<int64_t>":$tensorShape,
"Type":$eltTy)>,
(ins "ArrayRef<int64_t>":$shape),
/*defaultImplementation=*/[{
return toLinearEncoding($_self, shape).getTotalElemsPerThread(shape);
}]>,
InterfaceMethod<"Return element size per thread in each dimension.",
"SmallVector<unsigned>",
"getElemsPerThread",
(ins "ArrayRef<int64_t>":$tensorShape,
"Type":$eltTy)>,
(ins "ArrayRef<int64_t>":$shape),
/*defaultImplementation=*/[{
return toLinearEncoding($_self, shape).getElemsPerThread(shape);
}]>,
// Interface for the meta information about the multiple thread hierarchy.
InterfaceMethod<"Get the shape of the warps per CTA.",
"SmallVector<unsigned>",
@@ -577,8 +581,7 @@ L(T) = [ {0,8} , {1,9} , {2,10}, {3,11}, {0,8} , {1, 9} , {2, 10}, {3, 11},
}];

code extraDistributedDeclaration = extraBaseClassDeclaration # [{
unsigned getTotalElemsPerThread(ArrayRef<int64_t> shape, Type eltTy) const;
SmallVector<unsigned> getElemsPerThread(ArrayRef<int64_t> shape, Type eltTy) const;
// Implemented in subclasses
SmallVector<unsigned> getRepOrder() const;
SmallVector<unsigned> getCTAsPerCGA() const;
SmallVector<unsigned> getCTAOrder() const;
@@ -613,6 +616,10 @@ def LinearEncodingAttr : DistributedEncoding<"LinearEncoding", "linear_encoding"
let parameters = (ins LinearLayoutParam:$linearLayout);

let extraClassDeclaration = extraDistributedDeclaration # [{
// Generic distributed encoding methods
unsigned getTotalElemsPerThread(ArrayRef<int64_t> shape) const;
SmallVector<unsigned> getElemsPerThread(ArrayRef<int64_t> shape) const;

SmallVector<unsigned> getContigPerThread() const;
SmallVector<unsigned> getOrder() const;

@@ -965,7 +972,6 @@ V [ 0,4,8...60 1,5...61 2,6...62 3,7...63 ] [ 128,132...188 129,
return true;
}
SmallVector<unsigned> getSizePerThreadForOperand(int kWidth, int opIdx) const;
unsigned getTotalElemsPerThreadForOperand(ArrayRef<int64_t> shape, Type eltTy, int kWidth, int opIdx) const;
SmallVector<int64_t> getInstrShapeForOperand(int kWidth, int opIdx) const;
SmallVector<int64_t> getRepForOperand(ArrayRef<int64_t> operandShape, int kWidth, int opIdx) const;
SmallVector<unsigned> getRepOrderForOperand(int opIdx) const;
@@ -1095,7 +1101,6 @@ Row |
return true;
}
SmallVector<unsigned> getSizePerThreadForOperand(int kWidth, int opIdx) const;
unsigned getTotalElemsPerThreadForOperand(ArrayRef<int64_t> shape, Type eltTy, int kWidth, int opIdx) const;
SmallVector<int64_t> getElemsPerInstrForOperands() const;
SmallVector<int64_t> getRepForOperand(ArrayRef<int64_t> operandShape,
Type elemType, int kWidth, int opIdx) const;
@@ -10,6 +10,8 @@ namespace mlir {
namespace triton {

static const char *kNumStagesAttrName = "tt.num_stages";
static const char *kDisallowAccMultiBufferAttrName =
"tt.disallow_acc_multi_buffer";
static const char *kLoopStageAttrName = "loop.stage";
static const char *kLoopClusterAttrName = "loop.cluster";

@@ -37,6 +39,10 @@ void addOps(scf::ForOp forOp, int stage,
void replaceUsesAndPropagateType(OpBuilder &builder, Operation *oldUse,
Value val);

// Return true if the given ForOp has the attribute
// `tt.disallow_acc_multi_buffer` set to true.
bool getDisallowAccMultiBuffer(scf::ForOp forOp);

// Return the minClusterId and maxClusterId for the given ForOp.
std::pair<int, int> getMinMaxCluster(scf::ForOp &forOp);
std::pair<int, int> getStageCluster(Operation *op);
8 changes: 0 additions & 8 deletions include/triton/Dialect/TritonGPU/Transforms/Utility.h
@@ -200,14 +200,6 @@ StringRef getAMDArch(Operation *module);
std::optional<mlir::triton::gpu::SwizzledSharedEncodingAttr>
getSharedEncIfAllUsersAreDotEnc(Value val, bool &incompatible);

enum class MMALoadType {
SharedV3,
Registers, // may be v2 or v3
DoNotPipeline, // could be a valid shared/registers MMA operand, but skip
// pipelining
};
MMALoadType getMMALoadType(Operation *loadOp);

// Convert \param op operands and results to layout \param encoding.
void convertOpEncoding(Attribute encoding, Operation *op);

5 changes: 5 additions & 0 deletions include/triton/Dialect/TritonNvidiaGPU/IR/CMakeLists.txt
@@ -15,3 +15,8 @@ mlir_tablegen(TritonNvidiaGPUAttrDefs.cpp.inc -gen-attrdef-defs)
mlir_tablegen(OpsEnums.h.inc -gen-enum-decls)
mlir_tablegen(OpsEnums.cpp.inc -gen-enum-defs)
add_public_tablegen_target(TritonNvidiaGPUAttrDefsIncGen)

set(LLVM_TARGET_DEFINITIONS TritonNvidiaGPUOpInterfaces.td)
mlir_tablegen(TritonNvidiaGPUOpInterfaces.h.inc -gen-op-interface-decls)
mlir_tablegen(TritonNvidiaGPUOpInterfaces.cpp.inc -gen-op-interface-defs)
add_public_tablegen_target(TritonNvidiaGPUOpInterfacesIncGen)
2 changes: 2 additions & 0 deletions include/triton/Dialect/TritonNvidiaGPU/IR/Dialect.h
@@ -37,6 +37,8 @@
#define GET_ATTRDEF_CLASSES
#include "triton/Dialect/TritonNvidiaGPU/IR/TritonNvidiaGPUAttrDefs.h.inc"

#include "triton/Dialect/TritonNvidiaGPU/IR/TritonNvidiaGPUOpInterfaces.h.inc"

#define GET_OP_CLASSES
#include "triton/Dialect/TritonNvidiaGPU/IR/Ops.h.inc"

@@ -0,0 +1,42 @@
#ifndef TRITON_NVIDIAGPU_OP_INTERFACES
#define TRITON_NVIDIAGPU_OP_INTERFACES

include "mlir/IR/OpBase.td"

def MMAv5OpInterface : OpInterface<"MMAv5OpInterface"> {
let description = [{
This interface is implemented by MMAv5 dot and dot scaled ops.
}];

let cppNamespace = "::mlir::triton::nvidia_gpu";

// We can add more methods as needed.
let methods = [
InterfaceMethod<"Return the accumulator init flag.",
"::mlir::Value",
"useAccumulator">,
InterfaceMethod<"Set the accumulator init flag.",
"void",
"setUseAccumulator",
(ins "::mlir::Value":$flag)>,
InterfaceMethod<"Associate a new barrier to this MMAv5 op.",
"void",
"setBarrier",
(ins "::mlir::Value":$barrier)>,
InterfaceMethod<"Return the accumulator.",
"::mlir::Value",
"getAccumulator">,
InterfaceMethod<"Set the accumulator.",
"void",
"setAccumulator",
(ins "::mlir::Value":$accum)>,
InterfaceMethod<"Return the predicate of this op.",
"::mlir::Value",
"getPredicate">,
InterfaceMethod<"Set the predicate of this op.",
"void",
"setPredicate",
(ins "::mlir::Value":$pred)>,
];
}
#endif // TRITON_NVIDIAGPU_OP_INTERFACES
