
[PROTON-DEV] Restructure files #5846

Closed

wants to merge 5 commits
Conversation

Jokeren
Contributor

@Jokeren Jokeren commented Feb 7, 2025

No description provided.

@Jokeren Jokeren force-pushed the keren/proton-dev-gpu branch 3 times, most recently from c1e53ff to 04b6c94 Compare February 10, 2025 16:31
Jokeren and others added 3 commits February 10, 2025 11:32
[Blackwell] Enable MMA pipelining for scaled dot when TMEM copy is used (#5812)

This PR enables MMA pipelining for scaled dot.

The main difficulty this PR overcomes is the dependency cycle between
TMEM copy rewriting and SWP: currently, TMEM copy rewriting relies on
SWP to put the loading of scales into SMEM, while applying MMA
pipelining during SWP requires TMEM copy rewriting to have happened
beforehand. I propose to break the cycle by having the loading of
scales go through `local_alloc` and `local_load` in `AccelerateMatmul`.
This way, TMEM copy rewriting happens during [the first call to
`OptimizeDotOperands`](https://github.com/triton-lang/triton/blob/1e0e51c4aeb3e1beea000da5d0e494f8b9ac40dd/third_party/nvidia/backend/compiler.py#L260),
before SWP, and the local alloc and load added in `AccelerateMatmul` are
eliminated during SWP. It's a bit ad hoc to add local alloc for scales
there, since scales do not need to be in SMEM, but other solutions, like
decoupling MMA pipelining from SWP, are more difficult.

The other changes in this PR are for making SWP recognize loading of
scales when there is TMEM copy between scale load and MMA.
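As a rough sketch, the ordering that breaks the cycle can be modeled like this (the function names below are hypothetical stand-ins, not the real Triton pass entry points):

```python
# Toy model of the pass ordering described above; the names are
# stand-ins, not the actual Triton compiler passes.
def accelerate_matmul(log):
    # Routes scale loads through local_alloc/local_load.
    log.append("accelerate_matmul")

def optimize_dot_operands(log):
    # TMEM copy rewriting happens on this first call, before SWP.
    log.append("optimize_dot_operands")

def software_pipeline(log):
    # SWP: applies MMA pipelining and eliminates the temporary local ops.
    log.append("software_pipeline")

def run_pipeline():
    log = []
    for p in (accelerate_matmul, optimize_dot_operands, software_pipeline):
        p(log)
    return log
```

The essential point is only the ordering: TMEM copy rewriting runs strictly before SWP, so SWP no longer has to produce the SMEM loads that rewriting depends on.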

@ThomasRaoux @pawelszczerbuk @csullivan @mbrookhart @binarybana

---------

Co-authored-by: Masahiro Masuda <[email protected]>
Co-authored-by: Jason Knight <[email protected]>

Fix default num_stages values mismatch between Python frontend and MLIR. (#5804)

Address the discrepancy between the default value of `num_stages` in the
Python frontend and the MLIR implementation. Previously, the values were
not aligned, which could cause inconsistent behavior between the Python
frontend and triton-opt.

This PR updates the default value of `num_stages` in the Python frontend
to match the values used in MLIR:
- [TritonGPUPipeline](https://github.com/triton-lang/triton/blob/032fa41a45847cdc00119ed3bdd5bc0adab9c938/include/triton/Dialect/TritonGPU/Transforms/Passes.td#L21)
- [TritonGPUTestPipelineAssignLatencies](https://github.com/triton-lang/triton/blob/032fa41a45847cdc00119ed3bdd5bc0adab9c938/include/triton/Dialect/TritonGPU/Transforms/Passes.td#L58)
- [TritonGPULoopScheduling](https://github.com/triton-lang/triton/blob/032fa41a45847cdc00119ed3bdd5bc0adab9c938/include/triton/Dialect/TritonGPU/Transforms/Passes.td#L266)
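One way to avoid this class of drift is to resolve the default from a single shared constant. The sketch below is illustrative only; the constant name is hypothetical and the value 3 is an assumption, not taken from the diff:

```python
# Hypothetical sketch: one constant mirrors the MLIR pass default so
# the Python frontend cannot silently diverge from it.
MLIR_DEFAULT_NUM_STAGES = 3  # assumed to match the Passes.td default

def resolve_num_stages(num_stages=None):
    """Return the user's value, or the shared default when unset."""
    return MLIR_DEFAULT_NUM_STAGES if num_stages is None else num_stages
```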

<!---
The core Triton is a small number of people, and we receive many PRs
(thank
you!).  To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the
following
tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]`
with
`[x]` to indicate you have done them.
-->

- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

Co-authored-by: pawelszczerbuk <[email protected]>

Bump actions/checkout from 3 to 4 (#5714)

Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to
4.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/actions/checkout/releases">actions/checkout's
releases</a>.</em></p>
<blockquote>
<h2>v4.0.0</h2>
<h2>What's Changed</h2>
<ul>
<li>Update default runtime to node20 by <a
href="https://github.com/takost"><code>@​takost</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1436">actions/checkout#1436</a></li>
<li>Support fetching without the --progress option by <a
href="https://github.com/simonbaird"><code>@​simonbaird</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1067">actions/checkout#1067</a></li>
<li>Release 4.0.0 by <a
href="https://github.com/takost"><code>@​takost</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1447">actions/checkout#1447</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a href="https://github.com/takost"><code>@​takost</code></a> made
their first contribution in <a
href="https://redirect.github.com/actions/checkout/pull/1436">actions/checkout#1436</a></li>
<li><a
href="https://github.com/simonbaird"><code>@​simonbaird</code></a> made
their first contribution in <a
href="https://redirect.github.com/actions/checkout/pull/1067">actions/checkout#1067</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/actions/checkout/compare/v3...v4.0.0">https://github.com/actions/checkout/compare/v3...v4.0.0</a></p>
<h2>v3.6.0</h2>
<h2>What's Changed</h2>
<ul>
<li>Mark test scripts with Bash'isms to be run via Bash by <a
href="https://github.com/dscho"><code>@​dscho</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1377">actions/checkout#1377</a></li>
<li>Add option to fetch tags even if fetch-depth &gt; 0 by <a
href="https://github.com/RobertWieczoreck"><code>@​RobertWieczoreck</code></a>
in <a
href="https://redirect.github.com/actions/checkout/pull/579">actions/checkout#579</a></li>
<li>Release 3.6.0 by <a
href="https://github.com/luketomlinson"><code>@​luketomlinson</code></a>
in <a
href="https://redirect.github.com/actions/checkout/pull/1437">actions/checkout#1437</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a
href="https://github.com/RobertWieczoreck"><code>@​RobertWieczoreck</code></a>
made their first contribution in <a
href="https://redirect.github.com/actions/checkout/pull/579">actions/checkout#579</a></li>
<li><a
href="https://github.com/luketomlinson"><code>@​luketomlinson</code></a>
made their first contribution in <a
href="https://redirect.github.com/actions/checkout/pull/1437">actions/checkout#1437</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/actions/checkout/compare/v3.5.3...v3.6.0">https://github.com/actions/checkout/compare/v3.5.3...v3.6.0</a></p>
<h2>v3.5.3</h2>
<h2>What's Changed</h2>
<ul>
<li>Fix: Checkout Issue in self hosted runner due to faulty submodule
check-ins by <a
href="https://github.com/megamanics"><code>@​megamanics</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1196">actions/checkout#1196</a></li>
<li>Fix typos found by codespell by <a
href="https://github.com/DimitriPapadopoulos"><code>@​DimitriPapadopoulos</code></a>
in <a
href="https://redirect.github.com/actions/checkout/pull/1287">actions/checkout#1287</a></li>
<li>Add support for sparse checkouts by <a
href="https://github.com/dscho"><code>@​dscho</code></a> and <a
href="https://github.com/dfdez"><code>@​dfdez</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1369">actions/checkout#1369</a></li>
<li>Release v3.5.3 by <a
href="https://github.com/TingluoHuang"><code>@​TingluoHuang</code></a>
in <a
href="https://redirect.github.com/actions/checkout/pull/1376">actions/checkout#1376</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a
href="https://github.com/megamanics"><code>@​megamanics</code></a> made
their first contribution in <a
href="https://redirect.github.com/actions/checkout/pull/1196">actions/checkout#1196</a></li>
<li><a
href="https://github.com/DimitriPapadopoulos"><code>@​DimitriPapadopoulos</code></a>
made their first contribution in <a
href="https://redirect.github.com/actions/checkout/pull/1287">actions/checkout#1287</a></li>
<li><a href="https://github.com/dfdez"><code>@​dfdez</code></a> made
their first contribution in <a
href="https://redirect.github.com/actions/checkout/pull/1369">actions/checkout#1369</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/actions/checkout/compare/v3...v3.5.3">https://github.com/actions/checkout/compare/v3...v3.5.3</a></p>
<h2>v3.5.2</h2>
<h2>What's Changed</h2>
<ul>
<li>Fix: Use correct API url / endpoint in GHES by <a
href="https://github.com/fhammerl"><code>@​fhammerl</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1289">actions/checkout#1289</a>
based on <a
href="https://redirect.github.com/actions/checkout/issues/1286">#1286</a>
by <a href="https://github.com/1newsr"><code>@​1newsr</code></a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/actions/checkout/compare/v3.5.1...v3.5.2">https://github.com/actions/checkout/compare/v3.5.1...v3.5.2</a></p>
<h2>v3.5.1</h2>
<h2>What's Changed</h2>
<ul>
<li>Improve checkout performance on Windows runners by upgrading
<code>@​actions/github</code> dependency by <a
href="https://github.com/BrettDong"><code>@​BrettDong</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1246">actions/checkout#1246</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a href="https://github.com/BrettDong"><code>@​BrettDong</code></a>
made their first contribution in <a
href="https://redirect.github.com/actions/checkout/pull/1246">actions/checkout#1246</a></li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/actions/checkout/blob/main/CHANGELOG.md">actions/checkout's
changelog</a>.</em></p>
<blockquote>
<h1>Changelog</h1>
<h2>v4.2.2</h2>
<ul>
<li><code>url-helper.ts</code> now leverages well-known environment
variables by <a href="https://github.com/jww3"><code>@​jww3</code></a>
in <a
href="https://redirect.github.com/actions/checkout/pull/1941">actions/checkout#1941</a></li>
<li>Expand unit test coverage for <code>isGhes</code> by <a
href="https://github.com/jww3"><code>@​jww3</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1946">actions/checkout#1946</a></li>
</ul>
<h2>v4.2.1</h2>
<ul>
<li>Check out other refs/* by commit if provided, fall back to ref by <a
href="https://github.com/orhantoy"><code>@​orhantoy</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1924">actions/checkout#1924</a></li>
</ul>
<h2>v4.2.0</h2>
<ul>
<li>Add Ref and Commit outputs by <a
href="https://github.com/lucacome"><code>@​lucacome</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1180">actions/checkout#1180</a></li>
<li>Dependency updates by <a
href="https://github.com/dependabot"><code>@​dependabot</code></a>- <a
href="https://redirect.github.com/actions/checkout/pull/1777">actions/checkout#1777</a>,
<a
href="https://redirect.github.com/actions/checkout/pull/1872">actions/checkout#1872</a></li>
</ul>
<h2>v4.1.7</h2>
<ul>
<li>Bump the minor-npm-dependencies group across 1 directory with 4
updates by <a
href="https://github.com/dependabot"><code>@​dependabot</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1739">actions/checkout#1739</a></li>
<li>Bump actions/checkout from 3 to 4 by <a
href="https://github.com/dependabot"><code>@​dependabot</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1697">actions/checkout#1697</a></li>
<li>Check out other refs/* by commit by <a
href="https://github.com/orhantoy"><code>@​orhantoy</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1774">actions/checkout#1774</a></li>
<li>Pin actions/checkout's own workflows to a known, good, stable
version. by <a href="https://github.com/jww3"><code>@​jww3</code></a> in
<a
href="https://redirect.github.com/actions/checkout/pull/1776">actions/checkout#1776</a></li>
</ul>
<h2>v4.1.6</h2>
<ul>
<li>Check platform to set archive extension appropriately by <a
href="https://github.com/cory-miller"><code>@​cory-miller</code></a> in
<a
href="https://redirect.github.com/actions/checkout/pull/1732">actions/checkout#1732</a></li>
</ul>
<h2>v4.1.5</h2>
<ul>
<li>Update NPM dependencies by <a
href="https://github.com/cory-miller"><code>@​cory-miller</code></a> in
<a
href="https://redirect.github.com/actions/checkout/pull/1703">actions/checkout#1703</a></li>
<li>Bump github/codeql-action from 2 to 3 by <a
href="https://github.com/dependabot"><code>@​dependabot</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1694">actions/checkout#1694</a></li>
<li>Bump actions/setup-node from 1 to 4 by <a
href="https://github.com/dependabot"><code>@​dependabot</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1696">actions/checkout#1696</a></li>
<li>Bump actions/upload-artifact from 2 to 4 by <a
href="https://github.com/dependabot"><code>@​dependabot</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1695">actions/checkout#1695</a></li>
<li>README: Suggest <code>user.email</code> to be
<code>41898282+github-actions[bot]@users.noreply.github.com</code> by <a
href="https://github.com/cory-miller"><code>@​cory-miller</code></a> in
<a
href="https://redirect.github.com/actions/checkout/pull/1707">actions/checkout#1707</a></li>
</ul>
<h2>v4.1.4</h2>
<ul>
<li>Disable <code>extensions.worktreeConfig</code> when disabling
<code>sparse-checkout</code> by <a
href="https://github.com/jww3"><code>@​jww3</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1692">actions/checkout#1692</a></li>
<li>Add dependabot config by <a
href="https://github.com/cory-miller"><code>@​cory-miller</code></a> in
<a
href="https://redirect.github.com/actions/checkout/pull/1688">actions/checkout#1688</a></li>
<li>Bump the minor-actions-dependencies group with 2 updates by <a
href="https://github.com/dependabot"><code>@​dependabot</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1693">actions/checkout#1693</a></li>
<li>Bump word-wrap from 1.2.3 to 1.2.5 by <a
href="https://github.com/dependabot"><code>@​dependabot</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1643">actions/checkout#1643</a></li>
</ul>
<h2>v4.1.3</h2>
<ul>
<li>Check git version before attempting to disable
<code>sparse-checkout</code> by <a
href="https://github.com/jww3"><code>@​jww3</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1656">actions/checkout#1656</a></li>
<li>Add SSH user parameter by <a
href="https://github.com/cory-miller"><code>@​cory-miller</code></a> in
<a
href="https://redirect.github.com/actions/checkout/pull/1685">actions/checkout#1685</a></li>
<li>Update <code>actions/checkout</code> version in
<code>update-main-version.yml</code> by <a
href="https://github.com/jww3"><code>@​jww3</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1650">actions/checkout#1650</a></li>
</ul>
<h2>v4.1.2</h2>
<ul>
<li>Fix: Disable sparse checkout whenever <code>sparse-checkout</code>
option is not present <a
href="https://github.com/dscho"><code>@​dscho</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1598">actions/checkout#1598</a></li>
</ul>
<h2>v4.1.1</h2>
<ul>
<li>Correct link to GitHub Docs by <a
href="https://github.com/peterbe"><code>@​peterbe</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1511">actions/checkout#1511</a></li>
<li>Link to release page from what's new section by <a
href="https://github.com/cory-miller"><code>@​cory-miller</code></a> in
<a
href="https://redirect.github.com/actions/checkout/pull/1514">actions/checkout#1514</a></li>
</ul>
<h2>v4.1.0</h2>
<ul>
<li><a href="https://redirect.github.com/actions/checkout/pull/1396">Add
support for partial checkout filters</a></li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/actions/checkout/commit/11bd71901bbe5b1630ceea73d27597364c9af683"><code>11bd719</code></a>
Prepare 4.2.2 Release (<a
href="https://redirect.github.com/actions/checkout/issues/1953">#1953</a>)</li>
<li><a
href="https://github.com/actions/checkout/commit/e3d2460bbb42d7710191569f88069044cfb9d8cf"><code>e3d2460</code></a>
Expand unit test coverage (<a
href="https://redirect.github.com/actions/checkout/issues/1946">#1946</a>)</li>
<li><a
href="https://github.com/actions/checkout/commit/163217dfcd28294438ea1c1c149cfaf66eec283e"><code>163217d</code></a>
<code>url-helper.ts</code> now leverages well-known environment
variables. (<a
href="https://redirect.github.com/actions/checkout/issues/1941">#1941</a>)</li>
<li><a
href="https://github.com/actions/checkout/commit/eef61447b9ff4aafe5dcd4e0bbf5d482be7e7871"><code>eef6144</code></a>
Prepare 4.2.1 release (<a
href="https://redirect.github.com/actions/checkout/issues/1925">#1925</a>)</li>
<li><a
href="https://github.com/actions/checkout/commit/6b42224f41ee5dfe5395e27c8b2746f1f9955030"><code>6b42224</code></a>
Add workflow file for publishing releases to immutable action package
(<a
href="https://redirect.github.com/actions/checkout/issues/1919">#1919</a>)</li>
<li><a
href="https://github.com/actions/checkout/commit/de5a000abf73b6f4965bd1bcdf8f8d94a56ea815"><code>de5a000</code></a>
Check out other refs/* by commit if provided, fall back to ref (<a
href="https://redirect.github.com/actions/checkout/issues/1924">#1924</a>)</li>
<li><a
href="https://github.com/actions/checkout/commit/d632683dd7b4114ad314bca15554477dd762a938"><code>d632683</code></a>
Prepare 4.2.0 release (<a
href="https://redirect.github.com/actions/checkout/issues/1878">#1878</a>)</li>
<li><a
href="https://github.com/actions/checkout/commit/6d193bf28034eafb982f37bd894289fe649468fc"><code>6d193bf</code></a>
Bump braces from 3.0.2 to 3.0.3 (<a
href="https://redirect.github.com/actions/checkout/issues/1777">#1777</a>)</li>
<li><a
href="https://github.com/actions/checkout/commit/db0cee9a514becbbd4a101a5fbbbf47865ee316c"><code>db0cee9</code></a>
Bump the minor-npm-dependencies group across 1 directory with 4 updates
(<a
href="https://redirect.github.com/actions/checkout/issues/1872">#1872</a>)</li>
<li><a
href="https://github.com/actions/checkout/commit/b6849436894e144dbce29d7d7fda2ae3bf9d8365"><code>b684943</code></a>
Add Ref and Commit outputs (<a
href="https://redirect.github.com/actions/checkout/issues/1180">#1180</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/actions/checkout/compare/v3...v4">compare
view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=actions/checkout&package-manager=github_actions&previous-version=3&new-version=4)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)

</details>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[mlir][dialect] Refactor DotLike trait into a DotOpInterface + Enable verification of scaled_dot (#5796)

The Triton MLIR dialect presently has a `DotLike` trait which DotOps
have.

The problem with this trait is that its current implementation prevents
`scaled_dot` from being verified properly (dimensions are not checked at
the moment: "TODO: enable back with an interface to support scaled
dot.").
This PR refactors the `DotLike` trait into an interface that declares a
`verifyDims` function, which checks whether the dims of the A and B
operands are compatible (e.g., MxK1 and K2xN with K1 == K2 in the
simple case).

The initial implementation of DotOpInterface is similar to the prior
`DotLike` trait with the exception that it includes the `verifyDims`
function which all DotOps must implement---this function just checks
whether the dimensions of the A, B inputs match.

In the future this interface can be extended to include more
functionality.

Since this enables the verifier for `scaled_dot`, I think the existing
scaled dot tests should cover the changes made in this PR; if that is
wrong, I will add additional tests.
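A plain-Python sketch of the dimension check that `verifyDims` performs (hypothetical helper; the real interface is implemented in C++ on the dot ops):

```python
def verify_dot_dims(a_shape, b_shape):
    """Check that A (M x K1) and B (K2 x N) are dot-compatible, i.e. K1 == K2."""
    # Only the simple rank-2 case is modeled here.
    if len(a_shape) != 2 or len(b_shape) != 2:
        return False
    return a_shape[1] == b_shape[0]
```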

================================================================
<!---
The core Triton is a small number of people, and we receive many PRs
(thank
you!).  To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the
following
tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]`
with
`[x]` to indicate you have done them.
-->

- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
- [x] This PR does not need a test because `I think this should be
covered by existing tests`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

[ANALYSIS] Fixing overflow in AxisInfo.cpp. (#5821)

UBSan detected that the shift is happening on an int32 variable, but the
shift amount could be larger than what int32 supports.

I believe the change is too trivial to require a test, but please advise
accordingly if you think there is an appropriate way to capture this.
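The failure mode is a left shift on a 32-bit value by an amount of 31 or more. A minimal sketch of the safe pattern (hypothetical helper; the actual fix lives in AxisInfo.cpp and may differ):

```python
INT32_MAX = (1 << 31) - 1

def pow2_saturating(log2_val):
    """Return 2**log2_val, saturating rather than overflowing a signed 32-bit value."""
    # In C++, `1 << log2_val` on an int32 is undefined behavior once
    # log2_val >= 31; widening to 64 bits or saturating avoids it.
    if log2_val >= 31:
        return INT32_MAX
    return 1 << log2_val
```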

[Blackwell][Clean up] Remove use of SharedMemoryObject on TMEM (#5817)

Using `getSharedMemoryObjectFromStruct` etc on TMEM is very confusing
for new readers, so I'm introducing simpler alternatives for TMEM. In
practice, we only need the base address of TMEM.

@ThomasRaoux

---------

Co-authored-by: Masahiro Masuda <[email protected]>

[AMD] Support lowering GPU async copy/commit/wait ops (#5729)

Support lowering of `ttg.async_copy_global_to_local` for `gfx9` GPUs,
along with the associated `ttg.async_commit_group` and `ttg.async_wait`.
The lowering checks that the resulting writes are coalesced, which is a
hardware requirement.

Note that we are currently not emitting `AsyncCopyGlobalToLocal` for
AMD targets; this will come in a follow-up PR.
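The coalescing requirement amounts to consecutive lanes writing addresses that are contiguous. A simplified model of that condition (hypothetical helper, not the pass's actual check):

```python
def writes_are_coalesced(lane_addresses, elem_bytes):
    """True if consecutive lanes write consecutive elem_bytes-sized slots."""
    # Each neighboring pair of lanes must differ by exactly one element.
    return all(b - a == elem_bytes
               for a, b in zip(lane_addresses, lane_addresses[1:]))
```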

[AMD] Remove duplicate definition of use_buffer_ops (#5828)

Removes a duplicate definition of `use_buffer_ops` when loading the
`AMDGCN_USE_BUFFER_OPS` environment variable, likely introduced by a
merge conflict.

Co-authored-by: Nick Riasanovsky <[email protected]>

[PIPELINE] Relax requirements for wgmma operand register pipelining (#5810)

Always pipeline wgmma operands in registers if they are not suitable for
shmem pipelining.
This introduces a performance regression on attention workloads, but
fixes a crash introduced by
#5798.

[Blackwell][TUTORIALS] Add tutorial 10-block-scaled-matmul.py (#5813)

This tutorial demos Triton support for block-scaled matrix multiply on
Blackwell's 5th-generation tensor core with low-precision FP4 and FP8
datatypes. Planned follow-ups include optimized TMA loads for block
scale factors, and mixed-precision support.

Additional changes
* Moves MX dtype helper classes to triton/tools/mxfp.py for use in
tutorials as well as test code.

@ThomasRaoux @pawelszczerbuk @masahi @mbrookhart @binarybana

Revert "[AMD] Use more efficient fp32 to bf16 type conversion (#5633)" (#5829)

This reverts commit 1c28e08 because the
inline assembly was causing issues for the LLVM backend.

[TensorDescriptor] Support ndim > 2 loads and stores (#5830)

This adds support for 3-, 4-, and 5-dimensional tensor descriptors, with
tests that load and store work correctly. It was surprisingly easy, as
most of the code was already written generically; only a few changes
were needed to the TMA creation lowering, plus a bug fix for the LL
conversion of NVMMA encoding with >2 dims.

[AMD] Avoid unneeded instruction schedule pass runs (#5832)

We only want to run them if an active variant is requested.

[PIPELINE] Including mmav5 dot in shmem pipelining (#5835)

There is a bug introduced in [PIPELINE] Relax requirements for wgmma
operand register pipelining (#5810):
I was missing mmav5 dot support. Thanks to @peterbell10 for catching
that and to @masahi for proposing a fix.

Disable dot transpose optimization due to perf regressions (#5834)

There are cases where transposing the dot op may cause performance
problems. For example, if the accumulator is being used in something
like:
```
 %206 = ttg.convert_layout %205 : tensor<128x64xf16, #mma> -> tensor<128x64xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> loc(#loc216)
```
transposing it makes the conversion expensive. This is likely to happen
in attention kernels.

[AMD] Attach variant to the scheduling hint op (#5808)

This PR refactors the instruction scheduling infrastructure. A
particular "sched" variant becomes part of the instruction and is added
during the insertion pass. The instruction carries this
meta-information across many passes, allowing the same mechanism to be
re-used in other places.

doc: update readme with dev container info (#5819)

Add a reference to https://github.com/redhat-et/triton-dev-containers
which contains a set of Development Containers for Triton. Development
containers have the following Key benefits:

- **Consistency**: All developers can work with the same development
environment, ensuring uniform behavior across different systems.
- **Isolation**: The container prevents potential conflicts with
software installed on your local machine.
- **Portability**: Easily share the development environment with team
members, minimizing onboarding time and setup issues.

<!---
The core Triton is a small number of people, and we receive many PRs
(thank
you!).  To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the
following
tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]`
with
`[x]` to indicate you have done them.
-->

- [ ] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
- [x] This PR does not need a test because `It's just a documentation
change`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

Signed-off-by: Maryam Tahhan <[email protected]>

[NFC] Finish clean-up of #5834 (#5837)

That partial revert left a couple of auxiliary functions that are not
currently needed.

[AMD] Avoiding undefined behaviour due to llvm_unreachable (#5822)

It is intended to always abort, but with assertions disabled it is
undefined behavior and can do something else. For example, this test
sometimes deadlocks on ARM, resulting in increased flakiness.

This pattern has also been used elsewhere in the file, so this makes
them consistent.

[AMD] Enable block pingpong for smaller tiles (#5820)

Recent experiments found that block pingpong also helps a few more
configs, especially smaller tiles.
Enable one-cluster pingpong for tiles that are four times smaller.

[Pipeliner] Fix crash in rewriting TMA descriptor updates (#5843)

Lots of our code assumes that `scf.if` has a non-empty else region, but
sometimes it can be empty, which typically happens due to one of the
`scf.if` canonicalizers. Just make sure to create `scf.if` with
non-empty regions.

This was split off from #5726
since others were hitting the crash.

[Pipeliner] Fix mmav3 pipelining (#5844)

Make sure we allocate the right number of slices when doing mmav3
pipelining.

[BACKEND] bump to llvm/llvm-project@ffe3129e9bdc (#5814)

Pulls in llvm/llvm-project#125268 which is
necessary for #5563.

[AMD][BACKEND] Enable cache ctrl bits for gfx950 (#5838)

Enables cache control bits based on Triton's `cacheModifier` on gfx950.
They are identical to those on `gfx942`.

[DOC] Fix generating docs for tutorials (#5850)

Do not `exit` when generating tutorials, which would cause the whole
documentation pipeline to exit.

[Tests] Re-enable tests and clean up test_matmul (#5853)

Reduce unnecessary configs, re-enable a test, and simplify simple_matmul.

[Blackwell][Clean up] Introduce interface for MMAv5 ops (#5848)

The goal is to let `TCGen5MMAOp` and `TCGen5MMAScaledOp` share an
interface so that the rest of code can work generically with them. The
MMA pipelining pass gets cleaned up a lot, and the accum init flag
optimization is now automatically enabled for `TCGen5MMAScaledOp` as
well.

---------

Co-authored-by: Masahiro Masuda <[email protected]>

[NFC] Remove custom backend callback and move it to TargetInfo (#5854)

[LAYOUTS] Implement generically getElemsPerThread (#5841)

While doing so, we remove the SliceEncodingAttr hack!

[Pipeliner] Enable automatic loop fusion (#5726)

This PR turns on automatic loop fusion in the CUDA >= 8.0 pass
pipelines. Automatic loop fusion is only enabled for simple loop nests
(1 outer loop, 1 inner loop), when the user requests fusion with
`tl.range(..., fuse=True)` in the frontend.

This PR also rewrites the persistent matmul examples to use loop nests.
This is cleaner, but will also enable more powerful and flexible
optimizations of loop nests in the future. Primarily, it hides the
brittleness of the pipeliner behind a single layer inside the compiler,
so ideally the brittleness needs to be dealt with only once and hidden
from users.

To achieve this, several things have been added to loop fusion:

1. To avoid generating the inner loop inside a conditional, loop nest
fusion will "speculate" the length of the inner loop, essentially
generating a branch where the inner loop is missing and one where the
inner loop is always known to execute at least once.
2. Codegen of the loop induction variables has been slightly altered to
better match the expectations of the scheduler, pipeliner(s), and
`optimize-accumulator-init`.
3. Codegen of loop iter args has been altered to generate fewer SSA
dependencies between the prologue, inner loop, and epilogue, making it
more likely for pipelining to be successful. E.g., inner loop iter args
that can be initialized outside the loop and reset in the epilogue are
handled that way, rather than in the prologue.
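
The fused execution order can be sketched in plain Python (a toy model of the transformation, not Triton's implementation; the function and callback names are made up). A 2-level nest is flattened into a single loop, with a separate "speculated" branch for the case where the inner loop never executes, so the inner body does not sit inside a conditional in the hot loop:

```python
def fused_loop_nest(outer_n, inner_n, prologue, inner_body, epilogue):
    """Run the nest `for i: prologue; for j: inner_body; epilogue`
    as one flattened loop of outer_n * inner_n iterations."""
    if inner_n <= 0:
        # Speculated branch: the inner loop never runs, so execute the
        # nest without the fused loop at all.
        for i in range(outer_n):
            prologue(i)
            epilogue(i)
        return
    # Speculated branch: inner_n >= 1, so every outer iteration runs the
    # inner body at least once and the nest can be flattened.
    for t in range(outer_n * inner_n):
        i, j = divmod(t, inner_n)
        if j == 0:
            prologue(i)           # first inner iteration: run the prologue
        inner_body(i, j)
        if j == inner_n - 1:
            epilogue(i)           # last inner iteration: run the epilogue

# Example: log the execution order of a 2x3 nest.
trace = []
fused_loop_nest(
    2, 3,
    prologue=lambda i: trace.append(f"pro{i}"),
    inner_body=lambda i, j: trace.append(f"body{i}{j}"),
    epilogue=lambda i: trace.append(f"epi{i}"),
)
print(trace)
```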

Some other things in this PR:

* Fixed a bug in the pipeline expander
* Added AxisInfo implementation for `ub::PoisonOp`

I verified the performance of the rewritten persistent matmul kernels on
H100 and Blackwell.

Performance of `09-persistent-matmul.py` on H100.

Before (2 runs)

```
root@dev-0:~/code/triton$ python python/tutorials/09-persistent-matmul.py
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
273.146 4025.362 ROOT
├─ nan 0.031 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.027 _ZN2at6native54_GLOBAL__N__a236ace4_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfLm4EPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SO_SH_EEvSJ_SK_RKSL_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SK_
├─ 283.506 2666.310 cublas [M=8192, N=8192, K=512]
│  └─ nan 2666.310 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
├─ 223.326 307.709 matmul_kernel [M=8192, N=8192, K=512]
├─ 259.293 265.027 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 238.500 288.133 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 258.738 265.594 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 295.529 232.531 torch [M=8192, N=8192, K=512]
   └─ nan 232.531 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas

Legend (Metric: tflop16/s (inc) Min: 223.33 Max: 295.53)
█ 288.31 - 295.53
█ 273.87 - 288.31
█ 259.43 - 273.87
█ 244.99 - 259.43
█ 230.55 - 244.99
█ 223.33 - 230.55

name User code    ◀  Only in left graph    ▶  Only in right graph

root@dev-0:~/code/triton$ python python/tutorials/09-persistent-matmul.py
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
273.367 4022.105 ROOT
├─ nan 0.031 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.027 _ZN2at6native54_GLOBAL__N__a236ace4_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfLm4EPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SO_SH_EEvSJ_SK_RKSL_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SK_
├─ 284.284 2659.011 cublas [M=8192, N=8192, K=512]
│  └─ nan 2659.011 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
├─ 221.823 309.795 matmul_kernel [M=8192, N=8192, K=512]
├─ 254.755 269.748 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 240.774 285.411 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 259.109 265.214 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 295.100 232.868 torch [M=8192, N=8192, K=512]
   └─ nan 232.868 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas

Legend (Metric: tflop16/s (inc) Min: 221.82 Max: 295.10)
█ 287.77 - 295.10
█ 273.12 - 287.77
█ 258.46 - 273.12
█ 243.81 - 258.46
█ 229.15 - 243.81
█ 221.82 - 229.15

name User code    ◀  Only in left graph    ▶  Only in right graph

```

After:

```
root@dev-0:~/code/triton$ python python/tutorials/09-persistent-matmul.py
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
274.040 4012.227 ROOT
├─ nan 0.031 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.027 _ZN2at6native54_GLOBAL__N__a236ace4_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfLm4EPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SO_SH_EEvSJ_SK_RKSL_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SK_
├─ 285.369 2648.904 cublas [M=8192, N=8192, K=512]
│  └─ nan 2648.904 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
├─ 217.548 315.881 matmul_kernel [M=8192, N=8192, K=512]
├─ 262.312 261.976 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 244.740 280.785 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 255.113 269.368 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 292.108 235.253 torch [M=8192, N=8192, K=512]
   └─ nan 235.253 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas

Legend (Metric: tflop16/s (inc) Min: 217.55 Max: 292.11)
█ 284.65 - 292.11
█ 269.74 - 284.65
█ 254.83 - 269.74
█ 239.92 - 254.83
█ 225.00 - 239.92
█ 217.55 - 225.00

name User code    ◀  Only in left graph    ▶  Only in right graph

root@dev-0:~/code/triton$ python python/tutorials/09-persistent-matmul.py
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
274.997 3998.267 ROOT
├─ nan 0.031 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.027 _ZN2at6native54_GLOBAL__N__a236ace4_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfLm4EPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SO_SH_EEvSJ_SK_RKSL_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SK_
├─ 285.498 2647.706 cublas [M=8192, N=8192, K=512]
│  └─ nan 2647.706 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
├─ 217.884 315.394 matmul_kernel [M=8192, N=8192, K=512]
├─ 262.534 261.755 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 246.617 278.649 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 262.525 261.764 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 295.007 232.942 torch [M=8192, N=8192, K=512]
   └─ nan 232.942 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas

Legend (Metric: tflop16/s (inc) Min: 217.88 Max: 295.01)
█ 287.29 - 295.01
█ 271.87 - 287.29
█ 256.45 - 271.87
█ 241.02 - 256.45
█ 225.60 - 241.02
█ 217.88 - 225.60

name User code    ◀  Only in left graph    ▶  Only in right graph

```

[Blackwell] Hoist constant TMem allocation out of the loop (#5857)

When the TMem contents are constant, hoisting the allocation out of the
loop avoids having to store the value multiple times.
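
The effect can be sketched in plain Python (a hypothetical illustration of loop-invariant hoisting, not the actual pass; the buffer here stands in for the TMem allocation):

```python
def before(n, k, const_val):
    """Allocate and fill the buffer on every iteration."""
    stores = 0
    for _ in range(n):
        buf = [const_val] * k    # allocated and stored each iteration
        stores += k
        # ... use buf ...
    return stores

def after(n, k, const_val):
    """Hoisted: allocate and fill the constant buffer once."""
    buf = [const_val] * k        # stored once, outside the loop
    stores = k
    for _ in range(n):
        pass                     # ... use buf ...
    return stores

print(before(8, 4, 1.0), after(8, 4, 1.0))  # → 32 4
```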

[AMD] Use LLVM ops for fp16<->fp32 casts (#5859)

Inline assembly can block optimizations in the LLVM backend.

[BW][PIPELINE] Add an option to tl.range to disallow accumulator multi-buffering (#5858)

Rework mmav5 pipelining to allow pipelining the MMA even when
multi-buffering of the accumulator is impossible, by putting the uses in
the same stage as the MMA and blocking on a wait until the current MMA
finishes. Based on this support, introduce a new flag to `tl.range` that
controls whether multi-buffering of the accumulators of the dots in the
loop is allowed. Without this rework of mmav5 pipelining, we would
simply not pipeline cases where multi-buffering is disallowed.
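
The trade-off can be modeled with a toy timeline in plain Python (an illustration only; the names and the stall model are made up, not Triton internals): with two accumulator buffers, the next MMA can be issued while an earlier result is still unread, whereas a single buffer forces a wait for the pending use before each reuse of the slot.

```python
def timeline(num_iters, num_acc_buffers):
    """Return (events, stalls): an MMA issue stalls while every
    accumulator buffer still holds an unread result."""
    busy = []                        # iterations whose result is unread
    events, stalls = [], 0
    for i in range(num_iters):
        while len(busy) >= num_acc_buffers:
            oldest = busy.pop(0)     # stall: wait for the oldest use
            events.append(f"use{oldest}")
            stalls += 1
        events.append(f"mma{i}")
        busy.append(i)
    for i in busy:                   # drain the remaining uses
        events.append(f"use{i}")
    return events, stalls

multi, stalls_multi = timeline(4, 2)    # multi-buffered accumulator
single, stalls_single = timeline(4, 1)  # multi-buffering disallowed
print(stalls_multi, stalls_single)      # → 2 3
```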

[AMD] NFC: Refactor DotOpMFMAConversionHelper (#5862)

This PR refactors `DotOpMFMAConversionHelper` by extracting utility
functions from `convertDot`, making it easier to extend in
#5845.

[AMD][CI] Switch to rocm hosted docker image (#5855)

The CI docker image has been moved from `rocmshared/pytorch` to
`rocm/pytorch` so we need to adjust our workflow.

[triton][NFC] Refactor build function of ScanOp (#5861)

Use ScanOp's own build function instead of reusing ReduceOp's.


- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)
…ed (#5812)

This PR enables MMA pipelining for scaled dot.

The main difficulty this PR overcomes is the dependency cycle between
TMEM copy rewriting and SWP - currently TMEM copy rewriting relies on
SWP to put loading of scales into SMEM, while to apply MMA pipelining
during SWP, TMEM copy rewriting needs to have happened beforehand. I
propose to break the cycle by having loading of scales go through
`local_alloc` and `local_load` in `AccelerateMatmul`. This way, TMEM
copy rewriting happens during [the first call to
OptimizedDotOperands,](https://github.com/triton-lang/triton/blob/1e0e51c4aeb3e1beea000da5d0e494f8b9ac40dd/third_party/nvidia/backend/compiler.py#L260)
before SWP. And the local alloc and load added in `AccelerateMatmul` are
eliminated during SWP. It's a bit ad hoc to add local alloc for scales
there, since scales do not need to be in SMEM. But other solutions, like
decoupling MMA pipelining from SWP, is more difficult.

The other changes in this PR are for making SWP recognize loading of
scales when there is TMEM copy between scale load and MMA.

@ThomasRaoux @pawelszczerbuk @csullivan @mbrookhart @binarybana

---------

Co-authored-by: Masahiro Masuda <[email protected]>
Co-authored-by: Jason Knight <[email protected]>
@Jokeren Jokeren force-pushed the keren/proton-dev-gpu branch from 04b6c94 to f71c805 Compare February 10, 2025 16:33
pawelszczerbuk and others added 2 commits February 10, 2025 11:33
…5810)

Always pipeline wgmma operands in registers if they are not suitable
for shmem pipelining.
This introduces a performance regression on attention workloads, but
fixes a crash introduced by
#5798.
Make sure we allocate the right number of slices when doing mmav3
pipelining.
@Jokeren Jokeren closed this Feb 10, 2025
@Jokeren Jokeren deleted the keren/proton-dev-gpu branch February 10, 2025 16:45