[Blackwell] Enable MMA pipelining for scaled dot when TMEM copy is used (#5812)

This PR enables MMA pipelining for scaled dot.

The main difficulty this PR overcomes is the dependency cycle between
TMEM copy rewriting and SWP: currently, TMEM copy rewriting relies on
SWP to put the loading of scales into SMEM, while applying MMA pipelining
during SWP requires TMEM copy rewriting to have happened beforehand. I
propose to break the cycle by having the loading of scales go through
`local_alloc` and `local_load` in `AccelerateMatmul`. This way, TMEM
copy rewriting happens during [the first call to
OptimizeDotOperands](https://github.com/triton-lang/triton/blob/1e0e51c4aeb3e1beea000da5d0e494f8b9ac40dd/third_party/nvidia/backend/compiler.py#L260),
before SWP, and the local alloc and load added in `AccelerateMatmul` are
eliminated during SWP. Adding a local alloc for scales there is a bit ad
hoc, since scales do not need to be in SMEM, but other solutions, such as
decoupling MMA pipelining from SWP, are more difficult.

The other changes in this PR make SWP recognize the loading of scales
when there is a TMEM copy between the scale load and the MMA.
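For context, here is a minimal sketch of the kind of scaled-dot loop this pipelining targets. The kernel, the pointer arithmetic, and the exact `tl.dot_scaled` argument order are illustrative assumptions, not code from this PR:

```python
import triton
import triton.language as tl

@triton.jit
def scaled_dot_kernel(a_ptr, a_scale_ptr, b_ptr, b_scale_ptr, c_ptr,
                      M, N, K,
                      BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                      BLOCK_K: tl.constexpr):
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    offs_ks = tl.arange(0, BLOCK_K // 32)      # one scale per 32 elements of K
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK_K)):
        a = tl.load(a_ptr + offs_m[:, None] * K + (k * BLOCK_K + offs_k)[None, :])
        b = tl.load(b_ptr + (k * BLOCK_K + offs_k)[:, None] * N + offs_n[None, :])
        # These scale loads are what SWP must now recognize and pipeline, even
        # though a TMEM copy sits between them and the MMA.
        a_scale = tl.load(a_scale_ptr + offs_m[:, None] * (K // 32)
                          + (k * (BLOCK_K // 32) + offs_ks)[None, :])
        b_scale = tl.load(b_scale_ptr + offs_n[:, None] * (K // 32)
                          + (k * (BLOCK_K // 32) + offs_ks)[None, :])
        acc = tl.dot_scaled(a, a_scale, "e4m3", b, b_scale, "e4m3", acc)
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], acc.to(tl.float16))
```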

@ThomasRaoux @pawelszczerbuk @csullivan @mbrookhart @binarybana

---------

Co-authored-by: Masahiro Masuda <[email protected]>
Co-authored-by: Jason Knight <[email protected]>

Fix default num_stages values mismatch between Python frontend and MLIR. (#5804)

Address the discrepancy between the default value of num_stages in the
Python frontend and the MLIR implementation. Previously, the values were
not aligned, which could cause inconsistencies in behavior across Python
Frontend and triton-opt.

This PR updates the default value of `num_stages` in the Python frontend
to match the value used by the MLIR passes:
- [TritonGPUPipeline](https://github.com/triton-lang/triton/blob/032fa41a45847cdc00119ed3bdd5bc0adab9c938/include/triton/Dialect/TritonGPU/Transforms/Passes.td#L21)
- [TritonGPUTestPipelineAssignLatencies](https://github.com/triton-lang/triton/blob/032fa41a45847cdc00119ed3bdd5bc0adab9c938/include/triton/Dialect/TritonGPU/Transforms/Passes.td#L58)
- [TritonGPULoopScheduling](https://github.com/triton-lang/triton/blob/032fa41a45847cdc00119ed3bdd5bc0adab9c938/include/triton/Dialect/TritonGPU/Transforms/Passes.td#L266)
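For illustration, a minimal sketch of how the default interacts with an explicit override at launch time (the kernel and launch parameters here are hypothetical):

```python
import triton
import triton.language as tl

@triton.jit
def copy_kernel(src_ptr, dst_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(dst_ptr + offs, tl.load(src_ptr + offs, mask=mask), mask=mask)

# When num_stages is not given, the frontend default is forwarded to the
# software pipeliner; this PR aligns that default with the MLIR passes above.
# copy_kernel[(grid,)](src, dst, n, BLOCK=1024)                 # frontend default
# copy_kernel[(grid,)](src, dst, n, BLOCK=1024, num_stages=3)   # explicit value
```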


Co-authored-by: pawelszczerbuk <[email protected]>

Bump actions/checkout from 3 to 4 (#5714)

Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to
4.

[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=actions/checkout&package-manager=github_actions&previous-version=3&new-version=4)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.


---


Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[mlir][dialect] Refactor DotLike trait into a DotOpInterface + Enable verification of scaled_dot (#5796)

The Triton MLIR dialect presently has a `DotLike` trait which dot ops
carry.

The problem with this trait is that, as currently implemented, it prevents
`scaled_dot` from being verified properly (dimensions are not checked at
the moment: "TODO: enable back with an interface to support scaled dot.").

This PR refactors the "DotLike" trait into an interface which implements
a "verifyDims" function that checks if the dims for the A and B operands
are compatible (e.g., something like MxK1 and K2xN; k1==k2; in the
simple case).

The initial implementation of `DotOpInterface` is similar to the prior
`DotLike` trait, except that it includes the `verifyDims` function which
all dot ops must implement; this function just checks whether the
dimensions of the A and B inputs match.
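As a rough sketch, the dimension check behaves roughly like the following Python pseudocode (this mirrors the intent of the C++ verifier, not its exact implementation):

```python
def verify_dot_dims(a_shape, b_shape, c_shape) -> bool:
    """Check A (MxK or BxMxK) against B (KxN or BxKxN) and the result C."""
    rank = len(a_shape)
    if rank not in (2, 3) or len(b_shape) != rank or len(c_shape) != rank:
        return False
    if rank == 3 and not (a_shape[0] == b_shape[0] == c_shape[0]):
        return False                     # batch dimensions must agree
    m, k1 = a_shape[-2], a_shape[-1]
    k2, n = b_shape[-2], b_shape[-1]
    # tt.dot requires K1 == K2; tt.dot_scaled can adjust the comparison for
    # packed low-precision formats (e.g. fp4 packs two values per element).
    return k1 == k2 and c_shape[-2] == m and c_shape[-1] == n
```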

In the future this interface can be extended to include more
functionality.

Since this enables the verifier for `scaled_dot`, I think the existing
scaled-dot tests should cover the changes made in this PR; if that turns
out to be wrong, I will add additional tests.


[ANALYSIS] Fixing overflow in AxisInfo.cpp. (#5821)

UBSan detected that the shift is performed on an int32 variable, while the
shift amount can be larger than the width of int32.
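A rough illustration of the failure mode, as a Python model of a fixed-width shift (the actual code is C++; the names here are illustrative):

```python
def shl(value: int, amount: int, width: int = 32) -> int:
    """Model a fixed-width left shift. In C++, shifting an int32 by 32 or
    more bits is undefined behaviour, which is what UBSan flagged."""
    assert 0 <= amount < width, f"shift amount {amount} exceeds operand width {width}"
    return (value << amount) & ((1 << width) - 1)

# shl(1, 40)             -> assertion error: invalid on a 32-bit operand
# shl(1, 40, width=64)   -> 1099511627776: widening the operand makes it valid
```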

I believe the change is too trivial to require a test, but please advise
if you think there is an appropriate way to capture this.

[Blackwell][Clean up] Remove use of SharedMemoryObject on TMEM (#5817)

Using `getSharedMemoryObjectFromStruct` and related helpers on TMEM is very
confusing for new readers, so I'm introducing simpler TMEM-specific
alternatives. In practice, we only need the base address of TMEM.

@ThomasRaoux

---------

Co-authored-by: Masahiro Masuda <[email protected]>

[AMD] Support lowering GPU async copy/commit/wait ops (#5729)

Support lowering of `ttg.async_copy_global_to_local`, together with the
associated `ttg.async_commit_group` and `ttg.async_wait`, for `gfx9` GPUs.
The lowering checks that the resulting writes are coalesced, which the
hardware requires.

Note that we are currently not emitting `AsyncCopyGlobalToLocal` for
AMD targets; this will come in a follow-up PR.

[AMD] Remove duplicate definition of use_buffer_ops (#5828)

Removes a duplicate definition of `use_buffer_ops` when loading the
`AMDGCN_USE_BUFFER_OPS` environment variable, likely introduced by a merge
conflict.

Co-authored-by: Nick Riasanovsky <[email protected]>

[PIPELINE] Relax requirements for wgmma operand register pipelining (#5810)

Always pipeline wgmma operands in registers if they are not suitable for
shmem pipelining.
This introduces a performance regression on attention workloads, but
fixes a crash introduced by
#5798.

[Blackwell][TUTORIALS] Add tutorial 10-block-scaled-matmul.py (#5813)

This tutorial demonstrates Triton support for block-scaled matrix multiply
on Blackwell's 5th-generation tensor core with low-precision FP4 and FP8
datatypes. Planned follow-ups include optimized TMA loads for block scale
factors, and mixed-precision support.

Additional changes:
* Moves the MX dtype helper classes to `triton/tools/mxfp.py` for use in
tutorials as well as test code.

@ThomasRaoux @pawelszczerbuk @masahi @mbrookhart @binarybana

Revert "[AMD] Use more efficient fp32 to bf16 type conversion (#5633)" (#5829)

This reverts commit 1c28e08 because the inline assembly was causing issues
for the LLVM backend.

[TensorDescriptor] Support ndim > 2 loads and stores (#5830)

This adds support for 3-, 4-, and 5-dimensional tensor descriptors, with
tests that loads and stores work correctly. It was surprisingly easy, as
most of the code was already written generically; only a few changes were
needed in the TMA creation lowering, plus a bug fix for the LL conversion
of the NVMMA encoding with more than 2 dims.
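As a sketch of what this enables, a hypothetical 3D block copy through descriptors (the `tl._experimental_make_tensor_descriptor` name and its exact signature are assumptions based on the experimental descriptor API, not taken from this PR):

```python
import triton
import triton.language as tl

@triton.jit
def copy3d_kernel(src_ptr, dst_ptr, D0, D1, D2,
                  B0: tl.constexpr, B1: tl.constexpr, B2: tl.constexpr):
    # Descriptors may now carry more than two dimensions.
    src = tl._experimental_make_tensor_descriptor(
        src_ptr, shape=[D0, D1, D2], strides=[D1 * D2, D2, 1],
        block_shape=[B0, B1, B2])
    dst = tl._experimental_make_tensor_descriptor(
        dst_ptr, shape=[D0, D1, D2], strides=[D1 * D2, D2, 1],
        block_shape=[B0, B1, B2])
    i = tl.program_id(0) * B0
    block = src.load([i, 0, 0])    # 3D block load
    dst.store([i, 0, 0], block)    # 3D block store
```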

[AMD] Avoid unneeded instruction schedule pass runs (#5832)

We only want to run them if an active variant is requested.

[PIPELINE] Including mmav5 dot in shmem pipelining (#5835)

This fixes a bug introduced in "[PIPELINE] Relax requirements for wgmma
operand register pipelining" (#5810): mmav5 dot support was missing.
Thanks to @peterbell10 for catching that and to @masahi for proposing a
fix.

Disable dot transpose optimization due to perf regressions (#5834)

There are cases where transposing the dot op may cause performance
problems. For example, if the accumulator is being used in something
like:
```
 %206 = ttg.convert_layout %205 : tensor<128x64xf16, #mma> -> tensor<128x64xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> loc(#loc216)
```
transposing it makes the conversion expensive. This is likely to happen
in attention kernels.

[AMD] Attach variant to the scheduling hint op (#5808)

This PR refactors the implementation of the instruction scheduling
infrastructure. A particular "sched" variant becomes part of the
instruction and gets added during the insertion pass. The instruction
carries this meta-information across many passes, allowing the same
mechanism to be reused in other places.

doc: update readme with dev container info (#5819)

Add a reference to https://github.com/redhat-et/triton-dev-containers,
which contains a set of development containers for Triton. Development
containers have the following key benefits:

- **Consistency**: All developers can work with the same development
environment, ensuring uniform behavior across different systems.
- **Isolation**: The container prevents potential conflicts with
software installed on your local machine.
- **Portability**: Easily share the development environment with team
members, minimizing onboarding time and setup issues.


Signed-off-by: Maryam Tahhan <[email protected]>

[NFC] Finish clean-up of #5834 (#5837)

That partial revert left a couple of auxiliary functions that are not
currently needed.

[AMD] Avoiding undefined behaviour due to llvm_unreachable (#5822)

`llvm_unreachable` is intended to always abort, but sometimes does
something else. For example, this test sometimes deadlocks on ARM,
resulting in increased flakiness.

The same pattern is already used elsewhere in the file, so this change
makes them consistent.

[AMD] Enable block pingpong for smaller tiles (#5820)

Recent experiments found that block pingpong also helps a few more
configs, especially smaller tiles. Enable one-cluster pingpong for tiles
that are 4x smaller.

[Pipeliner] Fix crash in rewriting TMA descriptor updates (#5843)

Lots of our code assumes that `scf.if` has a non-empty else region, but
sometimes it can be empty, which typically happens due to one of the
`scf.if` canonicalizers. Just make sure to create `scf.if` with
non-empty regions.

This was split off from #5726
since others were hitting the crash.

[Pipeliner] Fix mmav3 pipelining (#5844)

Make sure we allocate the right number of slices when doing mmav3
pipelining.

[BACKEND] bump to llvm/llvm-project@ffe3129e9bdc (#5814)

Pulls in llvm/llvm-project#125268 which is
necessary for #5563.

[AMD][BACKEND] Enable cache ctrl bits for gfx950 (#5838)

Enables cache control bits based on Triton's `cacheModifier` on gfx950.
They are identical to those on `gfx942`.

[DOC] Fix generating docs for tutorials (#5850)

Do not `exit` when generating tutorials, which would cause the whole
documentation pipeline to exit.

[Tests] Re-enable tests and clean up test_matmul (#5853)

Reduce unnecessary configs, re-enable a test, and simplify `simple_matmul`.

[Blackwell][Clean up] Introduce interface for MMAv5 ops (#5848)

The goal is to let `TCGen5MMAOp` and `TCGen5MMAScaledOp` share an
interface so that the rest of the code can work with them generically.
The MMA pipelining pass gets cleaned up a lot, and the accum init flag
optimization is now automatically enabled for `TCGen5MMAScaledOp` as
well.

---------

Co-authored-by: Masahiro Masuda <[email protected]>

[NFC] Remove custom backend callback and move it to TargetInfo (#5854)

[LAYOUTS] Implement generically getElemsPerThread (#5841)

While doing so, we remove the SliceEncodingAttr hack!

[Pipeliner] Enable automatic loop fusion (#5726)

This PR turns on automatic loop fusion in the pass pipelines for CUDA
compute capability >= 8.0. Automatic loop fusion is only enabled for
simple loop nests (1 outer loop, 1 inner loop), and only when the user
requests fusion with `tl.range(..., fuse=True)` in the frontend, as in
the sketch below.
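A minimal sketch of such a loop nest (the kernel body is simplified and masking is omitted; only the `fuse=True` request itself comes from this PR):

```python
import triton
import triton.language as tl

@triton.jit
def persistent_matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                             BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                             BLOCK_K: tl.constexpr, NUM_SMS: tl.constexpr):
    start = tl.program_id(0)
    num_tiles = tl.cdiv(M, BLOCK_M) * tl.cdiv(N, BLOCK_N)
    # One outer loop over output tiles; fuse=True asks the compiler to fuse
    # the nest so that the inner K-loop can be software-pipelined.
    for tile in tl.range(start, num_tiles, NUM_SMS, fuse=True):
        pid_m = tile // tl.cdiv(N, BLOCK_N)
        pid_n = tile % tl.cdiv(N, BLOCK_N)
        offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
        offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
        acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
        for k in range(0, K, BLOCK_K):            # one inner loop
            offs_k = k + tl.arange(0, BLOCK_K)
            a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])
            b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])
            acc = tl.dot(a, b, acc)
        c = acc.to(tl.float16)
        tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], c)
```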

This PR also rewrites the persistent matmul examples to use loop nests.
This is cleaner, but will also enable more powerful and flexible
optimizations of loop nests in the future. Primarily, it hides the
brittleness of the pipeliner behind a single layer inside the compiler,
so ideally the brittleness needs to be dealt with only once and stays
hidden from users.

To achieve this, several things have been added to loop fusion:

1. To avoid generating the inner loop inside a conditional, loop nest
fusion will "speculate" the length of the inner loop, essentially
generating a branch where the inner loop is missing and one where the
inner loop is always known to execute at least once.
2. Codegen of the loop induction variables has been slightly altered to
better match the expectations of the scheduler, pipeliner(s), and
`optimize-accumulator-init`.
3. Codegen of loop iter args has been altered to generate fewer SSA
dependencies between the prologue, inner loop, and epilogue, making it
more likely for pipelining to be successful. E.g., inner loop iter args
that can be initialized outside the loop and reset in the epilogue are
handled that way, rather than in the prologue.

Some other things in this PR:

* Fixed a bug in the pipeline expander
* Added AxisInfo implementation for `ub::PoisonOp`

I verified the performance of the rewritten persistent matmul kernels on
H100 and Blackwell.

Performance of `09-persistent-matmul.py` on H100.

Before (2 runs)

```
root@dev-0:~/code/triton$ python python/tutorials/09-persistent-matmul.py
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
273.146 4025.362 ROOT
├─ nan 0.031 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.027 _ZN2at6native54_GLOBAL__N__a236ace4_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfLm4EPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SO_SH_EEvSJ_SK_RKSL_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SK_
├─ 283.506 2666.310 cublas [M=8192, N=8192, K=512]
│  └─ nan 2666.310 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
├─ 223.326 307.709 matmul_kernel [M=8192, N=8192, K=512]
├─ 259.293 265.027 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 238.500 288.133 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 258.738 265.594 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 295.529 232.531 torch [M=8192, N=8192, K=512]
   └─ nan 232.531 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas

Legend (Metric: tflop16/s (inc) Min: 223.33 Max: 295.53)
█ 288.31 - 295.53
█ 273.87 - 288.31
█ 259.43 - 273.87
█ 244.99 - 259.43
█ 230.55 - 244.99
█ 223.33 - 230.55

name User code    ◀  Only in left graph    ▶  Only in right graph

root@dev-0:~/code/triton$ python python/tutorials/09-persistent-matmul.py
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
273.367 4022.105 ROOT
├─ nan 0.031 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.027 _ZN2at6native54_GLOBAL__N__a236ace4_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfLm4EPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SO_SH_EEvSJ_SK_RKSL_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SK_
├─ 284.284 2659.011 cublas [M=8192, N=8192, K=512]
│  └─ nan 2659.011 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
├─ 221.823 309.795 matmul_kernel [M=8192, N=8192, K=512]
├─ 254.755 269.748 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 240.774 285.411 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 259.109 265.214 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 295.100 232.868 torch [M=8192, N=8192, K=512]
   └─ nan 232.868 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas

Legend (Metric: tflop16/s (inc) Min: 221.82 Max: 295.10)
█ 287.77 - 295.10
█ 273.12 - 287.77
█ 258.46 - 273.12
█ 243.81 - 258.46
█ 229.15 - 243.81
█ 221.82 - 229.15

name User code    ◀  Only in left graph    ▶  Only in right graph

```

After:

```
root@dev-0:~/code/triton$ python python/tutorials/09-persistent-matmul.py
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
274.040 4012.227 ROOT
├─ nan 0.031 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.027 _ZN2at6native54_GLOBAL__N__a236ace4_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfLm4EPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SO_SH_EEvSJ_SK_RKSL_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SK_
├─ 285.369 2648.904 cublas [M=8192, N=8192, K=512]
│  └─ nan 2648.904 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
├─ 217.548 315.881 matmul_kernel [M=8192, N=8192, K=512]
├─ 262.312 261.976 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 244.740 280.785 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 255.113 269.368 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 292.108 235.253 torch [M=8192, N=8192, K=512]
   └─ nan 235.253 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas

Legend (Metric: tflop16/s (inc) Min: 217.55 Max: 292.11)
█ 284.65 - 292.11
█ 269.74 - 284.65
█ 254.83 - 269.74
█ 239.92 - 254.83
█ 225.00 - 239.92
█ 217.55 - 225.00

name User code    ◀  Only in left graph    ▶  Only in right graph

root@dev-0:~/code/triton$ python python/tutorials/09-persistent-matmul.py
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
274.997 3998.267 ROOT
├─ nan 0.031 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.027 _ZN2at6native54_GLOBAL__N__a236ace4_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfLm4EPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SO_SH_EEvSJ_SK_RKSL_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SK_
├─ 285.498 2647.706 cublas [M=8192, N=8192, K=512]
│  └─ nan 2647.706 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
├─ 217.884 315.394 matmul_kernel [M=8192, N=8192, K=512]
├─ 262.534 261.755 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 246.617 278.649 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 262.525 261.764 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 295.007 232.942 torch [M=8192, N=8192, K=512]
   └─ nan 232.942 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas

Legend (Metric: tflop16/s (inc) Min: 217.88 Max: 295.01)
█ 287.29 - 295.01
█ 271.87 - 287.29
█ 256.45 - 271.87
█ 241.02 - 256.45
█ 225.60 - 241.02
█ 217.88 - 225.60

name User code    ◀  Only in left graph    ▶  Only in right graph

```

[Blackwell] Hoist constant TMem allocation out of the loop (#5857)

For cases where the TMem contents are constant, hoisting the allocation
out of the loop avoids having to store multiple times.

[AMD] Use LLVM ops for fp16<->fp32 casts (#5859)

Inline assembly can block optimizations in the LLVM backend.

[BW][PIPELINE] Add an option to tl.range to disallow accumulator multi-buffering (#5858)

Rework mmav5 pipelining to allow pipelining the MMA even when
multi-buffering of the accumulator is impossible, by putting its uses in
the same stage as the MMA and blocking on a wait until the current MMA
finishes. Based on this support, introduce a new flag on `tl.range` that
controls whether multi-buffering of the accumulator of the dots in the
loop is allowed. Without the mentioned rework of mmav5 pipelining, we
would simply not pipeline cases where multi-buffering is disallowed.
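A sketch of how the flag might be used from the frontend (the keyword name is inferred from the `tt.disallow_acc_multi_buffer` attribute added in this PR; the exact frontend spelling is an assumption):

```python
import triton
import triton.language as tl

@triton.jit
def matmul_single_buffer_acc(a_ptr, b_ptr, c_ptr, M, N, K,
                             BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                             BLOCK_K: tl.constexpr):
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # Assumed keyword: the MMA is still pipelined, but the accumulator stays
    # in a single buffer and its uses share a stage with the MMA.
    for k in tl.range(0, K, BLOCK_K, disallow_acc_multi_buffer=True):
        offs_k = k + tl.arange(0, BLOCK_K)
        a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])
        b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])
        acc = tl.dot(a, b, acc)
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], acc.to(tl.float16))
```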

[AMD] NFC: Refactor DotOpMFMAConversionHelper (#5862)

This PR refactors `DotOpMFMAConversionHelper` by extracting utility
functions from `convertDot` to make it easier to extend in
#5845.

[AMD][CI] Switch to rocm hosted docker image (#5855)

The CI docker image has been moved from `rocmshared/pytorch` to
`rocm/pytorch`, so we need to adjust our workflows.

[triton][NFC] Refactor build function of ScanOp (#5861)

It's better to use ScanOp's own build function rather than reusing ReduceOp's.

Jokeren committed Feb 10, 2025
1 parent 61e4b07 commit c9a21b4
Showing 132 changed files with 4,071 additions and 1,839 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/integration-tests.yml
@@ -288,7 +288,7 @@ jobs:
runner: ${{fromJson(needs.Runner-Preparation.outputs.matrix-HIP)}}
name: Integration-Tests (${{matrix.runner[1] == 'gfx90a' && 'mi210' || 'mi300x'}})
container:
image: rocmshared/pytorch:rocm6.2.2_ubuntu22.04_py3.10_pytorch_2.5.1_asan
image: rocm/pytorch:rocm6.2.2_ubuntu22.04_py3.10_pytorch_2.5.1_asan
options: --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video --user root
steps:
- name: Checkout
2 changes: 1 addition & 1 deletion .github/workflows/integration-tests.yml.in
@@ -334,7 +334,7 @@ jobs:
name: Integration-Tests (${{matrix.runner[1] == 'gfx90a' && 'mi210' || 'mi300x'}})

container:
image: rocmshared/pytorch:rocm6.2.2_ubuntu22.04_py3.10_pytorch_2.5.1_asan
image: rocm/pytorch:rocm6.2.2_ubuntu22.04_py3.10_pytorch_2.5.1_asan
options: --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video --user root

steps:
2 changes: 1 addition & 1 deletion .github/workflows/wheels_v2.yml
@@ -22,7 +22,7 @@ jobs:
docker container prune -f
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4

# The LATEST_DATE here should be kept in sync with the one in Patch setup.py
- id: check-version
18 changes: 18 additions & 0 deletions README.md
@@ -284,3 +284,21 @@ Supported Hardware:
- NVIDIA GPUs (Compute Capability 8.0+)
- AMD GPUs (ROCm 6.2+)
- Under development: CPUs

# Development Container (Dev Container)

**Dev Containers** for the Triton project are available from
the [triton-dev-containers repository](https://github.com/redhat-et/triton-dev-containers)

### Key Benefits:
- **Consistency**: All developers can work with the same development
environment, ensuring uniform behavior across different systems.
- **Isolation**: The container prevents potential conflicts with software
installed on your local machine.
- **Portability**: Easily share the development environment with team members,
minimizing onboarding time and setup issues.

### How to Use the Dev Container:

For detailed instructions on how to use the dev containers please see
the [dev container user guide](https://github.com/redhat-et/triton-dev-containers/blob/main/.devcontainer/devcontainer.md)
2 changes: 1 addition & 1 deletion cmake/llvm-hash.txt
@@ -1 +1 @@
c118864223c6309378cd704f3406533474c2759f
ffe3129e9bdc146ee4d91e849173d1c64b1ae974
@@ -28,20 +28,6 @@ constexpr int patternBenefitClampOptimizedPattern = 20;
constexpr int patternBenefitConvertLayoutOptimizedPattern = 20;
constexpr int patternBenefitNvidiaTensorCoreSubviewPattern = 20;

struct BackendCallbacks {
/**
* A backend-specific callback for appending auxiliary data during
* `LocalStoreOp` conversion.
*
* @param[in] op The reference to the re-written `LocalStoreOp`.
* @param[in] count The number of issued LLVM instructions.
* @param[in] type The input type of issued LLVM instructions.
*/
std::function<void(triton::gpu::LocalStoreOp op, size_t llvmOpCount,
Type llvmOpType)>
localStoreOpConversion = nullptr;
};

void populateElementwiseOpToLLVMPatterns(
LLVMTypeConverter &typeConverter, RewritePatternSet &patterns,
ModuleAxisInfoAnalysis &axisInfoAnalysis, const TargetInfoBase &targetInfo,
@@ -51,10 +37,10 @@ void populateElementwiseOpToLLVMPatterns(
// callback receives 1) the current source op, 2) the number of issued LLVM
// instructions and 3) their input types. Each MLIR backend can provide a
// callback and, thus, handle backend-specific behaviors.
void populateMemoryOpToLLVMPatterns(
LLVMTypeConverter &typeConverter, const TargetInfoBase &targetInfo,
RewritePatternSet &patterns, PatternBenefit benefit,
std::optional<BackendCallbacks> backendCallbacks = std::nullopt);
void populateMemoryOpToLLVMPatterns(LLVMTypeConverter &typeConverter,
const TargetInfoBase &targetInfo,
RewritePatternSet &patterns,
PatternBenefit benefit);

void populateAssertOpToLLVMPattern(LLVMTypeConverter &typeConverter,
RewritePatternSet &patterns,
5 changes: 5 additions & 0 deletions include/triton/Conversion/TritonGPUToLLVM/TargetInfoBase.h
@@ -91,6 +91,11 @@ class TargetInfoBase {

virtual bool supportVectorizedAtomics() const = 0;

// Helper used by targets to annotate store operations during lowering to
// llvm.
virtual void storeOpAnnotation(triton::gpu::LocalStoreOp op,
size_t localStoreOpCount, Type type) const {}

virtual ~TargetInfoBase() {}
};
} // namespace mlir::triton
2 changes: 2 additions & 0 deletions include/triton/Dialect/Triton/IR/OpInterfaces.h
@@ -11,6 +11,8 @@ namespace impl {

LogicalResult verifyTransposeOpInterface(Operation *op);

LogicalResult verifyDotOpInterface(Operation *op);

} // namespace impl

} // namespace triton
47 changes: 0 additions & 47 deletions include/triton/Dialect/Triton/IR/Traits.h
@@ -58,53 +58,6 @@ class VerifyTensorLayoutsTrait
}
};

// Verify if the op is a dot-like operation.
// A dot-like operation should have three operands.
// The first two operands should share a common dimension, and the result
// should have the dimensions of the two operands that are not shared.
// A dot-like operation can be either 2d or 3d.
// In the 3d case, the first dimension of operands is the batch dimension.
template <class ConcreteType>
class DotLike : public TraitBase<ConcreteType, DotLike> {
public:
static LogicalResult verifyTrait(Operation *op) {
if (op->getNumOperands() < 3)
return op->emitOpError("expected at least 3 operands");
auto aTy = cast<ShapedType>(op->getOperand(0).getType());
auto bTy = cast<ShapedType>(op->getOperand(1).getType());
auto cTy = cast<ShapedType>(op->getOperand(2).getType());
auto aShape = aTy.getShape();
auto bShape = bTy.getShape();
auto cShape = cTy.getShape();
// Check if all 3d or all 2d
if (aShape.size() != 2 && aShape.size() != 3)
return op->emitOpError("expected operands to be 2d or 3d");
if (aShape.size() != bShape.size() || aShape.size() != cShape.size())
return op->emitOpError("expected all operands to have the same rank");
// Check if the first two operands share a common dimension
// TODO: enable back with an interface to support scaled dot.
// if (aShape[aShape.size() - 1] != bShape[aShape.size() - 2])
// return op->emitOpError("expected the last dimension of the first
// operand "
// "to be equal to the second-to-last dimension of
// " "the second operand");
// Check the batch dimension
if (aShape.size() == 3 &&
(aShape[0] != cShape[0] || bShape[0] != cShape[0]))
return op->emitOpError("expected the first dimension of the first "
"operand to be equal to the first dimension of "
"the result");
// Check the output shape
if (cShape[cShape.size() - 2] != aShape[aShape.size() - 2] ||
cShape[cShape.size() - 1] != bShape[aShape.size() - 1])
return op->emitOpError(
"expected the output shape to be the concatenation of the last "
"dimension of the first operand and the last dimension of the "
"second ");
return success();
}
};

template <typename ConcreteType>
class SameOperandsAndResultEncoding
: public TraitBase<ConcreteType, SameOperandsAndResultEncoding> {
1 change: 0 additions & 1 deletion include/triton/Dialect/Triton/IR/TritonInterfaces.td
@@ -6,7 +6,6 @@ include "mlir/Interfaces/InferTypeOpInterface.td"

def TensorSizeTrait : NativeOpTrait<"TensorSizeTrait">;
def VerifyTensorLayoutsTrait : NativeOpTrait<"VerifyTensorLayoutsTrait">;
def DotLike : NativeOpTrait<"DotLike">;
def SameOperandsEncoding : NativeOpTrait<"SameOperandsEncoding">;
def SameOperandsAndResultEncoding : NativeOpTrait<"SameOperandsAndResultEncoding">;
def SameLoadStoreOperandsShape : NativeOpTrait<"SameLoadStoreOperandsShape">;
22 changes: 21 additions & 1 deletion include/triton/Dialect/Triton/IR/TritonOpInterfaces.td
@@ -29,7 +29,27 @@ def TransposeOpInterface : OpInterface<"TransposeOpInterface"> {
/*args=*/(ins)>
];

let verify = [{ return ::mlir::triton::impl::verifyTransposeOpInterface($_op); }];
let verify = [{ return ::mlir::triton::impl::verifyTransposeOpInterface($_op); }];
}

def DotOpInterface : OpInterface<"DotOpInterface"> {
let description = [{
This interface is implemented by operations that perform a dot product.
}];

let cppNamespace = "::mlir::triton";

let methods = [
InterfaceMethod<
/*desc=*/[{
Verifies the dimensions of the A and B DotOp operands.
}],
/*retType=*/"bool",
/*methodName=*/"verifyDims",
/*args=*/(ins)>
];

let verify = [{ return ::mlir::triton::impl::verifyDotOpInterface($_op); }];
}


4 changes: 2 additions & 2 deletions include/triton/Dialect/Triton/IR/TritonOps.td
@@ -631,7 +631,7 @@ def TT_GetNumProgramsOp : TT_Op<"get_num_programs", [Pure]> {
//
def TT_DotOp : TT_Op<"dot", [Pure,
DeclareOpInterfaceMethods<InferTypeOpInterface>,
DotLike,
DeclareOpInterfaceMethods<DotOpInterface>,
TypesMatchWith<"result's type matches accumulator's type",
"d", "c", "$_self">]> {
let summary = "dot";
@@ -671,7 +671,7 @@ def TT_DotOp : TT_Op<"dot", [Pure,
//
def TT_DotScaledOp : TT_Op<"dot_scaled", [Pure,
AttrSizedOperandSegments,
DotLike,
DeclareOpInterfaceMethods<DotOpInterface>,
TypesMatchWith<"result's type matches accumulator's type",
"d", "c", "$_self">]> {
let summary = "dot_scaled";
6 changes: 4 additions & 2 deletions include/triton/Dialect/TritonGPU/IR/Dialect.h
@@ -70,10 +70,12 @@ struct SharedMemory : public SideEffects::Resource::Base<SharedMemory> {
StringRef getName() final { return "<SharedMemory>"; }
};

// Convert a distributed layout to a linear encoding
LinearEncodingAttr toLinearEncoding(Attribute layout, ArrayRef<int64_t> shape);

unsigned getTotalElemsPerThread(Type type);

unsigned getTotalElemsPerThread(Attribute layout, ArrayRef<int64_t> shape,
Type eltTy);
unsigned getTotalElemsPerThread(Attribute layout, ArrayRef<int64_t> shape);

SmallVector<unsigned> getElemsPerThread(Type type);

21 changes: 13 additions & 8 deletions include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
@@ -501,13 +501,17 @@ We call each individual tile "rep".
InterfaceMethod<"Return total element size per thread.",
"unsigned",
"getTotalElemsPerThread",
(ins "ArrayRef<int64_t>":$tensorShape,
"Type":$eltTy)>,
(ins "ArrayRef<int64_t>":$shape),
/*defaultImplementation=*/[{
return toLinearEncoding($_self, shape).getTotalElemsPerThread(shape);
}]>,
InterfaceMethod<"Return element size per thread in each dimension.",
"SmallVector<unsigned>",
"getElemsPerThread",
(ins "ArrayRef<int64_t>":$tensorShape,
"Type":$eltTy)>,
(ins "ArrayRef<int64_t>":$shape),
/*defaultImplementation=*/[{
return toLinearEncoding($_self, shape).getElemsPerThread(shape);
}]>,
// Interface for the meta information about the multiple thread hierarchy.
InterfaceMethod<"Get the shape of the warps per CTA.",
"SmallVector<unsigned>",
@@ -577,8 +581,7 @@ L(T) = [ {0,8} , {1,9} , {2,10}, {3,11}, {0,8} , {1, 9} , {2, 10}, {3, 11},
}];

code extraDistributedDeclaration = extraBaseClassDeclaration # [{
unsigned getTotalElemsPerThread(ArrayRef<int64_t> shape, Type eltTy) const;
SmallVector<unsigned> getElemsPerThread(ArrayRef<int64_t> shape, Type eltTy) const;
// Implemented in subclasses
SmallVector<unsigned> getRepOrder() const;
SmallVector<unsigned> getCTAsPerCGA() const;
SmallVector<unsigned> getCTAOrder() const;
@@ -613,6 +616,10 @@ def LinearEncodingAttr : DistributedEncoding<"LinearEncoding", "linear_encoding"
let parameters = (ins LinearLayoutParam:$linearLayout);

let extraClassDeclaration = extraDistributedDeclaration # [{
// Generic distributed encoding methods
unsigned getTotalElemsPerThread(ArrayRef<int64_t> shape) const;
SmallVector<unsigned> getElemsPerThread(ArrayRef<int64_t> shape) const;

SmallVector<unsigned> getContigPerThread() const;
SmallVector<unsigned> getOrder() const;

@@ -965,7 +972,6 @@ V [ 0,4,8...60 1,5...61 2,6...62 3,7...63 ] [ 128,132...188 129,
return true;
}
SmallVector<unsigned> getSizePerThreadForOperand(int kWidth, int opIdx) const;
unsigned getTotalElemsPerThreadForOperand(ArrayRef<int64_t> shape, Type eltTy, int kWidth, int opIdx) const;
SmallVector<int64_t> getInstrShapeForOperand(int kWidth, int opIdx) const;
SmallVector<int64_t> getRepForOperand(ArrayRef<int64_t> operandShape, int kWidth, int opIdx) const;
SmallVector<unsigned> getRepOrderForOperand(int opIdx) const;
@@ -1095,7 +1101,6 @@ Row |
return true;
}
SmallVector<unsigned> getSizePerThreadForOperand(int kWidth, int opIdx) const;
unsigned getTotalElemsPerThreadForOperand(ArrayRef<int64_t> shape, Type eltTy, int kWidth, int opIdx) const;
SmallVector<int64_t> getElemsPerInstrForOperands() const;
SmallVector<int64_t> getRepForOperand(ArrayRef<int64_t> operandShape,
Type elemType, int kWidth, int opIdx) const;
@@ -10,6 +10,8 @@ namespace mlir {
namespace triton {

static const char *kNumStagesAttrName = "tt.num_stages";
static const char *kDisallowAccMultiBufferAttrName =
"tt.disallow_acc_multi_buffer";
static const char *kLoopStageAttrName = "loop.stage";
static const char *kLoopClusterAttrName = "loop.cluster";

@@ -37,6 +39,10 @@ void addOps(scf::ForOp forOp, int stage,
void replaceUsesAndPropagateType(OpBuilder &builder, Operation *oldUse,
Value val);

// Return true if the given ForOp has the attribute
// `tt.disallow_acc_multi_buffer` set to true.
bool getDisallowAccMultiBuffer(scf::ForOp forOp);

// Return the minClusterId and maxClusterId for the given ForOp.
std::pair<int, int> getMinMaxCluster(scf::ForOp &forOp);
std::pair<int, int> getStageCluster(Operation *op);
8 changes: 0 additions & 8 deletions include/triton/Dialect/TritonGPU/Transforms/Utility.h
@@ -200,14 +200,6 @@ StringRef getAMDArch(Operation *module);
std::optional<mlir::triton::gpu::SwizzledSharedEncodingAttr>
getSharedEncIfAllUsersAreDotEnc(Value val, bool &incompatible);

enum class MMALoadType {
SharedV3,
Registers, // may be v2 or v3
DoNotPipeline, // could be a valid shared/registers MMA operand, but skip
// pipelining
};
MMALoadType getMMALoadType(Operation *loadOp);

// Convert \param op operands and results to layout \param encoding.
void convertOpEncoding(Attribute encoding, Operation *op);

5 changes: 5 additions & 0 deletions include/triton/Dialect/TritonNvidiaGPU/IR/CMakeLists.txt
@@ -15,3 +15,8 @@ mlir_tablegen(TritonNvidiaGPUAttrDefs.cpp.inc -gen-attrdef-defs)
mlir_tablegen(OpsEnums.h.inc -gen-enum-decls)
mlir_tablegen(OpsEnums.cpp.inc -gen-enum-defs)
add_public_tablegen_target(TritonNvidiaGPUAttrDefsIncGen)

set(LLVM_TARGET_DEFINITIONS TritonNvidiaGPUOpInterfaces.td)
mlir_tablegen(TritonNvidiaGPUOpInterfaces.h.inc -gen-op-interface-decls)
mlir_tablegen(TritonNvidiaGPUOpInterfaces.cpp.inc -gen-op-interface-defs)
add_public_tablegen_target(TritonNvidiaGPUOpInterfacesIncGen)
2 changes: 2 additions & 0 deletions include/triton/Dialect/TritonNvidiaGPU/IR/Dialect.h
@@ -37,6 +37,8 @@
#define GET_ATTRDEF_CLASSES
#include "triton/Dialect/TritonNvidiaGPU/IR/TritonNvidiaGPUAttrDefs.h.inc"

#include "triton/Dialect/TritonNvidiaGPU/IR/TritonNvidiaGPUOpInterfaces.h.inc"

#define GET_OP_CLASSES
#include "triton/Dialect/TritonNvidiaGPU/IR/Ops.h.inc"

@@ -0,0 +1,42 @@
#ifndef TRITON_NVIDIAGPU_OP_INTERFACES
#define TRITON_NVIDIAGPU_OP_INTERFACES

include "mlir/IR/OpBase.td"

def MMAv5OpInterface : OpInterface<"MMAv5OpInterface"> {
let description = [{
This interface is implemented by MMAv5 dot and dot scaled ops.
}];

let cppNamespace = "::mlir::triton::nvidia_gpu";

// We can add more methods as needed.
let methods = [
InterfaceMethod<"Return the accumulator init flag.",
"::mlir::Value",
"useAccumulator">,
InterfaceMethod<"Set the accumulator init flag.",
"void",
"setUseAccumulator",
(ins "::mlir::Value":$flag)>,
InterfaceMethod<"Associate a new barrier to this MMAv5 op.",
"void",
"setBarrier",
(ins "::mlir::Value":$barrier)>,
InterfaceMethod<"Return the accumulator.",
"::mlir::Value",
"getAccumulator">,
InterfaceMethod<"Set the accumulator.",
"void",
"setAccumulator",
(ins "::mlir::Value":$accum)>,
InterfaceMethod<"Return the predicate of this op.",
"::mlir::Value",
"getPredicate">,
InterfaceMethod<"Set the predicate of this op.",
"void",
"setPredicate",
(ins "::mlir::Value":$pred)>,
];
}
#endif // TRITON_NVIDIAGPU_OP_INTERFACES
