
[PROTON-DEV] Restructure files #5846

Closed

wants to merge 5 commits
Conversation

Jokeren
Contributor

@Jokeren Jokeren commented Feb 7, 2025

No description provided.

@Jokeren Jokeren force-pushed the keren/proton-dev-gpu branch 3 times, most recently from c1e53ff to 04b6c94 Compare February 10, 2025 16:31
Jokeren and others added 3 commits February 10, 2025 11:32
[Blackwell] Enable MMA pipelining for scaled dot when TMEM copy is used (#5812)

This PR enables MMA pipelining for scaled dot.

The main difficulty this PR overcomes is the dependency cycle between
TMEM copy rewriting and SWP: currently, TMEM copy rewriting relies on
SWP to put the loading of scales into SMEM, while applying MMA
pipelining during SWP requires TMEM copy rewriting to have happened
beforehand. I propose to break the cycle by having the loading of
scales go through `local_alloc` and `local_load` in `AccelerateMatmul`.
This way, TMEM copy rewriting happens during [the first call to
`OptimizeDotOperands`](https://github.com/triton-lang/triton/blob/1e0e51c4aeb3e1beea000da5d0e494f8b9ac40dd/third_party/nvidia/backend/compiler.py#L260),
before SWP, and the local alloc and load added in `AccelerateMatmul` are
eliminated during SWP. It's a bit ad hoc to add local alloc for scales
there, since scales do not need to be in SMEM, but other solutions, like
decoupling MMA pipelining from SWP, are more difficult.

The other changes in this PR are for making SWP recognize loading of
scales when there is TMEM copy between scale load and MMA.
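As a rough sketch, the ordering that breaks the cycle can be modeled like this (the function names below are hypothetical stand-ins, not the real Triton pass entry points):

```python
# Toy model of the pass ordering described above; the names are
# stand-ins, not the actual Triton compiler passes.
def accelerate_matmul(log):
    # Routes scale loads through local_alloc/local_load.
    log.append("accelerate_matmul")

def optimize_dot_operands(log):
    # TMEM copy rewriting happens on this first call, before SWP.
    log.append("optimize_dot_operands")

def software_pipeline(log):
    # SWP: applies MMA pipelining and eliminates the temporary local ops.
    log.append("software_pipeline")

def run_pipeline():
    log = []
    for p in (accelerate_matmul, optimize_dot_operands, software_pipeline):
        p(log)
    return log
```

The essential point is only the ordering: TMEM copy rewriting runs strictly before SWP, so SWP no longer has to produce the SMEM loads that rewriting depends on.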

@ThomasRaoux @pawelszczerbuk @csullivan @mbrookhart @binarybana

---------

Co-authored-by: Masahiro Masuda <[email protected]>
Co-authored-by: Jason Knight <[email protected]>

Fix default num_stages values mismatch between Python frontend and MLIR. (#5804)

Address the discrepancy between the default value of `num_stages` in the
Python frontend and the MLIR implementation. Previously, the values were
not aligned, which could cause inconsistent behavior between the Python
frontend and triton-opt.

This PR updates the default value of `num_stages` in the Python frontend
to match the values used in MLIR:
- [TritonGPUPipeline](https://github.com/triton-lang/triton/blob/032fa41a45847cdc00119ed3bdd5bc0adab9c938/include/triton/Dialect/TritonGPU/Transforms/Passes.td#L21)
- [TritonGPUTestPipelineAssignLatencies](https://github.com/triton-lang/triton/blob/032fa41a45847cdc00119ed3bdd5bc0adab9c938/include/triton/Dialect/TritonGPU/Transforms/Passes.td#L58)
- [TritonGPULoopScheduling](https://github.com/triton-lang/triton/blob/032fa41a45847cdc00119ed3bdd5bc0adab9c938/include/triton/Dialect/TritonGPU/Transforms/Passes.td#L266)
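One way to avoid this class of drift is to resolve the default from a single shared constant. The sketch below is illustrative only; the constant name is hypothetical and the value 3 is an assumption, not taken from the diff:

```python
# Hypothetical sketch: one constant mirrors the MLIR pass default so
# the Python frontend cannot silently diverge from it.
MLIR_DEFAULT_NUM_STAGES = 3  # assumed to match the Passes.td default

def resolve_num_stages(num_stages=None):
    """Return the user's value, or the shared default when unset."""
    return MLIR_DEFAULT_NUM_STAGES if num_stages is None else num_stages
```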

<!---
The core Triton is a small number of people, and we receive many PRs
(thank
you!).  To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the
following
tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]`
with
`[x]` to indicate you have done them.
-->

- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

Co-authored-by: pawelszczerbuk <[email protected]>

Bump actions/checkout from 3 to 4 (#5714)

Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to
4.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/actions/checkout/releases">actions/checkout's
releases</a>.</em></p>
<blockquote>
<h2>v4.0.0</h2>
<h2>What's Changed</h2>
<ul>
<li>Update default runtime to node20 by <a
href="https://github.com/takost"><code>@​takost</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1436">actions/checkout#1436</a></li>
<li>Support fetching without the --progress option by <a
href="https://github.com/simonbaird"><code>@​simonbaird</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1067">actions/checkout#1067</a></li>
<li>Release 4.0.0 by <a
href="https://github.com/takost"><code>@​takost</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1447">actions/checkout#1447</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a href="https://github.com/takost"><code>@​takost</code></a> made
their first contribution in <a
href="https://redirect.github.com/actions/checkout/pull/1436">actions/checkout#1436</a></li>
<li><a
href="https://github.com/simonbaird"><code>@​simonbaird</code></a> made
their first contribution in <a
href="https://redirect.github.com/actions/checkout/pull/1067">actions/checkout#1067</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/actions/checkout/compare/v3...v4.0.0">https://github.com/actions/checkout/compare/v3...v4.0.0</a></p>
<h2>v3.6.0</h2>
<h2>What's Changed</h2>
<ul>
<li>Mark test scripts with Bash'isms to be run via Bash by <a
href="https://github.com/dscho"><code>@​dscho</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1377">actions/checkout#1377</a></li>
<li>Add option to fetch tags even if fetch-depth &gt; 0 by <a
href="https://github.com/RobertWieczoreck"><code>@​RobertWieczoreck</code></a>
in <a
href="https://redirect.github.com/actions/checkout/pull/579">actions/checkout#579</a></li>
<li>Release 3.6.0 by <a
href="https://github.com/luketomlinson"><code>@​luketomlinson</code></a>
in <a
href="https://redirect.github.com/actions/checkout/pull/1437">actions/checkout#1437</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a
href="https://github.com/RobertWieczoreck"><code>@​RobertWieczoreck</code></a>
made their first contribution in <a
href="https://redirect.github.com/actions/checkout/pull/579">actions/checkout#579</a></li>
<li><a
href="https://github.com/luketomlinson"><code>@​luketomlinson</code></a>
made their first contribution in <a
href="https://redirect.github.com/actions/checkout/pull/1437">actions/checkout#1437</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/actions/checkout/compare/v3.5.3...v3.6.0">https://github.com/actions/checkout/compare/v3.5.3...v3.6.0</a></p>
<h2>v3.5.3</h2>
<h2>What's Changed</h2>
<ul>
<li>Fix: Checkout Issue in self hosted runner due to faulty submodule
check-ins by <a
href="https://github.com/megamanics"><code>@​megamanics</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1196">actions/checkout#1196</a></li>
<li>Fix typos found by codespell by <a
href="https://github.com/DimitriPapadopoulos"><code>@​DimitriPapadopoulos</code></a>
in <a
href="https://redirect.github.com/actions/checkout/pull/1287">actions/checkout#1287</a></li>
<li>Add support for sparse checkouts by <a
href="https://github.com/dscho"><code>@​dscho</code></a> and <a
href="https://github.com/dfdez"><code>@​dfdez</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1369">actions/checkout#1369</a></li>
<li>Release v3.5.3 by <a
href="https://github.com/TingluoHuang"><code>@​TingluoHuang</code></a>
in <a
href="https://redirect.github.com/actions/checkout/pull/1376">actions/checkout#1376</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a
href="https://github.com/megamanics"><code>@​megamanics</code></a> made
their first contribution in <a
href="https://redirect.github.com/actions/checkout/pull/1196">actions/checkout#1196</a></li>
<li><a
href="https://github.com/DimitriPapadopoulos"><code>@​DimitriPapadopoulos</code></a>
made their first contribution in <a
href="https://redirect.github.com/actions/checkout/pull/1287">actions/checkout#1287</a></li>
<li><a href="https://github.com/dfdez"><code>@​dfdez</code></a> made
their first contribution in <a
href="https://redirect.github.com/actions/checkout/pull/1369">actions/checkout#1369</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/actions/checkout/compare/v3...v3.5.3">https://github.com/actions/checkout/compare/v3...v3.5.3</a></p>
<h2>v3.5.2</h2>
<h2>What's Changed</h2>
<ul>
<li>Fix: Use correct API url / endpoint in GHES by <a
href="https://github.com/fhammerl"><code>@​fhammerl</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1289">actions/checkout#1289</a>
based on <a
href="https://redirect.github.com/actions/checkout/issues/1286">#1286</a>
by <a href="https://github.com/1newsr"><code>@​1newsr</code></a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/actions/checkout/compare/v3.5.1...v3.5.2">https://github.com/actions/checkout/compare/v3.5.1...v3.5.2</a></p>
<h2>v3.5.1</h2>
<h2>What's Changed</h2>
<ul>
<li>Improve checkout performance on Windows runners by upgrading
<code>@​actions/github</code> dependency by <a
href="https://github.com/BrettDong"><code>@​BrettDong</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1246">actions/checkout#1246</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a href="https://github.com/BrettDong"><code>@​BrettDong</code></a>
made their first contribution in <a
href="https://redirect.github.com/actions/checkout/pull/1246">actions/checkout#1246</a></li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/actions/checkout/blob/main/CHANGELOG.md">actions/checkout's
changelog</a>.</em></p>
<blockquote>
<h1>Changelog</h1>
<h2>v4.2.2</h2>
<ul>
<li><code>url-helper.ts</code> now leverages well-known environment
variables by <a href="https://github.com/jww3"><code>@​jww3</code></a>
in <a
href="https://redirect.github.com/actions/checkout/pull/1941">actions/checkout#1941</a></li>
<li>Expand unit test coverage for <code>isGhes</code> by <a
href="https://github.com/jww3"><code>@​jww3</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1946">actions/checkout#1946</a></li>
</ul>
<h2>v4.2.1</h2>
<ul>
<li>Check out other refs/* by commit if provided, fall back to ref by <a
href="https://github.com/orhantoy"><code>@​orhantoy</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1924">actions/checkout#1924</a></li>
</ul>
<h2>v4.2.0</h2>
<ul>
<li>Add Ref and Commit outputs by <a
href="https://github.com/lucacome"><code>@​lucacome</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1180">actions/checkout#1180</a></li>
<li>Dependency updates by <a
href="https://github.com/dependabot"><code>@​dependabot</code></a>- <a
href="https://redirect.github.com/actions/checkout/pull/1777">actions/checkout#1777</a>,
<a
href="https://redirect.github.com/actions/checkout/pull/1872">actions/checkout#1872</a></li>
</ul>
<h2>v4.1.7</h2>
<ul>
<li>Bump the minor-npm-dependencies group across 1 directory with 4
updates by <a
href="https://github.com/dependabot"><code>@​dependabot</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1739">actions/checkout#1739</a></li>
<li>Bump actions/checkout from 3 to 4 by <a
href="https://github.com/dependabot"><code>@​dependabot</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1697">actions/checkout#1697</a></li>
<li>Check out other refs/* by commit by <a
href="https://github.com/orhantoy"><code>@​orhantoy</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1774">actions/checkout#1774</a></li>
<li>Pin actions/checkout's own workflows to a known, good, stable
version. by <a href="https://github.com/jww3"><code>@​jww3</code></a> in
<a
href="https://redirect.github.com/actions/checkout/pull/1776">actions/checkout#1776</a></li>
</ul>
<h2>v4.1.6</h2>
<ul>
<li>Check platform to set archive extension appropriately by <a
href="https://github.com/cory-miller"><code>@​cory-miller</code></a> in
<a
href="https://redirect.github.com/actions/checkout/pull/1732">actions/checkout#1732</a></li>
</ul>
<h2>v4.1.5</h2>
<ul>
<li>Update NPM dependencies by <a
href="https://github.com/cory-miller"><code>@​cory-miller</code></a> in
<a
href="https://redirect.github.com/actions/checkout/pull/1703">actions/checkout#1703</a></li>
<li>Bump github/codeql-action from 2 to 3 by <a
href="https://github.com/dependabot"><code>@​dependabot</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1694">actions/checkout#1694</a></li>
<li>Bump actions/setup-node from 1 to 4 by <a
href="https://github.com/dependabot"><code>@​dependabot</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1696">actions/checkout#1696</a></li>
<li>Bump actions/upload-artifact from 2 to 4 by <a
href="https://github.com/dependabot"><code>@​dependabot</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1695">actions/checkout#1695</a></li>
<li>README: Suggest <code>user.email</code> to be
<code>41898282+github-actions[bot]@users.noreply.github.com</code> by <a
href="https://github.com/cory-miller"><code>@​cory-miller</code></a> in
<a
href="https://redirect.github.com/actions/checkout/pull/1707">actions/checkout#1707</a></li>
</ul>
<h2>v4.1.4</h2>
<ul>
<li>Disable <code>extensions.worktreeConfig</code> when disabling
<code>sparse-checkout</code> by <a
href="https://github.com/jww3"><code>@​jww3</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1692">actions/checkout#1692</a></li>
<li>Add dependabot config by <a
href="https://github.com/cory-miller"><code>@​cory-miller</code></a> in
<a
href="https://redirect.github.com/actions/checkout/pull/1688">actions/checkout#1688</a></li>
<li>Bump the minor-actions-dependencies group with 2 updates by <a
href="https://github.com/dependabot"><code>@​dependabot</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1693">actions/checkout#1693</a></li>
<li>Bump word-wrap from 1.2.3 to 1.2.5 by <a
href="https://github.com/dependabot"><code>@​dependabot</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1643">actions/checkout#1643</a></li>
</ul>
<h2>v4.1.3</h2>
<ul>
<li>Check git version before attempting to disable
<code>sparse-checkout</code> by <a
href="https://github.com/jww3"><code>@​jww3</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1656">actions/checkout#1656</a></li>
<li>Add SSH user parameter by <a
href="https://github.com/cory-miller"><code>@​cory-miller</code></a> in
<a
href="https://redirect.github.com/actions/checkout/pull/1685">actions/checkout#1685</a></li>
<li>Update <code>actions/checkout</code> version in
<code>update-main-version.yml</code> by <a
href="https://github.com/jww3"><code>@​jww3</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1650">actions/checkout#1650</a></li>
</ul>
<h2>v4.1.2</h2>
<ul>
<li>Fix: Disable sparse checkout whenever <code>sparse-checkout</code>
option is not present <a
href="https://github.com/dscho"><code>@​dscho</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1598">actions/checkout#1598</a></li>
</ul>
<h2>v4.1.1</h2>
<ul>
<li>Correct link to GitHub Docs by <a
href="https://github.com/peterbe"><code>@​peterbe</code></a> in <a
href="https://redirect.github.com/actions/checkout/pull/1511">actions/checkout#1511</a></li>
<li>Link to release page from what's new section by <a
href="https://github.com/cory-miller"><code>@​cory-miller</code></a> in
<a
href="https://redirect.github.com/actions/checkout/pull/1514">actions/checkout#1514</a></li>
</ul>
<h2>v4.1.0</h2>
<ul>
<li><a href="https://redirect.github.com/actions/checkout/pull/1396">Add
support for partial checkout filters</a></li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/actions/checkout/commit/11bd71901bbe5b1630ceea73d27597364c9af683"><code>11bd719</code></a>
Prepare 4.2.2 Release (<a
href="https://redirect.github.com/actions/checkout/issues/1953">#1953</a>)</li>
<li><a
href="https://github.com/actions/checkout/commit/e3d2460bbb42d7710191569f88069044cfb9d8cf"><code>e3d2460</code></a>
Expand unit test coverage (<a
href="https://redirect.github.com/actions/checkout/issues/1946">#1946</a>)</li>
<li><a
href="https://github.com/actions/checkout/commit/163217dfcd28294438ea1c1c149cfaf66eec283e"><code>163217d</code></a>
<code>url-helper.ts</code> now leverages well-known environment
variables. (<a
href="https://redirect.github.com/actions/checkout/issues/1941">#1941</a>)</li>
<li><a
href="https://github.com/actions/checkout/commit/eef61447b9ff4aafe5dcd4e0bbf5d482be7e7871"><code>eef6144</code></a>
Prepare 4.2.1 release (<a
href="https://redirect.github.com/actions/checkout/issues/1925">#1925</a>)</li>
<li><a
href="https://github.com/actions/checkout/commit/6b42224f41ee5dfe5395e27c8b2746f1f9955030"><code>6b42224</code></a>
Add workflow file for publishing releases to immutable action package
(<a
href="https://redirect.github.com/actions/checkout/issues/1919">#1919</a>)</li>
<li><a
href="https://github.com/actions/checkout/commit/de5a000abf73b6f4965bd1bcdf8f8d94a56ea815"><code>de5a000</code></a>
Check out other refs/* by commit if provided, fall back to ref (<a
href="https://redirect.github.com/actions/checkout/issues/1924">#1924</a>)</li>
<li><a
href="https://github.com/actions/checkout/commit/d632683dd7b4114ad314bca15554477dd762a938"><code>d632683</code></a>
Prepare 4.2.0 release (<a
href="https://redirect.github.com/actions/checkout/issues/1878">#1878</a>)</li>
<li><a
href="https://github.com/actions/checkout/commit/6d193bf28034eafb982f37bd894289fe649468fc"><code>6d193bf</code></a>
Bump braces from 3.0.2 to 3.0.3 (<a
href="https://redirect.github.com/actions/checkout/issues/1777">#1777</a>)</li>
<li><a
href="https://github.com/actions/checkout/commit/db0cee9a514becbbd4a101a5fbbbf47865ee316c"><code>db0cee9</code></a>
Bump the minor-npm-dependencies group across 1 directory with 4 updates
(<a
href="https://redirect.github.com/actions/checkout/issues/1872">#1872</a>)</li>
<li><a
href="https://github.com/actions/checkout/commit/b6849436894e144dbce29d7d7fda2ae3bf9d8365"><code>b684943</code></a>
Add Ref and Commit outputs (<a
href="https://redirect.github.com/actions/checkout/issues/1180">#1180</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/actions/checkout/compare/v3...v4">compare
view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=actions/checkout&package-manager=github_actions&previous-version=3&new-version=4)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)

</details>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[mlir][dialect] Refactor DotLike trait into a DotOpInterface + Enable verification of scaled_dot (#5796)

The Triton MLIR dialect presently has a `DotLike` trait which DotOps
have.

The problem with this trait is that its current implementation prevents
`scaled_dot` from being verified properly (dimensions are not checked at
the moment: "TODO: enable back with an interface to support scaled
dot.").
This PR refactors the `DotLike` trait into an interface that declares a
`verifyDims` function, which checks whether the dims of the A and B
operands are compatible (e.g., MxK1 and K2xN with K1 == K2 in the
simple case).

The initial implementation of DotOpInterface is similar to the prior
`DotLike` trait with the exception that it includes the `verifyDims`
function which all DotOps must implement---this function just checks
whether the dimensions of the A, B inputs match.

In the future this interface can be extended to include more
functionality.

Since this enables the verifier for `scaled_dot`, I think the existing
scaled dot tests should cover the changes made in this PR; if that is
wrong, I will add additional tests.
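A plain-Python sketch of the dimension check that `verifyDims` performs (hypothetical helper; the real interface is implemented in C++ on the dot ops):

```python
def verify_dot_dims(a_shape, b_shape):
    """Check that A (M x K1) and B (K2 x N) are dot-compatible, i.e. K1 == K2."""
    # Only the simple rank-2 case is modeled here.
    if len(a_shape) != 2 or len(b_shape) != 2:
        return False
    return a_shape[1] == b_shape[0]
```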

================================================================
<!---
The core Triton is a small number of people, and we receive many PRs
(thank
you!).  To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the
following
tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]`
with
`[x]` to indicate you have done them.
-->

- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
- [x] This PR does not need a test because `I think this should be
covered by existing tests`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

[ANALYSIS] Fixing overflow in AxisInfo.cpp. (#5821)

UBSan detected that the shift is happening on an int32 variable, but the
shift amount could be larger than what int32 supports.

I believe the change is too trivial to require a test, but please advise
accordingly if you think there is an appropriate way to capture this.
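The failure mode is a left shift on a 32-bit value by an amount of 31 or more. A minimal sketch of the safe pattern (hypothetical helper; the actual fix lives in AxisInfo.cpp and may differ):

```python
INT32_MAX = (1 << 31) - 1

def pow2_saturating(log2_val):
    """Return 2**log2_val, saturating rather than overflowing a signed 32-bit value."""
    # In C++, `1 << log2_val` on an int32 is undefined behavior once
    # log2_val >= 31; widening to 64 bits or saturating avoids it.
    if log2_val >= 31:
        return INT32_MAX
    return 1 << log2_val
```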

[Blackwell][Clean up] Remove use of SharedMemoryObject on TMEM (#5817)

Using `getSharedMemoryObjectFromStruct` etc on TMEM is very confusing
for new readers, so I'm introducing simpler alternatives for TMEM. In
practice, we only need the base address of TMEM.

@ThomasRaoux

---------

Co-authored-by: Masahiro Masuda <[email protected]>

[AMD] Support lowering GPU async copy/commit/wait ops (#5729)

Support lowering of `ttg.async_copy_global_to_local` for `gfx9` GPUs,
along with the associated `ttg.async_commit_group` and `ttg.async_wait`.
The lowering checks that the resulting writes are coalesced, which is a
hardware requirement.

Note that we are currently not emitting `AsyncCopyGlobalToLocal` for
AMD targets; this will come in a follow-up PR.
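The coalescing requirement amounts to consecutive lanes writing addresses that are contiguous. A simplified model of that condition (hypothetical helper, not the pass's actual check):

```python
def writes_are_coalesced(lane_addresses, elem_bytes):
    """True if consecutive lanes write consecutive elem_bytes-sized slots."""
    # Each neighboring pair of lanes must differ by exactly one element.
    return all(b - a == elem_bytes
               for a, b in zip(lane_addresses, lane_addresses[1:]))
```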

[AMD] Remove duplicate definition of use_buffer_ops (#5828)

Removes a duplicate definition of `use_buffer_ops` when loading the
`AMDGCN_USE_BUFFER_OPS` environment variable, likely introduced by a
merge conflict.

Co-authored-by: Nick Riasanovsky <[email protected]>

[PIPELINE] Relax requirements for wgmma operand register pipelining (#5810)

Always pipeline wgmma operands in registers if they are not suitable for
shmem pipelining.
This introduces a performance regression on attention workloads, but
fixes a crash introduced by
#5798.

[Blackwell][TUTORIALS] Add tutorial 10-block-scaled-matmul.py (#5813)

This tutorial demos Triton support for block-scaled matrix multiply on
Blackwell's 5th-generation tensor core with low-precision FP4 and FP8
datatypes. Planned follow-ups include optimized TMA loads for block
scale factors, and mixed-precision support.

Additional changes
* Moves MX dtype helper classes to triton/tools/mxfp.py for use in
tutorials as well as test code.

@ThomasRaoux @pawelszczerbuk @masahi @mbrookhart @binarybana

Revert "[AMD] Use more efficient fp32 to bf16 type conversion (#5633)" (#5829)

This reverts commit 1c28e08 because the
inline assembly was causing issues for the LLVM backend.

[TensorDescriptor] Support ndim > 2 loads and stores (#5830)

This adds support for 3-, 4-, and 5-dimensional tensor descriptors, with
tests that load and store work correctly. It was surprisingly easy, as
most of the code was already written generically; only a few changes
were needed to the TMA creation lowering, plus a bug fix for the LL
conversion of NVMMA encoding with >2 dims.

[AMD] Avoid unneeded instruction schedule pass runs (#5832)

We only want to run them if an active variant is requested.

[PIPELINE] Including mmav5 dot in shmem pipelining (#5835)

There is a bug introduced in [PIPELINE] Relax requirements for wgmma
operand register pipelining (#5810):
I was missing mmav5 dot support. Thanks to @peterbell10 for catching
that and to @masahi for proposing a fix.

Disable dot transpose optimization due to perf regressions (#5834)

There are cases where transposing the dot op may cause performance
problems. For example, if the accumulator is being used in something
like:
```
 %206 = ttg.convert_layout %205 : tensor<128x64xf16, #mma> -> tensor<128x64xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> loc(#loc216)
```
transposing it makes the conversion expensive. This is likely to happen
in attention kernels.

[AMD] Attach variant to the scheduling hint op (#5808)

This PR refactors the instruction scheduling infrastructure. A
particular "sched" variant becomes part of the instruction and is added
during the insertion pass. The instruction carries this
meta-information across many passes, allowing the same mechanism to be
re-used in other places.

doc: update readme with dev container info (#5819)

Add a reference to https://github.com/redhat-et/triton-dev-containers
which contains a set of Development Containers for Triton. Development
containers have the following Key benefits:

- **Consistency**: All developers can work with the same development
environment, ensuring uniform behavior across different systems.
- **Isolation**: The container prevents potential conflicts with
software installed on your local machine.
- **Portability**: Easily share the development environment with team
members, minimizing onboarding time and setup issues.

<!---
The core Triton is a small number of people, and we receive many PRs
(thank
you!).  To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the
following
tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]`
with
`[x]` to indicate you have done them.
-->

- [ ] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
- [x] This PR does not need a test because `It's just a documentation
change`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

Signed-off-by: Maryam Tahhan <[email protected]>

[NFC] Finish clean-up of #5834 (#5837)

That partial revert left a couple of auxiliary functions that are not
currently needed.

[AMD] Avoiding undefined behaviour due to llvm_unreachable (#5822)

It is intended to always abort, but with assertions disabled it is
undefined behavior and can do something else. For example, this test
sometimes deadlocks on ARM, resulting in increased flakiness.

This pattern has also been used elsewhere in the file, so this makes
them consistent.

[AMD] Enable block pingpong for smaller tiles (#5820)

Recent experiments found that block pingpong also helps a few more
configs, especially smaller tiles.
Enable one-cluster pingpong for tiles that are four times smaller.

[Pipeliner] Fix crash in rewriting TMA descriptor updates (#5843)

Lots of our code assumes that `scf.if` has a non-empty else region, but
sometimes it can be empty, which typically happens due to one of the
`scf.if` canonicalizers. Just make sure to create `scf.if` with
non-empty regions.

This was split off from #5726
since others were hitting the crash.

[Pipeliner] Fix mmav3 pipelining (#5844)

Make sure we allocate the right number of slices when doing mmav3
pipelining.

[BACKEND] bump to llvm/llvm-project@ffe3129e9bdc (#5814)

Pulls in llvm/llvm-project#125268 which is
necessary for #5563.

[AMD][BACKEND] Enable cache ctrl bits for gfx950 (#5838)

Enables cache control bits based on Triton's `cacheModifier` on gfx950.
They are identical to those on `gfx942`.

[DOC] Fix generating docs for tutorials (#5850)

Do not `exit` when generating tutorials, which would cause the whole
documentation pipeline to exit.

[Tests] Re-enable tests and clean up test_matmul (#5853)

Reduce unnecessary configs, re-enable a test, and simplify simple_matmul.

[Blackwell][Clean up] Introduce interface for MMAv5 ops (#5848)

The goal is to let `TCGen5MMAOp` and `TCGen5MMAScaledOp` share an
interface so that the rest of code can work generically with them. The
MMA pipelining pass gets cleaned up a lot, and the accum init flag
optimization is now automatically enabled for `TCGen5MMAScaledOp` as
well.

---------

Co-authored-by: Masahiro Masuda <[email protected]>

[NFC] Remove custom backend callback and move it to TargetInfo (#5854)

[LAYOUTS] Implement generically getElemsPerThread (#5841)

While doing so, we remove the SliceEncodingAttr hack!

[Pipeliner] Enable automatic loop fusion (#5726)

This PR turns on automatic loop fusion in the CUDA >= 8.0 pass
pipelines. Automatic loop fusion is only enabled for simple loop nests
(1 outer loop, 1 inner loop), when the user requests fusion with
`tl.range(..., fuse=True)` in the frontend.

This PR also rewrites the persistent matmul examples to use loop nests.
This is cleaner, but will also enable more powerful and flexible
optimizations of loop nests in the future. Primarily, it hides the
brittleness of the pipeliner behind a single layer inside the compiler,
so ideally the brittleness needs to be dealt with only once and hidden
from users.

To achieve this, several things have been added to loop fusion:

1. To avoid generating the inner loop inside a conditional, loop nest
fusion will "speculate" the length of the inner loop, essentially
generating a branch where the inner loop is missing and one where the
inner loop is always known to execute at least once.
2. Codegen of the loop induction variables has been slightly altered to
better match the expectations of the scheduler, pipeliner(s), and
`optimize-accumulator-init`.
3. Codegen of loop iter args has been altered to generate fewer SSA
dependencies between the prologue, inner loop, and epilogue, making it
more likely for pipelining to be successful. E.g., inner loop iter args
that can be initialized outside the loop and reset in the epilogue are
handled that way, rather than in the prologue.
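
The fused execution order can be sketched in plain Python (a toy model of the transformation, not Triton's implementation; the function and callback names are made up). A 2-level nest is flattened into a single loop, with a separate "speculated" branch for the case where the inner loop never executes, so the inner body does not sit inside a conditional in the hot loop:

```python
def fused_loop_nest(outer_n, inner_n, prologue, inner_body, epilogue):
    """Run the nest `for i: prologue; for j: inner_body; epilogue`
    as one flattened loop of outer_n * inner_n iterations."""
    if inner_n <= 0:
        # Speculated branch: the inner loop never runs, so execute the
        # nest without the fused loop at all.
        for i in range(outer_n):
            prologue(i)
            epilogue(i)
        return
    # Speculated branch: inner_n >= 1, so every outer iteration runs the
    # inner body at least once and the nest can be flattened.
    for t in range(outer_n * inner_n):
        i, j = divmod(t, inner_n)
        if j == 0:
            prologue(i)           # first inner iteration: run the prologue
        inner_body(i, j)
        if j == inner_n - 1:
            epilogue(i)           # last inner iteration: run the epilogue

# Example: log the execution order of a 2x3 nest.
trace = []
fused_loop_nest(
    2, 3,
    prologue=lambda i: trace.append(f"pro{i}"),
    inner_body=lambda i, j: trace.append(f"body{i}{j}"),
    epilogue=lambda i: trace.append(f"epi{i}"),
)
print(trace)
```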

Some other things in this PR:

* Fixed a bug in the pipeline expander
* Added AxisInfo implementation for `ub::PoisonOp`

I verified the performance of the rewritten persistent matmul kernels on
H100 and Blackwell.

Performance of `09-persistent-matmul.py` on H100.

Before (2 runs)

```
root@dev-0:~/code/triton$ python python/tutorials/09-persistent-matmul.py
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
273.146 4025.362 ROOT
├─ nan 0.031 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.027 _ZN2at6native54_GLOBAL__N__a236ace4_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfLm4EPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SO_SH_EEvSJ_SK_RKSL_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SK_
├─ 283.506 2666.310 cublas [M=8192, N=8192, K=512]
│  └─ nan 2666.310 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
├─ 223.326 307.709 matmul_kernel [M=8192, N=8192, K=512]
├─ 259.293 265.027 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 238.500 288.133 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 258.738 265.594 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 295.529 232.531 torch [M=8192, N=8192, K=512]
   └─ nan 232.531 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas

Legend (Metric: tflop16/s (inc) Min: 223.33 Max: 295.53)
█ 288.31 - 295.53
█ 273.87 - 288.31
█ 259.43 - 273.87
█ 244.99 - 259.43
█ 230.55 - 244.99
█ 223.33 - 230.55

name User code    ◀  Only in left graph    ▶  Only in right graph

root@dev-0:~/code/triton$ python python/tutorials/09-persistent-matmul.py
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
273.367 4022.105 ROOT
├─ nan 0.031 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.027 _ZN2at6native54_GLOBAL__N__a236ace4_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfLm4EPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SO_SH_EEvSJ_SK_RKSL_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SK_
├─ 284.284 2659.011 cublas [M=8192, N=8192, K=512]
│  └─ nan 2659.011 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
├─ 221.823 309.795 matmul_kernel [M=8192, N=8192, K=512]
├─ 254.755 269.748 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 240.774 285.411 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 259.109 265.214 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 295.100 232.868 torch [M=8192, N=8192, K=512]
   └─ nan 232.868 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas

Legend (Metric: tflop16/s (inc) Min: 221.82 Max: 295.10)
█ 287.77 - 295.10
█ 273.12 - 287.77
█ 258.46 - 273.12
█ 243.81 - 258.46
█ 229.15 - 243.81
█ 221.82 - 229.15

name User code    ◀  Only in left graph    ▶  Only in right graph

```

After:

```
root@dev-0:~/code/triton$ python python/tutorials/09-persistent-matmul.py
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
274.040 4012.227 ROOT
├─ nan 0.031 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.027 _ZN2at6native54_GLOBAL__N__a236ace4_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfLm4EPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SO_SH_EEvSJ_SK_RKSL_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SK_
├─ 285.369 2648.904 cublas [M=8192, N=8192, K=512]
│  └─ nan 2648.904 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
├─ 217.548 315.881 matmul_kernel [M=8192, N=8192, K=512]
├─ 262.312 261.976 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 244.740 280.785 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 255.113 269.368 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 292.108 235.253 torch [M=8192, N=8192, K=512]
   └─ nan 235.253 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas

Legend (Metric: tflop16/s (inc) Min: 217.55 Max: 292.11)
█ 284.65 - 292.11
█ 269.74 - 284.65
█ 254.83 - 269.74
█ 239.92 - 254.83
█ 225.00 - 239.92
█ 217.55 - 225.00

name User code    ◀  Only in left graph    ▶  Only in right graph

root@dev-0:~/code/triton$ python python/tutorials/09-persistent-matmul.py
M=32, N=32, K=32 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
M=8192, N=8192, K=512 verification naive vs: torch: ✅ cublas: ✅ persistent: ✅ TMA persistent: ✅ Tensor descriptor persistent: ✅
274.997 3998.267 ROOT
├─ nan 0.031 _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_22gpu_kernel_impl_nocastIZZZNS0_23direct_copy_kernel_cudaERNS_18TensorIteratorBaseEENKUlvE1_clEvENKUlvE8_clEvEUlN3c104HalfEE_EEvS4_RKT_EUliE_EEviT1_
├─ nan 0.027 _ZN2at6native54_GLOBAL__N__a236ace4_21_DistributionNormal_cu_0c5b6e8543distribution_elementwise_grid_stride_kernelIfLi4EZNS0_9templates4cuda20normal_and_transformIN3c104HalfEfLm4EPNS_17CUDAGeneratorImplEZZZNS4_13normal_kernelIS9_EEvRKNS_10TensorBaseEddT_ENKUlvE_clEvENKUlvE1_clEvEUlfE_EEvRNS_18TensorIteratorBaseET2_T3_EUlP24curandStatePhilox4_32_10E0_ZNS1_27distribution_nullary_kernelIS7_fLi4ES9_SO_SH_EEvSJ_SK_RKSL_T4_EUlifE_EEviNS_15PhiloxCudaStateET1_SK_
├─ 285.498 2647.706 cublas [M=8192, N=8192, K=512]
│  └─ nan 2647.706 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas
├─ 217.884 315.394 matmul_kernel [M=8192, N=8192, K=512]
├─ 262.534 261.755 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=512]
├─ 246.617 278.649 matmul_kernel_persistent [M=8192, N=8192, K=512]
├─ 262.525 261.764 matmul_kernel_tma_persistent [M=8192, N=8192, K=512]
└─ 295.007 232.942 torch [M=8192, N=8192, K=512]
   └─ nan 232.942 sm90_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_off_kernel__5x_cublas

Legend (Metric: tflop16/s (inc) Min: 217.88 Max: 295.01)
█ 287.29 - 295.01
█ 271.87 - 287.29
█ 256.45 - 271.87
█ 241.02 - 256.45
█ 225.60 - 241.02
█ 217.88 - 225.60

name User code    ◀  Only in left graph    ▶  Only in right graph

```

[Blackwell] Hoist constant TMem allocation out of the loop (#5857)

When the TMem contents are constant, hoisting the allocation out of the
loop avoids having to store the value multiple times.
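
The effect can be sketched in plain Python (a hypothetical illustration of loop-invariant hoisting, not the actual pass; the buffer here stands in for the TMem allocation):

```python
def before(n, k, const_val):
    """Allocate and fill the buffer on every iteration."""
    stores = 0
    for _ in range(n):
        buf = [const_val] * k    # allocated and stored each iteration
        stores += k
        # ... use buf ...
    return stores

def after(n, k, const_val):
    """Hoisted: allocate and fill the constant buffer once."""
    buf = [const_val] * k        # stored once, outside the loop
    stores = k
    for _ in range(n):
        pass                     # ... use buf ...
    return stores

print(before(8, 4, 1.0), after(8, 4, 1.0))  # → 32 4
```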

[AMD] Use LLVM ops for fp16<->fp32 casts (#5859)

Inline assembly can block optimizations in the LLVM backend.

[BW][PIPELINE] Add an option to tl.range to disallow accumulator multi-buffering (#5858)

Rework mmav5 pipelining to allow pipelining the MMA even when
multi-buffering of the accumulator is impossible, by putting the uses in
the same stage as the MMA and blocking on a wait until the current MMA
finishes. Based on this support, introduce a new flag to `tl.range` that
controls whether multi-buffering of the accumulators of the dots in the
loop is allowed. Without this rework of mmav5 pipelining, we would
simply not pipeline cases where multi-buffering is disallowed.
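
The trade-off can be modeled with a toy timeline in plain Python (an illustration only; the names and the stall model are made up, not Triton internals): with two accumulator buffers, the next MMA can be issued while an earlier result is still unread, whereas a single buffer forces a wait for the pending use before each reuse of the slot.

```python
def timeline(num_iters, num_acc_buffers):
    """Return (events, stalls): an MMA issue stalls while every
    accumulator buffer still holds an unread result."""
    busy = []                        # iterations whose result is unread
    events, stalls = [], 0
    for i in range(num_iters):
        while len(busy) >= num_acc_buffers:
            oldest = busy.pop(0)     # stall: wait for the oldest use
            events.append(f"use{oldest}")
            stalls += 1
        events.append(f"mma{i}")
        busy.append(i)
    for i in busy:                   # drain the remaining uses
        events.append(f"use{i}")
    return events, stalls

multi, stalls_multi = timeline(4, 2)    # multi-buffered accumulator
single, stalls_single = timeline(4, 1)  # multi-buffering disallowed
print(stalls_multi, stalls_single)      # → 2 3
```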

[AMD] NFC: Refactor DotOpMFMAConversionHelper (#5862)

This PR refactors `DotOpMFMAConversionHelper` by extracting utility
functions from `convertDot`, making it easier to extend in
#5845.

[AMD][CI] Switch to rocm hosted docker image (#5855)

The CI docker image has been moved from `rocmshared/pytorch` to
`rocm/pytorch` so we need to adjust our workflow.

[triton][NFC] Refactor build function of ScanOp (#5861)

Use ScanOp's own build function instead of reusing ReduceOp's.


- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)
…ed (#5812)

This PR enables MMA pipelining for scaled dot.

The main difficulty this PR overcomes is the dependency cycle between
TMEM copy rewriting and SWP - currently TMEM copy rewriting relies on
SWP to put loading of scales into SMEM, while to apply MMA pipelining
during SWP, TMEM copy rewriting needs to have happened beforehand. I
propose to break the cycle by having loading of scales go through
`local_alloc` and `local_load` in `AccelerateMatmul`. This way, TMEM
copy rewriting happens during [the first call to
OptimizedDotOperands,](https://github.com/triton-lang/triton/blob/1e0e51c4aeb3e1beea000da5d0e494f8b9ac40dd/third_party/nvidia/backend/compiler.py#L260)
before SWP. And the local alloc and load added in `AccelerateMatmul` are
eliminated during SWP. It's a bit ad hoc to add local alloc for scales
there, since scales do not need to be in SMEM. But other solutions, like
decoupling MMA pipelining from SWP, is more difficult.

The other changes in this PR are for making SWP recognize loading of
scales when there is TMEM copy between scale load and MMA.

@ThomasRaoux @pawelszczerbuk @csullivan @mbrookhart @binarybana

---------

Co-authored-by: Masahiro Masuda <[email protected]>
Co-authored-by: Jason Knight <[email protected]>
@Jokeren Jokeren force-pushed the keren/proton-dev-gpu branch from 04b6c94 to f71c805 Compare February 10, 2025 16:33
pawelszczerbuk and others added 2 commits February 10, 2025 11:33
…5810)

Always pipeline wgmma operands in registers if they are not suitable
for shmem pipelining.
This introduces a performance regression on attention workloads, but
fixes a crash introduced by
#5798.
Make sure we allocate the right number of slices when doing mmav3
pipelining.
@Jokeren Jokeren closed this Feb 10, 2025
@Jokeren Jokeren deleted the keren/proton-dev-gpu branch February 10, 2025 16:45