Support dynamic masks in splash attention #25213
Conversation
data_next=data_next,
mask_next=mask_next,
block_mask=block_mask,
partial_mask_blocks=partial_mask_blocks,
Does partial_mask_blocks get stored in scalar memory? What happens if we exceed the size of scalar memory? More generally, how would this handle, e.g., a 100k+ length context where partial_mask_blocks is quite large? (Since we can't do deduplication for dynamic arrays.)
Some possible things that might help:
- Support int8 for partial_mask_blocks since its entries are all 0/1, or maybe even packed bool values.
- Have some sort of sharding support for partial_mask_blocks.
- Probably the most "comprehensive" solution would be to support ComputableMask (or perhaps another sibling class) having jax.tree_util.Partial() as the callable, with the ability to specify sharding information for the arrays stored in the partial object. This would even allow users to implement support for the first bullet point themselves, avoiding the complexity of supporting it in the kernel directly. (A rough sketch of this idea follows below.)
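A rough, purely illustrative sketch of the Partial-based idea: the band_mask_fn callable and its captured arrays are made up for this example; only the generic jax.tree_util / jax.sharding machinery is real.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical mask callable: per-row band bounds are captured as arrays.
def band_mask_fn(band_starts, band_ends, q_ids, kv_ids):
  return (kv_ids >= band_starts[q_ids]) & (kv_ids < band_ends[q_ids])

q_seq_len = 8192
band_starts = jnp.zeros((q_seq_len,), dtype=jnp.int32)
band_ends = jnp.minimum(jnp.arange(q_seq_len, dtype=jnp.int32) + 1, 1024)

# jax.tree_util.Partial is itself a pytree, so the captured arrays are
# ordinary leaves and can carry sharding like any other input.
mask_callable = jax.tree_util.Partial(band_mask_fn, band_starts, band_ends)

# Attach sharding to the captured arrays (assumes q_seq_len is divisible
# by the number of devices).
mesh = Mesh(np.array(jax.devices()), ("q",))
mask_callable = jax.device_put(mask_callable, NamedSharding(mesh, P("q")))

# The kernel (or its preprocessing) could then evaluate one tile at a time:
q_ids = jnp.arange(512)[:, None]
kv_ids = jnp.arange(512)[None, :]
tile = mask_callable(q_ids, kv_ids)  # bool[512, 512]
```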
partial_mask_blocks is in HBM; scalar memory is tiny on TPUs. Using mask_next, the right mask block is prefetched into VMEM for each kernel invocation. You still need to fit partial_mask_blocks in HBM, so sharding and using smaller data types help here.
We're running into a known Mosaic edge case by using int8/bools, but people are working on it. One workaround for now is to store the data in a smaller dtype and upcast to int32 later.
Thanks for the comments, let me think some more about sharding and get back to you.
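A minimal sketch of that workaround, with made-up shapes and function names (apply_mask_block is not the kernel's actual code):

```python
import jax.numpy as jnp

# Illustrative shape: (num_mask_blocks, q_block_size, kv_block_size).
partial_mask_blocks = jnp.ones((16, 512, 512), dtype=jnp.int32)

# Store the materialized blocks in int8 in HBM; entries are only 0/1.
partial_mask_blocks_i8 = partial_mask_blocks.astype(jnp.int8)

# Upcast to int32 only after the block selected by mask_next has been
# prefetched into VMEM, sidestepping the int8/bool Mosaic edge case.
def apply_mask_block(scores, mask_block_i8):
  mask_block = mask_block_i8.astype(jnp.int32)
  return jnp.where(mask_block == 1, scores, jnp.finfo(scores.dtype).min)

scores = jnp.zeros((512, 512), dtype=jnp.float32)
masked = apply_mask_block(scores, partial_mask_blocks_i8[0])
```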
I just wanted to give a gentle ping about this.
Thanks for your patience. I'll add sharding support in a follow-up PR to unblock you for now. Specifically, sharding partial_mask_blocks and the MaskInfo. Stay tuned!
Thanks a lot!
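Not part of this PR, but as a rough illustration of that follow-up direction, sharding the materialized blocks could look something like the sketch below (the mesh, axis name, and shapes are assumptions):

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Illustrative shape: (q_blocks, kv_blocks, q_block_size, kv_block_size).
partial_mask_blocks = jnp.ones((16, 16, 512, 512), dtype=jnp.int8)

# Shard the leading q-block dimension so each device only holds its slice of
# the mask in HBM (assumes the device count divides the q-block count).
mesh = Mesh(np.array(jax.devices()), ("q",))
partial_mask_blocks = jax.device_put(
    partial_mask_blocks, NamedSharding(mesh, P("q"))
)
```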
Branch updated from ada8002 to 02338e2.
Very nice!
mask_next = jnp.where(
    jnp.logical_or(is_empty_mask, is_full_mask),
    0,
Leave TODO/comment explaining choice of 0
Done.
.swapaxes(-1, -2)
.reshape(*block_mask_shape, kv_block_size, q_block_size)
.swapaxes(-1, -2)
.astype(np.int32)
It looks like this needs to be updated to bool for jax 0.4.39 compatibility? I'm not sure if any other changes might be needed too.
Also, would the existing tests catch this dtype issue if they were rerun today?
Rebased. The tests did indeed catch this.
jax/experimental/pallas/ops/tpu/splash_attention/splash_attention_mask_info.py
if downcast_smem_data:
  block_mask = block_mask.astype(np.int8)  # values are in the range [0, 1, 2]
  data_next = _downcast(data_next, kv_seq_len if is_dkv else q_seq_len)
Should this be kv_blocks_count and q_blocks_count, respectively?
Good catch, done.
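For context, here's an approximate sketch of what a bound-based downcast does (downcast_to_fit is an illustrative stand-in, not the actual _downcast); since data_next holds block indices, bounding by the block count rather than the sequence length lets it pick a narrower dtype:

```python
import numpy as np

def downcast_to_fit(array: np.ndarray, upper_bound: int) -> np.ndarray:
  """Casts to the narrowest signed integer dtype that can hold upper_bound."""
  if upper_bound <= np.iinfo(np.int8).max:
    return array.astype(np.int8)
  if upper_bound <= np.iinfo(np.int16).max:
    return array.astype(np.int16)
  return array.astype(np.int32)

# data_next stores *block* indices, so the right bound is the number of kv
# (or q) blocks, not the sequence length. E.g. a 128k context with 512-wide
# blocks has only 256 blocks, which fits in int16; the raw sequence length
# would force int32.
data_next = np.arange(256, dtype=np.int32)
print(downcast_to_fit(data_next, 256).dtype)  # int16
```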
data_next=data_next,
mask_next=mask_next,
block_mask=block_mask,
partial_mask_blocks=partial_mask_blocks,
I just wanted to give a gentle ping about this.
Adding support for dynamic masks in the splash attention kernel.
Currently, splash attention expects a static mask. It's preprocessed, and only the interesting (not fully masked) parts of the mask are passed to the kernel. This change allows users to pass in a jax.Array instead. Since we can't know the number of partial mask blocks at trace time, the entire mask is materialized in partial_mask_blocks.
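To make the trade-off concrete, here's a rough sketch (shapes and block sizes are illustrative) of what materializing a dense jax.Array mask into per-block form amounts to:

```python
import jax.numpy as jnp

q_seq_len, kv_seq_len = 8192, 8192
q_block_size, kv_block_size = 512, 512

# A dynamic mask is just a boolean jax.Array; its contents are unknown at
# trace time, so no blocks can be dropped or deduplicated up front.
mask = jnp.tril(jnp.ones((q_seq_len, kv_seq_len), dtype=jnp.bool_))

# Reshape into (q_blocks, kv_blocks, q_block_size, kv_block_size). Every
# block is kept, which is why the whole mask lands in partial_mask_blocks
# and the HBM footprint grows quadratically with context length.
partial_mask_blocks = mask.reshape(
    q_seq_len // q_block_size, q_block_size,
    kv_seq_len // kv_block_size, kv_block_size,
).swapaxes(1, 2)

print(partial_mask_blocks.shape)  # (16, 16, 512, 512)
```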