-
Notifications
You must be signed in to change notification settings - Fork 645
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[LLVMGPUVectorDistribute] Re-arrange nested layouts for better conversions #19437
base: main
Are you sure you want to change the base?
[LLVMGPUVectorDistribute] Re-arrange nested layouts for better conversions #19437
Conversation
conversions Currently, in the layout conversions with shared_memory_conversion we directly read in the destination layout. However, if the maximum vector length is not present in the element tile, it would generate sub-optimal shared memory reads. This commit introduces a new layout for read such that shared memory reads are better vectorized.
int64_t &batchTileLen = batchTile.back(); | ||
int64_t &outerTileLen = outerTile.back(); | ||
// TODO: maybe we should obtain this from somewhere ? | ||
constexpr int64_t maxVecLenBits = 128; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to obtain arch specific information in a pass ?
@qedawkins @Groverkss @raikonenfnu I found a simpler solution to bad lds read problem. I actually did not need to transpose because reduce bank conflict pass in conjuction with the technique here was able to get vectorized non bank conflicting shared memory loads There are two main changes here, which I can potentially break up into two PRs if necessary. |
Very cool! I will have a look later today. |
I need to add tests; Holding that off until I get thumbs up on the approach here. |
@@ -906,17 +903,19 @@ struct DistributeBatchOuterToLayoutConversions final | |||
SmallVector<int64_t> shapeB = layoutB.getDistributedShape(); | |||
int64_t rank = layoutA.getRank(); | |||
|
|||
// Interleave batch and outer dims by transposing. | |||
// Interleave in-thread elements by transposing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok this is generally not right.. thanks @Groverkss point it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works currently because the producer of this is a transfer_read
Im doing v2 of this in the lines of making distribution of |
Currently, in the layout conversions with shared_memory_conversion we directly read in the destination layout. However, if the maximum vector length is not present in the element tile, it would generate sub-optimal shared memory reads.
This commit introduces a new layout for read such that shared memory reads are better vectorized.
[Maybe I can break this into two PRs] it also adds the ability to distribute thread tile when it differs.
After this change, I dont see a perf difference between transpose V attention vs normal attention.
Moreover, it improves performance for both cases.
Obviously, the improvement is much larger for the latter.