[ROCm][V0][Attention] Revert to the previous FA triton kernel #18226
Conversation
Signed-off-by: Gregory Shtrasberg <[email protected]>
+1 @gshtras, I noticed that prefill takes a very long time (several seconds for a very short sequence) when comparing e3d0a1d and 6aae216 (just before #15734 and #12591). Since rerunning the same sequence is sufficiently fast, as it was before #15734 and #12591, I suspect Triton autotuning is running after these Triton updates. This is prohibitive when running lm-eval-harness, for example, or in real-world scenarios. I thought CUDA graphs (if padding is applied to captured shapes) would help, but it seems they do not.

This is a bit worrying, as #12591 is included in 0.9.0, so vLLM is now very slow by default on CDNA3 platforms. Reverting for now sounds good.

Besides, between #15734 and #12591, the Triton FA code path in ROCmBackend is broken as

Is there a ROCm CI and/or performance tracking that we can follow for regressions like this?
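For what it's worth, a minimal sketch of how to expose this: time the first (cold) prefill against warm reruns of the same prompt. The model name and prompt are placeholders; this assumes a ROCm build of vLLM with the Triton FA backend active.

```python
# Hedged sketch: compare cold vs. warm prefill latency. If the first
# run is orders of magnitude slower than the repeats, per-shape Triton
# autotuning (not steady-state kernel speed) is the likely culprit.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(max_tokens=1)  # 1 output token -> prefill-dominated

for i in range(3):
    start = time.perf_counter()
    llm.generate(["Hello, world"], params)
    print(f"run {i}: {time.perf_counter() - start:.3f}s")
```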
cc @mgoin FYI. |
cc @SageMoore @ProExpertProg as I don't have context on the original change |
Could we extract the changes here into a different file? That way the fixes to the kernel currently on main can happen in parallel.
@ProExpertProg
As discussed, this is a temporary V0 kernel, as V0 is getting deprecated soon anyway.
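To illustrate the file-split suggestion above, here is a hedged sketch of what keeping both kernels side by side could look like. The module names and the environment variable are hypothetical, not vLLM's actual layout.

```python
# Hypothetical sketch of keeping the reverted kernel in its own file
# so fixes to the kernel on main can land in parallel. The module
# names and the env var below are illustrative only.
import os

if os.environ.get("VLLM_USE_LEGACY_TRITON_FA", "1") == "1":
    # Temporary V0 path: the reverted, known-good kernel.
    from vllm.attention.ops import triton_flash_attention_legacy as fa  # hypothetical module
else:
    # Kernel from #12591, to be fixed in parallel on main.
    from vllm.attention.ops import triton_flash_attention as fa

triton_attention = fa.triton_attention  # single call site stays unchanged
```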
…roject#18226) Signed-off-by: Gregory Shtrasberg <[email protected]> Signed-off-by: amit <[email protected]>
Revert to the previous version of the Triton attention kernel, modified to support FP8 computation.
The kernel introduced in #12591 turned out to have performance issues and broken support for FP8 quantized models.
Until that is resolved, we want to replace it with the performant version from the ROCm fork.
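As a rough illustration of what "modified to support FP8 computation" means at the math level, here is an eager-mode reference with per-tensor descale factors. This is an assumption-laden sketch, not the Triton kernel's actual code; the function name and structure are made up for illustration.

```python
# Hedged reference sketch: per-tensor FP8 (e4m3) dequantize-then-attend.
# A fused kernel would apply the descales inside its matmuls instead;
# names and structure here are illustrative only.
import torch

def fp8_attention_reference(q8, k8, v8, q_scale, k_scale, v_scale):
    # Dequantize FP8 inputs with per-tensor descale factors.
    q = q8.to(torch.float32) * q_scale
    k = k8.to(torch.float32) * k_scale
    v = v8.to(torch.float32) * v_scale
    # Standard scaled dot-product attention in full precision.
    scores = (q @ k.transpose(-1, -2)) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v
```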