[ROCm][V0][Attention] Revert to the previous FA triton kernel #18226
Conversation
Signed-off-by: Gregory Shtrasberg <[email protected]>
+1 @gshtras, I noticed that prefill takes a very long time (several seconds for a very short sequence) when comparing e3d0a1d and 6aae216 (just before #15734 and #12591). Since rerunning the same sequence is sufficiently fast, as it was before #15734 and #12591, I suspect Triton autotuning is running after these Triton updates. This is prohibitive when running lm-eval-harness, for example, or in real-world scenarios. I thought CUDA graphs (if padding is applied to captured shapes) would help, but it seems they do not.

This is a bit worrying, as #12591 is included in 0.9.0, so vLLM is now very slow by default on CDNA3 platforms. Reverting for now sounds good.

Besides, between #15734 and #12591, the Triton FA code path in ROCmBackend is broken as

Is there a ROCm CI and/or performance tracking that we can follow for regressions like this?
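For what it's worth, a minimal sketch of how to expose this: time the first (cold) prefill against warm reruns of the same prompt. The model name and prompt are placeholders; this assumes a ROCm build of vLLM with the Triton FA backend active.

```python
# Hedged sketch: compare cold vs. warm prefill latency. If the first
# run is orders of magnitude slower than the repeats, per-shape Triton
# autotuning (not steady-state kernel speed) is the likely culprit.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(max_tokens=1)  # 1 output token -> prefill-dominated

for i in range(3):
    start = time.perf_counter()
    llm.generate(["Hello, world"], params)
    print(f"run {i}: {time.perf_counter() - start:.3f}s")
```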
cc @mgoin FYI. |
cc @SageMoore @ProExpertProg as I don't have context on the original change |
Could we extract the changes here into a different file? That way the fixes to the kernel currently on main can happen in parallel.
@ProExpertProg
As discussed, this is a temporary V0 kernel, as V0 is getting deprecated soon anyway.
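To illustrate the file-split suggestion above, here is a hedged sketch of what keeping both kernels side by side could look like. The module names and the environment variable are hypothetical, not vLLM's actual layout.

```python
# Hypothetical sketch of keeping the reverted kernel in its own file
# so fixes to the kernel on main can land in parallel. The module
# names and the env var below are illustrative only.
import os

if os.environ.get("VLLM_USE_LEGACY_TRITON_FA", "1") == "1":
    # Temporary V0 path: the reverted, known-good kernel.
    from vllm.attention.ops import triton_flash_attention_legacy as fa  # hypothetical module
else:
    # Kernel from #12591, to be fixed in parallel on main.
    from vllm.attention.ops import triton_flash_attention as fa

triton_attention = fa.triton_attention  # single call site stays unchanged
```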
…roject#18226) Signed-off-by: Gregory Shtrasberg <[email protected]> Signed-off-by: amit <[email protected]>
Revert to the previous version of the Triton attention kernel, modified to support FP8 computation.
The kernel introduced in #12591 turned out to have performance issues and broken support for FP8 quantized models.
Until that is resolved, we want to replace it with the performant version from the ROCm fork.
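As a rough illustration of what "modified to support FP8 computation" means at the math level, here is an eager-mode reference with per-tensor descale factors. This is an assumption-laden sketch, not the Triton kernel's actual code; the function name and structure are made up for illustration.

```python
# Hedged reference sketch: per-tensor FP8 (e4m3) dequantize-then-attend.
# A fused kernel would apply the descales inside its matmuls instead;
# names and structure here are illustrative only.
import torch

def fp8_attention_reference(q8, k8, v8, q_scale, k_scale, v_scale):
    # Dequantize FP8 inputs with per-tensor descale factors.
    q = q8.to(torch.float32) * q_scale
    k = k8.to(torch.float32) * k_scale
    v = v8.to(torch.float32) * v_scale
    # Standard scaled dot-product attention in full precision.
    scores = (q @ k.transpose(-1, -2)) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v
```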