Conversation

@gshtras (Collaborator) commented on May 15, 2025

Revert to the previous version of the Triton attention kernel, modified to support FP8 computation.
The kernel introduced in #12591 turned out to have performance issues and broken support for FP8-quantized models.
Until that is resolved, we want to replace it with the performant version from the ROCm fork.
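
For readers less familiar with the quantized path, here is a minimal reference sketch (plain PyTorch, not the actual vLLM Triton kernel) of what "FP8 computation" support means for attention: Q/K/V arrive quantized with per-tensor scale factors, and the scales are folded back in around softmax(QK^T)V. The function and argument names (q_scale, k_scale, v_scale) are illustrative assumptions, not the kernel's real signature.

```python
import torch

def fp8_attention_reference(q_fp8, k_fp8, v_fp8,
                            q_scale: float, k_scale: float, v_scale: float,
                            sm_scale: float) -> torch.Tensor:
    # Dequantize to the compute dtype; a fused Triton kernel would instead
    # fold the scales into its dot products without materializing these.
    q = q_fp8.to(torch.float32) * q_scale
    k = k_fp8.to(torch.float32) * k_scale
    v = v_fp8.to(torch.float32) * v_scale
    attn = torch.softmax((q @ k.transpose(-2, -1)) * sm_scale, dim=-1)
    return attn @ v
```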

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gshtras added the rocm (Related to AMD ROCm) label on May 16, 2025
@fxmarty-amd (Contributor) commented on May 19, 2025

+1 @gshtras, I noticed that prefill takes a very long time (several seconds for a very short sequence) when comparing e3d0a1d against 6aae216 (just before #15734 and #12591). Since rerunning the same sequence is sufficiently fast before #15734 and #12591, I suspect Triton autotuning is kicking in after these Triton updates. This is prohibitive when running lm-eval-harness, for example, or in a real-world scenario.
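
For context on the autotuning suspicion, this is how Triton autotuning behaves in general (a generic sketch, not the actual vLLM attention kernel): every listed config is benchmarked the first time the values of the `key` arguments change, so each new prefill shape pays the tuning cost once before reaching steady-state speed.

```python
import triton
import triton.language as tl

# Generic autotuned Triton kernel: all configs are re-benchmarked whenever
# the runtime value of a `key` argument (here n_elements) changes, which is
# why the first call for a new shape can be dramatically slower.
@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 64}, num_warps=4),
        triton.Config({"BLOCK": 128}, num_warps=8),
    ],
    key=["n_elements"],
)
@triton.jit
def double_kernel(x_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    tl.store(out_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2.0, mask=mask)
```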

I thought CUDA graphs (if padding is applied to captured shapes) would help, but they seem not to.

This is a bit worrying, as #12591 is included in 0.9.0, so vLLM is now very slow out of the box on CDNA3 platforms. Reverting for now sounds good.

Besides, between #15734 and #12591, the Triton FA code path in ROCmBackend is broken: full_scales is passed to triton_attention in #15734, but the argument is not added to https://github.com/rasmith/vllm/blob/9229d9aa11515dc037db8d3c4eae691a751d23ac/vllm/attention/ops/triton_flash_attention.py#L701-L714 (as spotted in #17235).
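
To make the failure mode concrete, here is a hypothetical, simplified sketch of that mismatch (the names and signature are illustrative, not the actual code in triton_flash_attention.py): the caller added in #15734 forwards a full_scales keyword that the wrapper does not accept, so the call raises a TypeError.

```python
def triton_attention(q, k, v, causal=False, sm_scale=1.0):
    # Stand-in for the real wrapper, which launches the Triton FA kernel.
    return q

q = k = v = object()  # stand-ins for tensors
try:
    triton_attention(q, k, v, causal=True, sm_scale=0.125,
                     full_scales=(1.0, 1.0, 1.0, 1.0))
except TypeError as err:
    print(err)  # ... got an unexpected keyword argument 'full_scales'
```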

Is there a ROCm CI and/or performance tracking that we can follow for regressions like this?

@fxmarty-amd (Contributor) commented

cc @mgoin FYI.

@mgoin (Member) commented on May 21, 2025

cc @SageMoore @ProExpertProg as I don't have context on the original change

@ProExpertProg (Collaborator) left a comment

Could we extract the changes here into a different file? That way the fixes to the kernel currently on main can happen in parallel

@ProExpertProg (Collaborator) commented on May 21, 2025

> Besides, between #15734 and #12591, the Triton FA code path in ROCmBackend is broken (as spotted in #17235)

Yes, these were supposed to be merged in the opposite order, so there are a few commits between them where the code path is broken.

@gshtras (Collaborator, Author) commented on May 22, 2025

@ProExpertProg
To summarize the offline discussion: the plan is to keep this proposed kernel unchanged until the end of the V0 lifespan, as it mirrors the one that has proven performant in the ROCm fork, so no further changes are planned here.
Also, it does support the FP8 fusion; I looked at the wrong part during the discussion :)

@ProExpertProg (Collaborator) left a comment

As discussed, this is a temporary V0 kernel, as V0 is getting deprecated soon anyway.

@mgoin added the ready (ONLY add when PR is ready to merge/full CI is needed) label on May 27, 2025
@mgoin merged commit 1b7cfd5 into vllm-project:main on May 29, 2025
64 checks passed
@gshtras deleted the revert_triton_kernel branch on May 29, 2025 at 19:35
amitm02 pushed a commit to amitm02/vllm that referenced this pull request on Jun 1, 2025