
Conversation

charlifu
Contributor

@charlifu charlifu commented Mar 31, 2025

This PR adds skinny GEMMs for the unquantized linear op (bf16 and fp16) on ROCm to achieve better performance when the batch size is <= 2.

|                              | bs=2, in=32, out=128, tp=8 | bs=1, in=32, out=128, tp=8 |
|------------------------------|----------------------------|----------------------------|
| VLLM_ROCM_USE_SKINNY_GEMM=0  | 1.4460096444313726 seconds | 1.4118467693217098 seconds |
| VLLM_ROCM_USE_SKINNY_GEMM=1  | 1.3444539802459379 seconds | 1.2891838241368532 seconds |

Llama 3.1 70b, fp16
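
For context, below is a minimal sketch of what an env-gated, small-batch skinny-GEMM dispatch like this typically looks like. The names `rocm_skinny_gemm`, `apply_unquantized_linear`, and the batch threshold constant are placeholders for illustration, not the actual kernels or code paths added in this PR.

```python
# Hypothetical sketch of env-gated skinny-GEMM dispatch; not the code from this PR.
import os
from typing import Optional

import torch
import torch.nn.functional as F

USE_SKINNY_GEMM = os.environ.get("VLLM_ROCM_USE_SKINNY_GEMM", "1").lower() in ("true", "1")
SKINNY_GEMM_MAX_BATCH = 2  # assumed threshold, based on the PR description


def rocm_skinny_gemm(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Placeholder for the custom ROCm skinny-GEMM kernel."""
    return x @ weight.t()


def apply_unquantized_linear(x: torch.Tensor, weight: torch.Tensor,
                             bias: Optional[torch.Tensor] = None) -> torch.Tensor:
    # Take the skinny path only for tiny batches and supported dtypes;
    # otherwise fall back to the regular linear path.
    if (USE_SKINNY_GEMM and x.shape[0] <= SKINNY_GEMM_MAX_BATCH
            and x.dtype in (torch.float16, torch.bfloat16)):
        out = rocm_skinny_gemm(x, weight)
        if bias is not None:
            out = out + bias
        return out
    return F.linear(x, weight, bias)
```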

@charlifu charlifu requested a review from tlrmchlsmth as a code owner March 31, 2025 17:21

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which exercises a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Mar 31, 2025
Signed-off-by: charlifu <[email protected]>
@DarkLight1337
Member

Can you merge from main to fix the Docker build issue?

Contributor

@SageMoore SageMoore left a comment


Just a first pass. I haven't gone through the kernel. Can you add some unit tests that will exercise this kernel?

("true", "1")),

# use rocm skinny gemms
"VLLM_ROCM_USE_SKINNY_GEMM":
Contributor


I'm somewhat hesitant to have this on by default. It looks like it only gives modest gains in low batch scenarios? Is this generally true or just for llama?

Contributor


We discussed this a bit offline. Since this has been on by default in the rocm fork, which is deployed in customer environments, for some time, I'm fine with having it on by default.
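
For readers following along, the entry under discussion follows the registration pattern used in vllm/envs.py. The sketch below shows that pattern with the on-by-default setting agreed above; the exact comment text and default string in the PR may differ.

```python
# Sketch of the vllm/envs.py-style environment-variable registration pattern
# discussed above; surrounding entries and exact wording may differ from the PR.
import os

environment_variables = {
    # ... other entries ...

    # Use ROCm skinny GEMMs for small-batch unquantized linear layers.
    # On by default, matching the long-standing default in the ROCm fork.
    "VLLM_ROCM_USE_SKINNY_GEMM":
    lambda: (os.getenv("VLLM_ROCM_USE_SKINNY_GEMM", "True").lower() in
             ("true", "1")),
}
```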

@charlifu charlifu requested a review from WoosukKwon as a code owner April 8, 2025 02:19

mergify bot commented Apr 8, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @charlifu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 8, 2025
@mergify mergify bot removed the needs-rebase label Apr 8, 2025
@charlifu
Contributor Author

charlifu commented Apr 8, 2025

@SageMoore Unit tests added. Let me know if you want to add more shapes to test. BTW, for the dispatch logic that selects between the different GEMM methods, I am copying #14916. We might need to merge the aiter code path with this PR.
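
As a rough illustration of what such a unit test can look like, here is a sketch that compares a skinny-GEMM path against F.linear. `rocm_skinny_gemm` is a placeholder for whichever op the PR actually exposes, and the shapes, dtypes, and tolerances are examples rather than the ones in the added test.

```python
# Hypothetical test sketch; the real test in the PR may use different op names,
# shapes, and tolerances.
import pytest
import torch
import torch.nn.functional as F


def rocm_skinny_gemm(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Placeholder for the custom kernel under test."""
    return x @ weight.t()


@pytest.mark.parametrize("batch", [1, 2])
@pytest.mark.parametrize("in_features,out_features", [(32, 128), (4096, 4096)])
@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16])
def test_skinny_gemm_matches_linear(batch, in_features, out_features, dtype):
    if not torch.cuda.is_available():
        pytest.skip("requires a GPU")
    x = torch.randn(batch, in_features, dtype=dtype, device="cuda")
    w = torch.randn(out_features, in_features, dtype=dtype, device="cuda")
    ref = F.linear(x, w)
    out = rocm_skinny_gemm(x, w)
    # Compare the custom path against the reference linear within a loose tolerance.
    torch.testing.assert_close(out, ref, rtol=1e-2, atol=1e-2)
```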

Signed-off-by: charlifu <[email protected]>
@mergify mergify bot removed the needs-rebase label Apr 10, 2025
@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) April 17, 2025 12:48
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 17, 2025
Comment on lines 202 to 207
# fp8 rowwise scaling in torch._scaled_mm is introduced in
# https://github.com/pytorch/pytorch/pull/144432 using
# hipBLASLt and ROCm 6.3, which only exists in torch 2.7 and above.
# For CUDA platform please validate if the
# torch._scaled_mm support rowwise scaled GEMM
# Fused GEMM_DQ Rowwise GEMM
Member


I see this is already in vLLM, so it's not a problem with this PR in particular, but why are we landing comments like this one?

Who is the audience for this message? Who is supposed to "please validate if torch._scaled_mm support rowwise scaled GEMM" on CUDA? The user?

auto-merge was automatically disabled April 21, 2025 15:31

Head branch was pushed to by a user without write access


mergify bot commented Apr 21, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @charlifu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 21, 2025
@charlifu charlifu force-pushed the charlifu/amd_skinny_gemm branch from 78b854e to 6535863 Compare April 21, 2025 15:33
@mergify mergify bot removed the needs-rebase label Apr 21, 2025
Member

@tlrmchlsmth tlrmchlsmth left a comment


PR looks good to me.

The kernel test timeout is a known issue, and this could be force-merged. I want to make sure the other test failures are not problems with this PR, so I have re-run them. (The lm-eval-small-models test failure is suspicious, but it passes locally for me using this branch.)

@vllm-bot vllm-bot merged commit 188b7f9 into vllm-project:main Apr 22, 2025
66 of 69 checks passed
frieda-huang pushed a commit to frieda-huang/vllm that referenced this pull request Apr 23, 2025
…llm-project#15830)

Signed-off-by: charlifu <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Frieda (Jingying) Huang <[email protected]>
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
adobrzyn pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 30, 2025
…llm-project#15830)

Signed-off-by: charlifu <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Agata Dobrzyniewicz <[email protected]>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
@charlifu charlifu deleted the charlifu/amd_skinny_gemm branch June 4, 2025 16:55