[Performance][ROCm] Add skinny gemms for unquantized linear on ROCm #15830

charlifu · 2025-03-31T17:21:13Z

This PR adds skinny gemms for unquantized linear (bf16 and fp16) op on ROCm to achieve better performance when the batch size is <= 2.

| | bs=2, in=32, out=128, tp=8 | bs=1, in=32, out=128, tp=8 |
|---|---|---|
| VLLM_USE_ROCM_SKINNY_GEMM=0 | 1.4460096444313726 seconds | 1.4118467693217098 seconds |
| VLLM_USE_ROCM_SKINNY_GEMM=1 | 1.3444539802459379 seconds | 1.2891838241368532 seconds |

Llama 3.1 70b, fp16
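
To put the numbers above in relative terms, here is a small sketch that computes the latency reduction for each configuration. The latencies are copied from the benchmark above; the helper name `latency_reduction` is only illustrative and is not part of the PR.

```python
# Latencies in seconds, copied from the benchmark above (Llama 3.1 70b, fp16).
latency = {
    "bs=2": {"off": 1.4460096444313726, "on": 1.3444539802459379},
    "bs=1": {"off": 1.4118467693217098, "on": 1.2891838241368532},
}


def latency_reduction(off: float, on: float) -> float:
    """Fraction of latency removed when the skinny GEMM path is enabled."""
    return (off - on) / off


reductions = {
    name: latency_reduction(t["off"], t["on"]) for name, t in latency.items()
}

for name, reduction in reductions.items():
    print(f"{name}: {reduction:.1%} lower latency")
```

This works out to about 7.0% for bs=2 and about 8.7% for bs=1.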

Signed-off-by: charlifu <[email protected]>

github-actions · 2025-03-31T17:21:22Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

DarkLight1337 · 2025-04-01T06:48:11Z

Can you merge from main to fix the Docker build issue?

SageMoore

Just a first pass. I haven't gone through the kernel. Can you add some unit tests that will exercise this kernel?

SageMoore · 2025-04-01T15:46:57Z

vllm/envs.py

             ("true", "1")),

+    # use rocm skinny gemms
+    "VLLM_ROCM_USE_SKINNY_GEMM":


I'm somewhat hesitant to have this on by default. It looks like it only gives modest gains in low batch scenarios? Is this generally true or just for llama?

We discussed this a bit offline. Since this has been on by default in the rocm fork, which is deployed in customer environments, for some time, I'm fine with having it on by default.
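
For readers following the on-by-default discussion, here is a minimal sketch of how a boolean flag like this could be read. Only the variable name `VLLM_ROCM_USE_SKINNY_GEMM` and the `("true", "1")` tuple come from the diff above; the helper `rocm_skinny_gemm_enabled` and the `"True"` default are assumptions made for illustration, not the actual `vllm/envs.py` code.

```python
import os
from typing import Mapping


def rocm_skinny_gemm_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical helper: read the flag, treating it as on when unset."""
    # Assumption: the default "True" reflects the on-by-default decision above.
    value = environ.get("VLLM_ROCM_USE_SKINNY_GEMM", "True")
    return value.lower() in ("true", "1")


print(rocm_skinny_gemm_enabled({}))
print(rocm_skinny_gemm_enabled({"VLLM_ROCM_USE_SKINNY_GEMM": "0"}))
```

With this shape, a user who sees a regression can opt out by setting the variable to `0` without a code change.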

mergify · 2025-04-08T02:20:22Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @charlifu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

charlifu · 2025-04-08T15:27:12Z

@SageMoore Unit test added. Let me know if you want more shapes added to the test. BTW, for the dispatch logic that selects between the different gemm methods, I am copying #14916. We might need to merge the aiter code path with this PR.
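
As a rough illustration of what dispatch logic of this kind can look like, here is a sketch in plain Python. It is not the code from this PR or from #14916: `LinearCall`, `select_gemm`, and the `max_skinny_batch` parameter are hypothetical names, and the conditions (ROCm only, fp16/bf16 only, batch size <= 2) are taken from the PR description rather than from the kernel code.

```python
from dataclasses import dataclass

# fp16 and bf16 are the dtypes named in the PR description.
SKINNY_DTYPES = ("float16", "bfloat16")


@dataclass(frozen=True)
class LinearCall:
    """Hypothetical description of one unquantized linear call."""
    dtype: str
    batch_size: int
    on_rocm: bool


def select_gemm(call: LinearCall,
                skinny_enabled: bool = True,
                max_skinny_batch: int = 2) -> str:
    """Return "skinny" when the call fits the small-batch path, else "default"."""
    if (call.on_rocm
            and skinny_enabled
            and call.dtype in SKINNY_DTYPES
            and 1 <= call.batch_size <= max_skinny_batch):
        return "skinny"
    # Everything else falls back to the regular GEMM.
    return "default"


print(select_gemm(LinearCall("float16", 1, on_rocm=True)))
print(select_gemm(LinearCall("float16", 64, on_rocm=True)))
```

Keeping the decision in one function like this would also give the aiter code path a single place to plug in, if the two PRs are merged.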

tlrmchlsmth · 2025-04-18T17:58:21Z

vllm/model_executor/layers/quantization/utils/w8a8_utils.py

+    # fp8 rowwise scaling in torch._scaled_mm is introduced in
+    # https://github.com/pytorch/pytorch/pull/144432 using
+    # hipBLASLt and ROCm 6.3, which only exists in torch 2.7 and above.
+    # For CUDA platform please validate if the
+    # torch._scaled_mm support rowwise scaled GEMM
+    # Fused GEMM_DQ Rowwise GEMM


I see this is already in vLLM, so it is not a problem with this PR in particular, but why are we landing comments like this one?

Who is the audience for this message? Who is supposed to "please validate if torch._scaled_mm support rowwise scaled GEMM" on CUDA? The user?

mergify · 2025-04-21T15:31:45Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @charlifu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Co-authored-by: Tyler Michael Smith <[email protected]> Signed-off-by: charlifu <[email protected]>

tlrmchlsmth

PR looks good to me.

The kernel test timeout is a known issue, so the PR could be force-merged despite it. I want to make sure the other test failures are not caused by this PR, so I have re-run them. (The lm-eval-small-models test failure is suspicious, but it passes locally for me using this branch.)

charlifu added 11 commits March 24, 2025 23:36

add apply_linear_rocm

2c6fdc0

Signed-off-by: charlifu <[email protected]>

Merge branch 'main' into charlifu/amd_skinny_gemm

ae5e386

add skinny gemm for fp16

f6784a6

Signed-off-by: charlifu <[email protected]>

use wvSplitK

0993ea0

Signed-off-by: charlifu <[email protected]>

add env for skinny gemm

6dfdd5f

Signed-off-by: charlifu <[email protected]>

add bf16 support for llmm1

9aa2059

Signed-off-by: charlifu <[email protected]>

update skinny gemms

16fb48c

Signed-off-by: charlifu <[email protected]>

Merge branch 'main' into charlifu/amd_skinny_gemm

f6cfce5

Signed-off-by: charlifu <[email protected]>

add bf16 wvsplitK

e06862e

Signed-off-by: charlifu <[email protected]>

clean up

5c60d0b

Signed-off-by: charlifu <[email protected]>

Merge branch 'main' into charlifu/amd_skinny_gemm

ff65f9a

charlifu requested a review from tlrmchlsmth as a code owner March 31, 2025 17:21

mergify bot added the ci/build label Mar 31, 2025

add n == 3 case

c017ce1

Signed-off-by: charlifu <[email protected]>

charlifu added 2 commits April 1, 2025 14:48

Merge branch 'main' into charlifu/amd_skinny_gemm

534eaeb

disable fp8 gemm padding for rocm

76f8172

Signed-off-by: charlifu <[email protected]>

charlifu requested review from mgoin and robertgshaw2-redhat as code owners April 1, 2025 14:51

SageMoore suggested changes Apr 1, 2025

add wvsplitK fp8 and unit tests

91205a4

Signed-off-by: charlifu <[email protected]>

charlifu requested a review from WoosukKwon as a code owner April 8, 2025 02:19

mergify bot added the needs-rebase label Apr 8, 2025

Merge branch 'main' into charlifu/amd_skinny_gemm

0b6e71b

Signed-off-by: charlifu <[email protected]>

mergify bot removed the needs-rebase label Apr 8, 2025

fix fp8 skinny gemm call

63efd7f

Signed-off-by: charlifu <[email protected]>

fix engine test

5a09506

Signed-off-by: charlifu <[email protected]>

mergify bot removed the needs-rebase label Apr 10, 2025

charlifu added 2 commits April 14, 2025 14:36

Merge branch 'main' into charlifu/amd_skinny_gemm

afc4880

remove cache decorator to fix V1 error

c8c248b

Signed-off-by: charlifu <[email protected]>

robertgshaw2-redhat approved these changes Apr 17, 2025

robertgshaw2-redhat enabled auto-merge (squash) April 17, 2025 12:48

github-actions bot added the ready label Apr 17, 2025

Merge branch 'main' into charlifu/amd_skinny_gemm

176b754

tlrmchlsmth reviewed Apr 18, 2025

auto-merge was automatically disabled April 21, 2025 15:31
Head branch was pushed to by a user without write access

mergify bot added the needs-rebase label Apr 21, 2025

Update vllm/model_executor/layers/quantization/utils/w8a8_utils.py

6535863

Co-authored-by: Tyler Michael Smith <[email protected]> Signed-off-by: charlifu <[email protected]>

charlifu force-pushed the charlifu/amd_skinny_gemm branch from 78b854e to 6535863 Compare April 21, 2025 15:33

Merge branch 'main' into charlifu/amd_skinny_gemm

26f9233

Signed-off-by: charlifu <[email protected]>

mergify bot removed the needs-rebase label Apr 21, 2025

tlrmchlsmth approved these changes Apr 21, 2025

vllm-bot merged commit 188b7f9 into vllm-project:main Apr 22, 2025
66 of 69 checks passed

tjtanaa mentioned this pull request Apr 22, 2025

[FEAT] [ROCm]: Support AITER Linear #14916

Open

lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025

[Performance][ROCm] Add skinny gemms for unquantized linear on ROCm (v…

c244108

…llm-project#15830) Signed-off-by: charlifu <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]>

ckhordiasma mentioned this pull request May 14, 2025

nm vllm ent 0.8.5 sync red-hat-data-services/vllm#139

Merged

charlifu deleted the charlifu/amd_skinny_gemm branch June 4, 2025 16:55
