[Model] Switch to Fused RMS norm in Qwen2.5_VL model. #22184
Conversation
Signed-off-by: kf <[email protected]>
Signed-off-by: tjtanaavllm <[email protected]>
Signed-off-by: vllmellm <[email protected]>
Code Review
This pull request introduces a performance optimization in Qwen2_5_VisionBlock by switching to a fused RMS normalization kernel. The change refactors the forward pass to use the fused_add_rms_norm operation, which combines the residual addition and layer normalization into a single kernel. This is a good optimization: it preserves the original logic while potentially improving performance by reducing kernel launch overhead and memory traffic. The implementation is clean, correct, and well contained within the vision block.
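For reference, a minimal sketch of what the fused op computes, assuming the standard RMSNorm formulation (this is illustrative reference code, not vLLM's kernel):

```python
import torch

# Sketch of fused_add_rms_norm semantics: the residual addition and the
# RMS normalization of the sum happen in one kernel instead of two ops.
def fused_add_rms_norm_ref(x: torch.Tensor,
                           residual: torch.Tensor,
                           weight: torch.Tensor,
                           eps: float = 1e-6):
    residual = x + residual                               # residual addition
    var = residual.float().pow(2).mean(-1, keepdim=True)  # mean of squares
    out = residual.float() * torch.rsqrt(var + eps)       # RMS-normalize
    return out.to(x.dtype) * weight, residual             # (normed, new residual)
```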
LGTM
CC @wuhuikx
Purpose
This update introduces support for fused RMSNorm in the Qwen2.5 vision-language model.
Switching to a fused RMS normalization implementation simplifies the computation trace and enables further optimizations for greater performance in the future. These may include eliminating unnecessary intermediate tensors, decreasing global memory traffic, and improving GPU utilization.
Importantly, these changes do not degrade model accuracy or speed.
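As a rough illustration of the change (a sketch of the pattern, not the exact diff; attention-call arguments are elided), the vision block routes its residual through RMSNorm so the add is fused into the norm kernel:

```python
# Before: residual adds and norms run as separate ops per sub-layer.
def forward_before(self, x, **attn_kwargs):
    x = x + self.attn(self.norm1(x), **attn_kwargs)
    x = x + self.mlp(self.norm2(x))
    return x

# After (sketch): vLLM's RMSNorm, when given a residual, dispatches to
# fused_add_rms_norm and returns (normalized output, updated residual).
def forward_after(self, x, **attn_kwargs):
    x_attn = self.attn(self.norm1(x), **attn_kwargs)
    # norm2(x + x_attn) and the residual add happen in one fused kernel
    x_norm, residual = self.norm2(x_attn, residual=x)
    x = residual + self.mlp(x_norm)
    return x
```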
Benchmark results on Qwen/Qwen2.5-VL-7B-Instruct:
Test Plan
Use the mistral-evals repo for accuracy evaluation.
Step 1: serve the model.
VLLM_ROCM_USE_AITER=1 \
SAFETENSORS_FAST_GPU=1 \
MIOPEN_FIND_MODE=FAST \
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --trust_remote_code -tp 4
Step 2: run the evaluation against the running server.
python3 -m eval.run eval_vllm --model_name Qwen/Qwen2.5-VL-7B-Instruct --url http://0.0.0.0:8000
Test Results
Eval results on the Qwen/Qwen2.5-VL-7B-Instruct model.
Before changes:
{
"explicit_prompt_relaxed_correctness": 0.8644,
"anywhere_in_answer_relaxed_correctness": 0.8644
}
After changes:
{
"explicit_prompt_relaxed_correctness": 0.8644,
"anywhere_in_answer_relaxed_correctness": 0.8644
}