[Model] Switch to Fused RMS norm in Qwen2.5_VL model. #22184
Conversation
Signed-off-by: kf <[email protected]>
Signed-off-by: tjtanaavllm <[email protected]>
Signed-off-by: vllmellm <[email protected]>
Code Review
This pull request introduces a performance optimization in Qwen2_5_VisionBlock by switching to a fused RMS normalization kernel. The change refactors the forward pass to use the fused_add_rms_norm operation, which combines the residual addition and layer normalization into a single kernel. This is a good optimization: it preserves the original logic while potentially improving performance by reducing kernel launch overhead and memory traffic. The implementation is clean, correct, and well contained within the vision block.
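For reference, a minimal sketch of what the fused op computes, assuming the standard RMSNorm formulation (this is illustrative reference code, not vLLM's kernel):

```python
import torch

# Sketch of fused_add_rms_norm semantics: the residual addition and the
# RMS normalization of the sum happen in one kernel instead of two ops.
def fused_add_rms_norm_ref(x: torch.Tensor,
                           residual: torch.Tensor,
                           weight: torch.Tensor,
                           eps: float = 1e-6):
    residual = x + residual                               # residual addition
    var = residual.float().pow(2).mean(-1, keepdim=True)  # mean of squares
    out = residual.float() * torch.rsqrt(var + eps)       # RMS-normalize
    return out.to(x.dtype) * weight, residual             # (normed, new residual)
```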
LGTM
CC @wuhuikx
Purpose
This update introduces support for fused RMSNorm in the Qwen2.5 vision-language model.
Switching to a fused RMS normalization implementation simplifies the computation trace and enables further optimizations for greater performance in the future. These may include eliminating unnecessary intermediate tensors, decreasing global memory traffic, and improving GPU utilization.
Importantly, these changes do not degrade model accuracy or speed.
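As a rough illustration of the change (a sketch of the pattern, not the exact diff; attention-call arguments are elided), the vision block routes its residual through RMSNorm so the add is fused into the norm kernel:

```python
# Before: residual adds and norms run as separate ops per sub-layer.
def forward_before(self, x, **attn_kwargs):
    x = x + self.attn(self.norm1(x), **attn_kwargs)
    x = x + self.mlp(self.norm2(x))
    return x

# After (sketch): vLLM's RMSNorm, when given a residual, dispatches to
# fused_add_rms_norm and returns (normalized output, updated residual).
def forward_after(self, x, **attn_kwargs):
    x_attn = self.attn(self.norm1(x), **attn_kwargs)
    # norm2(x + x_attn) and the residual add happen in one fused kernel
    x_norm, residual = self.norm2(x_attn, residual=x)
    x = residual + self.mlp(x_norm)
    return x
```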
Benchmark results on Qwen/Qwen2.5-VL-7B-Instruct:
Test Plan
Use the mistral-evals repo for accuracy evaluation.
Step 1: serve the model.
VLLM_ROCM_USE_AITER=1 \
SAFETENSORS_FAST_GPU=1 \
MIOPEN_FIND_MODE=FAST \
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --trust_remote_code -tp 4
Step 2: run the evaluation against the running server.
python3 -m eval.run eval_vllm --model_name Qwen/Qwen2.5-VL-7B-Instruct --url http://0.0.0.0:8000
Test Results
Eval results on the Qwen/Qwen2.5-VL-7B-Instruct model.
Before changes:
{
"explicit_prompt_relaxed_correctness": 0.8644,
"anywhere_in_answer_relaxed_correctness": 0.8644
}
After changes:
{
"explicit_prompt_relaxed_correctness": 0.8644,
"anywhere_in_answer_relaxed_correctness": 0.8644
}