[Bugfix] [Performance] DeepEPHighThroughput + DeepSeek : Quant before Dispatch #21837
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default. Instead, it would only run a small and essential subset of CI tests to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Code Review
This pull request introduces a performance optimization for MoE layers using DeepEPHighThroughput with block quantization (e.g., for DeepSeek models). The change correctly modifies the logic to quantize the activations before dispatching them, which reduces communication overhead and is more efficient.
The implementation is clean and effective. The condition for pre-quantization is correctly expanded to include block-quantized cases, and the call to the quantization kernel is updated to pass the correct parameters, which also fixes a potential bug that the logical change would have otherwise introduced.
Overall, the changes look solid and align well with the stated purpose. I couldn't find any issues of high or critical severity.
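As a rough sketch of the gating logic described above (not the PR's actual code; `per_act_token_quant` and `block_shape` are assumed parameter names mirroring vLLM's fused-MoE quant config):

```python
def should_quantize_before_dispatch(per_act_token_quant: bool,
                                    block_shape: list[int] | None) -> bool:
    # Per-token and block/group quantization produce scales that can be
    # dispatched alongside the fp8 payload. A single per-tensor scale
    # cannot travel with the tokens through DeepEP, so that case still
    # quantizes after dispatch.
    return per_act_token_quant or block_shape is not None
```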
@tlrmchlsmth @bnellnm PTAL! Thanks 🙌
So we still go down the "quantize after" codepath if the quantization is per-tensor? Is there some reason that quantization can't happen beforehand in that case also? Or does DeepEP not support that?
It is a DeepEP limitation. DeepEP doesn't support that.
Would it make sense to fake it out by replicating the scale and then resizing/truncating them after the dispatch?
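A minimal sketch of that idea (illustrative only; both function names are hypothetical):

```python
import torch

def replicate_per_tensor_scale(scale: torch.Tensor, num_tokens: int) -> torch.Tensor:
    # scale: shape [1], one scale for the whole activation tensor.
    # Give every token a copy so the dispatch sees per-token scales.
    return scale.expand(num_tokens, 1).contiguous()

def truncate_after_dispatch(scales: torch.Tensor) -> torch.Tensor:
    # Every row is identical by construction, so collapse back to one.
    return scales[:1].reshape(1)
```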
I went back and looked at the DeepEP documentation here. However, it looks like we are … cc @tlrmchlsmth
Thanks!
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Head branch was pushed to by a user without write access
Force-pushed from 80cb125 to fcf2fe9
… Dispatch (vllm-project#21837) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]>
… Dispatch (vllm-project#21837) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>
… Dispatch (vllm-project#21837) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Signed-off-by: Noam Gat <[email protected]>
… Dispatch (vllm-project#21837) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Signed-off-by: Paul Pak <[email protected]>
… Dispatch (vllm-project#21837) Signed-off-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Signed-off-by: Diego-Castan <[email protected]>
Purpose
The DeepEPHighThroughput All2All kernel, when used with DeepSeek models, dispatches the tokens in a 16-bit datatype and quantizes after dispatch. This is inefficient for two reasons: the 16-bit payload roughly doubles the communication volume compared to fp8, and quantizing after dispatch repeats the same work on every rank that receives a copy of a token.
This PR introduces a fix to quantize to fp8 first and then dispatch the fp8 tensor.
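For illustration, a self-contained sketch of the quantize-before-dispatch idea (not vLLM's actual kernel; a group size of 128 is assumed, matching DeepSeek-style block quantization):

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def quantize_blockwise_fp8(x: torch.Tensor, group_size: int = 128):
    # x: [num_tokens, hidden_dim] activations; hidden_dim must be a
    # multiple of group_size.
    t, h = x.shape
    xg = x.float().view(t, h // group_size, group_size)
    amax = xg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6)
    scales = amax / FP8_MAX                      # one scale per group
    x_fp8 = (xg / scales).to(torch.float8_e4m3fn).view(t, h)
    return x_fp8, scales.squeeze(-1)             # [t, h // group_size]

# Conceptually: x_fp8, s = quantize_blockwise_fp8(x); dispatch(x_fp8, s)
# instead of dispatch(x_bf16) followed by quantization on the receiver.
```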
Test Plan
canhazgpu run -g2 -- pytest -s tests/kernels/moe/test_modular_kernel_combinations.py
canhazgpu run -g2 -- pytest tests/kernels/moe/test_deepep_deepgemm_moe.py
Test Result
All tests pass for both:
canhazgpu run -g2 -- pytest -s tests/kernels/moe/test_modular_kernel_combinations.py
canhazgpu run -g2 -- pytest tests/kernels/moe/test_deepep_deepgemm_moe.py