[Bugfix] Fix accuracy issue when using flashinfer cutlass moe, TP=1 and modelopt. #23125
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request addresses an accuracy issue with FlashInfer's CUTLASS MoE implementation when tensor parallelism is set to 1. The fix introduces a dedicated path for this configuration. However, the current implementation introduces a performance concern by repeatedly creating MoE kernels on each forward pass. My review includes suggestions to cache this kernel to avoid performance degradation, which involves refactoring the new helper function and caching the kernel within the ModelOptNvFp4FusedMoE method.
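For illustration, a minimal sketch of the caching pattern the review suggests, assuming a lazily-initialized attribute on the class; the factory function, attribute, and method names below are hypothetical placeholders, not vLLM's actual API.

```python
from typing import Any, Callable, Optional


def make_flashinfer_cutlass_kernel(layer: Any) -> Callable[[Any], Any]:
    # Hypothetical stand-in for the expensive kernel construction that the
    # review flags as happening on every forward pass.
    return lambda hidden_states: hidden_states


class ModelOptNvFp4FusedMoE:
    def __init__(self) -> None:
        # None until the first forward pass; built once, then reused.
        self._moe_kernel: Optional[Callable[[Any], Any]] = None

    def apply(self, layer: Any, hidden_states: Any) -> Any:
        # Create the kernel lazily on first use and cache it, rather than
        # reconstructing it on every call.
        if self._moe_kernel is None:
            self._moe_kernel = make_flashinfer_cutlass_kernel(layer)
        return self._moe_kernel(hidden_states)
```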
Thanks for the work! Could you also benchmark the e2e performance?
Added some benchmark results to the summary.
@mgoin, @yewentao256, @amirkl94 do you guys know if compressed_tensors_moe.py needs a similar change?
@bnellnm yes I'm sure it needs to be updated, but I'm not sure how close it matches the current state in modelopt. It is supposed to be essentially the same, but we haven't refactored to share everything yet.
Ok, I'll make a similar modification to compressed_tensors_moe.py. Do you know of a model I could use to verify that it's working properly?
Nevermind, I found one.
Thank you
[Bugfix] Fix accuracy issue when using flashinfer cutlass moe, TP=1 and modelopt. (vllm-project#23125) Signed-off-by: Bill Nell <[email protected]> Co-authored-by: Michael Goin <[email protected]>
Purpose
Fix accuracy issue when using flashinfer cutlass moe, TP=1 and modelopt. Alternative fix to #23094.
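As a rough illustration of the dedicated TP=1 path described in the review above, here is a minimal sketch of dispatching on the tensor-parallel size; every function name here is a hypothetical placeholder, not vLLM's actual internals.

```python
from typing import Any, Callable


def fused_moe_generic(hidden_states: Any) -> Any:
    # Placeholder for the pre-existing fused-MoE path.
    return hidden_states


def fused_moe_flashinfer_cutlass_tp1(hidden_states: Any) -> Any:
    # Placeholder for the dedicated FlashInfer CUTLASS path used when TP=1.
    return hidden_states


def select_moe_forward(tp_size: int) -> Callable[[Any], Any]:
    # The fix gives TP=1 its own code path; larger TP sizes keep the
    # existing one.
    if tp_size == 1:
        return fused_moe_flashinfer_cutlass_tp1
    return fused_moe_generic


# Usage: pick the forward implementation once, based on the TP size.
moe_forward = select_moe_forward(tp_size=1)
```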
Q: Does compressed_tensors_moe.py need a similar change?
Test Plan
Test with compressed tensors.
Run the benchmark before #22035 and after.
Test Result
Compressed tensors
Benchmark before #22035:
Benchmark after:
cc @varun-sundar-rabindranath, @yewentao256, @amirkl94