[Kernel] CUTLASS MoE FP8: Integrate cuda moe permute/unpermute #23045
Conversation
Code Review
This pull request integrates CUDA permute/unpermute kernels for MoE FP8 operations, aiming to improve performance. The changes include refactoring the MoE data preparation and finalization steps, introducing new CUDA kernels, and updating the Python bindings and benchmarks accordingly. My review found a few areas for improvement. There is a recurring typo in a variable name in one of the CUDA files, which should be corrected for consistency and readability. Additionally, there are redundant attribute assignments in two classes which can be removed to make the code cleaner and more maintainable.
Thanks for the work! Please also fix the issues Gemini flagged, as well as the DCO and pre-commit failures.
Excellent analysis and work! We were just talking about unreverting Eliza's work this week, so this is timely.
@mgoin @yewentao256 Thanks for the quick review! Addressed all comments and attached the quality test in the description.
@mgoin @yewentao256 Regarding the unrelated pytest error: I was able to track it down to #21083 (CUDA FP8 block quant kernel), which uses …
Nice work, I really like the speedups! Regarding the failed fused_moe tests, did you manually inspect the ground truth vs. CUTLASS MoE outputs to confirm that they look similar and that the max diff is triggered for only a few elements?
@ElizaWszola Thanks for reviewing! The failed unit test is not for CUTLASS but for the Triton MoE; please see my previous comment.
But yes, it's only triggered for < 0.01% of elements.
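For reference, a hedged sketch of the kind of manual inspection discussed above: compare a reference output against the kernel output and report the max diff plus the fraction of mismatching elements. The function name and tolerance values here are illustrative, not from the PR.

```python
import torch

def report_mismatch(ref_out: torch.Tensor, test_out: torch.Tensor,
                    rtol: float = 1e-2, atol: float = 1e-2) -> None:
    # Absolute element-wise difference against the reference.
    diff = (ref_out - test_out).abs()
    # Elements outside the given tolerances count as mismatches.
    mismatched = ~torch.isclose(test_out, ref_out, rtol=rtol, atol=atol)
    frac = mismatched.float().mean().item()
    print(f"max diff: {diff.max().item():.4e}, "
          f"mismatched: {100.0 * frac:.4f}% of elements")
```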
Great work, thank you!
Purpose
Integrate the permute/unpermute CUDA kernels from #17934 into `run_cutlass_moe_fp8` to speed up the ops before and after the CUTLASS matmuls:

- `moe_permute` to compute `expert_offsets` and `permuted_hidden_states`. Also expose the `compute_problem_sizes` kernel at the Python level to compute `problem_sizes` on its own (based on NCU profiling, this step takes only 1–2% of layer latency, so no further fusion was done).
- `moe_unpermute`, which fuses the weight multiplication and the unpermute together. `TopKWeightAndReduceNoOP` is therefore used to override `finalize_weight_and_reduce_impl()`. A pure-PyTorch sketch of these semantics follows below.

Additionally:

- `moe_shuffle` proposed in this previous PR (benchmark comparison attached below).

Note:



Profiling shows that the Triton FusedMoE kernel still beats the CUTLASS kernel at lower batch sizes because of the heavier CUTLASS preprocessing steps (e.g., permutation, more input tensors to prepare). I plan to open a follow-up PR to route lower batch sizes to Triton.
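A sketch of what that routing heuristic might look like; the threshold is a hypothetical placeholder to be tuned from the benchmarks, not a value from this PR:

```python
def use_cutlass_moe(num_tokens: int, min_cutlass_tokens: int = 64) -> bool:
    # Below the cutover point, CUTLASS preprocessing (permutation, extra
    # input tensors) dominates, so the Triton path stays faster.
    return num_tokens >= min_cutlass_tokens
```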
NCU profiling comparison between Triton and CUTLASS at M=16: the C3X Grouped GEMM matmuls take only 28% of MoE latency, vs. 63% for Triton. This overhead shrinks at larger M, where CUTLASS beats Triton.
CUTLASS Grouped GEMM (top two kernels are the two matmuls, followed by a long list of kernel calls):
Triton FusedMoE (far fewer kernels launched):
Test Plan
pytest tests/kernels/moe/*
Test Result
1. pytest tests/kernels/moe/*
33 failed, 6218 passed, 1852 skipped, 7 warnings in 3771.32s (1:02:51)
All failures are in:
FAILED tests/kernels/moe/test_block_fp8.py::test_w8a8_block_fp8_fused_moe
The error is related to TritonExperts, and I saw the same error on main without my changes. I'll see if I can track this down to a particular commit …
2. Kernel Comparison (unit: ms) vs. PR #20762 shuffle_rows
3. Performance Test: layer benchmark against baseline cutlass moe
Mixtral: RedHatAI/Mixtral-8x7B-Instruct-v0.1-FP8
LLaMA4: RedHatAI/Llama-4-Maverick-17B-128E-Instruct-FP8
QWEN3: Qwen/Qwen3-235B-A22B
4. Quality Test: