
Conversation

@bnellnm (Contributor) commented on Aug 18, 2025

Purpose

Fix an accuracy issue when using the FlashInfer CUTLASS MoE with TP=1 and ModelOpt. This is an alternative fix to #23094.
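
For context, the change adds a dedicated code path for the TP=1 configuration. The snippet below is only an illustrative sketch of that kind of dispatch; the function and backend names are hypothetical and not the actual patch:

```python
# Illustrative sketch only: names below are hypothetical, not the vLLM code.
# The idea is that the FlashInfer CUTLASS MoE path handles TP == 1 separately
# from the multi-rank (TP > 1) path instead of reusing the same code for both.
def select_moe_backend(tp_size: int, use_flashinfer_cutlass: bool) -> str:
    if use_flashinfer_cutlass:
        if tp_size == 1:
            # Dedicated single-rank path (conceptually what this fix adds).
            return "flashinfer_cutlass_moe_tp1"
        return "flashinfer_cutlass_moe_tp"
    return "default_fused_moe"


if __name__ == "__main__":
    # TP=1 with FlashInfer CUTLASS MoE takes the dedicated path.
    print(select_moe_backend(tp_size=1, use_flashinfer_cutlass=True))
```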

Q: Does compressed_tensors_moe.py need a similar change?

Test Plan

```bash
VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER \
  lm_eval --model vllm \
  --model_args pretrained=nvidia/Llama-4-Scout-17B-16E-Instruct-FP4,quantization=modelopt_fp4,tensor_parallel_size=1,max_model_len=2048,kv_cache_dtype=auto \
  --gen_kwargs temperature=0.0 --limit 500 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size 200
```

Test w/compressed tensors

```bash
VLLM_USE_TRTLLM_ATTENTION=0 VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER \
  lm_eval --model vllm \
  --model_args pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4,quantization="compressed-tensors",tensor_parallel_size=1,max_model_len=2048,kv_cache_dtype=auto \
  --gen_kwargs temperature=0.0 --limit 500 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size 200
```

Run the throughput benchmark before #22035 and after:

```bash
VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER \
  python3 benchmarks/benchmark_throughput.py --model=nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 \
  --quantization=modelopt_fp4 --tensor_parallel_size=1 --max_model_len=2048 --kv_cache_dtype=auto \
  --trust_remote_code --num-prompts 128 --input-len=1024 --output-len=1024
```

Test Result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.92|±  |0.0121|
|     |       |strict-match    |     5|exact_match|↑  | 0.90|±  |0.0134|

Compressed tensors

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.922|±  |0.0120|
|     |       |strict-match    |     5|exact_match|↑  |0.912|±  |0.0127|

Benchmark before #22035:

```
Throughput: 3.61 requests/s, 7386.71 total tokens/s, 3696.12 output tokens/s
Total num prompt tokens:  130876
Total num output tokens:  131072
```

Benchmark after:

```
Throughput: 3.65 requests/s, 7484.65 total tokens/s, 3742.32 output tokens/s
Total num prompt tokens:  131072
Total num output tokens:  131072
```

(Optional) Documentation Update

cc @varun-sundar-rabindranath, @yewentao256, @amirkl94


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (bot) left a comment


Code Review

This pull request addresses an accuracy issue with FlashInfer's CUTLASS MoE implementation when tensor parallelism is set to 1. The fix introduces a dedicated path for this configuration. However, the current implementation introduces a performance concern by repeatedly creating MoE kernels on each forward pass. My review includes suggestions to cache this kernel to avoid performance degradation, which involves refactoring the new helper function and caching the kernel within the ModelOptNvFp4FusedMoE method.
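
A minimal sketch of the caching pattern being suggested, assuming the kernel object is expensive to construct and reusable across forward passes; the class, method, and helper names below are hypothetical, not the actual vLLM implementation:

```python
from typing import Any, Callable, Optional


def build_moe_kernel(moe_config: dict) -> Callable[..., Any]:
    """Hypothetical stand-in for the expensive FlashInfer CUTLASS MoE kernel
    construction; returns a no-op callable just to keep the sketch runnable."""
    def kernel(hidden_states: Any, router_logits: Any) -> Any:
        return hidden_states
    return kernel


class FusedMoEMethodSketch:
    """Cache the kernel on first use instead of rebuilding it every forward pass."""

    def __init__(self) -> None:
        self._cached_kernel: Optional[Callable[..., Any]] = None

    def _get_kernel(self, moe_config: dict) -> Callable[..., Any]:
        if self._cached_kernel is None:
            # Built once; subsequent forward passes reuse the cached object.
            self._cached_kernel = build_moe_kernel(moe_config)
        return self._cached_kernel

    def apply(self, moe_config: dict, hidden_states: Any, router_logits: Any) -> Any:
        kernel = self._get_kernel(moe_config)
        return kernel(hidden_states, router_logits)
```

One caveat for a real implementation: caching like this is only safe if the configuration used to build the kernel does not change between calls.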

@yewentao256 (Collaborator) left a comment


Thanks for the work! Could you also benchmark the e2e performance?

@bnellnm (Author) commented on Aug 18, 2025

> Thanks for the work! Could you also benchmark the e2e performance?

Added some benchmark results to the summary.

@bnellnm (Author) commented on Aug 18, 2025

@mgoin, @yewentao256 , @amirkl94 do you guys know if compressed_tensors_moe.py needs a similar fix?

@mgoin (Member) commented on Aug 18, 2025

@bnellnm yes, I'm sure it needs to be updated, but I'm not sure how closely it matches the current state in modelopt. It is supposed to be essentially the same, but we haven't refactored to share everything yet.

@bnellnm (Author) commented on Aug 18, 2025

> @bnellnm yes, I'm sure it needs to be updated, but I'm not sure how closely it matches the current state in modelopt. It is supposed to be essentially the same, but we haven't refactored to share everything yet.

Ok, I'll do a similar modification for compressed_tensors_moe.py. Do you know of a model I could use to verify that it's working properly?

bnellnm pushed commits (Signed-off-by: Bill Nell <[email protected]>)
@bnellnm (Author) commented on Aug 18, 2025

> @bnellnm yes, I'm sure it needs to be updated, but I'm not sure how closely it matches the current state in modelopt. It is supposed to be essentially the same, but we haven't refactored to share everything yet.
>
> Ok, I'll do a similar modification for compressed_tensors_moe.py. Do you know of a model I could use to verify that it's working properly?

Nevermind, I found one.

@mgoin changed the title from "[Bugfix] Fix accuracy issue when using flash infer cutlass moe, TP=1 and modelopt." to "[Bugfix] Fix accuracy issue when using flashinfer cutlass moe, TP=1 and modelopt." on Aug 19, 2025
@mgoin added the bug (Something isn't working) and ready (ONLY add when PR is ready to merge/full CI is needed) labels on Aug 19, 2025
@mgoin (Member) left a comment


Thank you

@mgoin merged commit b94faf9 into vllm-project:main on Aug 19, 2025 (55 checks passed).
@mgoin deleted the bugfix branch on August 19, 2025 at 18:00.
princepride pushed a commit to princepride/vllm that referenced this pull request Aug 20, 2025
divakar-amd pushed a commit to divakar-amd/vllm_upstream that referenced this pull request Aug 20, 2025
cyang49 pushed a commit to cyang49/vllm that referenced this pull request Aug 20, 2025
djmmoss pushed a commit to djmmoss/vllm that referenced this pull request Aug 21, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
mengxingkongzhouhan pushed a commit to mengxingkongzhouhan/vllm that referenced this pull request Aug 30, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Sep 3, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025