
Conversation

@bnellnm (Contributor) commented on Aug 18, 2025

Purpose

Fix an accuracy issue when using the FlashInfer CUTLASS MoE with TP=1 and ModelOpt. This is an alternative fix to #23094.
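
For context, the change adds a dedicated code path for the TP=1 configuration. The snippet below is only an illustrative sketch of that kind of dispatch; the function and backend names are hypothetical and not the actual patch:

```python
# Illustrative sketch only: names below are hypothetical, not the vLLM code.
# The idea is that the FlashInfer CUTLASS MoE path handles TP == 1 separately
# from the multi-rank (TP > 1) path instead of reusing the same code for both.
def select_moe_backend(tp_size: int, use_flashinfer_cutlass: bool) -> str:
    if use_flashinfer_cutlass:
        if tp_size == 1:
            # Dedicated single-rank path (conceptually what this fix adds).
            return "flashinfer_cutlass_moe_tp1"
        return "flashinfer_cutlass_moe_tp"
    return "default_fused_moe"


if __name__ == "__main__":
    # TP=1 with FlashInfer CUTLASS MoE takes the dedicated path.
    print(select_moe_backend(tp_size=1, use_flashinfer_cutlass=True))
```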

Q: Does compressed_tensors_moe.py need a similar change?

Test Plan

```bash
VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER \
  lm_eval --model vllm \
  --model_args pretrained=nvidia/Llama-4-Scout-17B-16E-Instruct-FP4,quantization=modelopt_fp4,tensor_parallel_size=1,max_model_len=2048,kv_cache_dtype=auto \
  --gen_kwargs temperature=0.0 --limit 500 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size 200
```

Test w/compressed tensors

```bash
VLLM_USE_TRTLLM_ATTENTION=0 VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER \
  lm_eval --model vllm \
  --model_args pretrained=RedHatAI/Llama-4-Scout-17B-16E-Instruct-NVFP4,quantization="compressed-tensors",tensor_parallel_size=1,max_model_len=2048,kv_cache_dtype=auto \
  --gen_kwargs temperature=0.0 --limit 500 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size 200
```

Run the throughput benchmark before #22035 and after:

```bash
VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER \
  python3 benchmarks/benchmark_throughput.py --model=nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 \
  --quantization=modelopt_fp4 --tensor_parallel_size=1 --max_model_len=2048 --kv_cache_dtype=auto \
  --trust_remote_code --num-prompts 128 --input-len=1024 --output-len=1024
```

Test Result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.92|±  |0.0121|
|     |       |strict-match    |     5|exact_match|↑  | 0.90|±  |0.0134|

Compressed tensors

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.922|±  |0.0120|
|     |       |strict-match    |     5|exact_match|↑  |0.912|±  |0.0127|

Benchmark before #22035:

```
Throughput: 3.61 requests/s, 7386.71 total tokens/s, 3696.12 output tokens/s
Total num prompt tokens:  130876
Total num output tokens:  131072
```

Benchmark after:

```
Throughput: 3.65 requests/s, 7484.65 total tokens/s, 3742.32 output tokens/s
Total num prompt tokens:  131072
Total num output tokens:  131072
```

(Optional) Documentation Update

cc @varun-sundar-rabindranath, @yewentao256, @amirkl94


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (bot) left a comment


Code Review

This pull request addresses an accuracy issue with FlashInfer's CUTLASS MoE implementation when tensor parallelism is set to 1. The fix introduces a dedicated path for this configuration. However, the current implementation introduces a performance concern by repeatedly creating MoE kernels on each forward pass. My review includes suggestions to cache this kernel to avoid performance degradation, which involves refactoring the new helper function and caching the kernel within the ModelOptNvFp4FusedMoE method.
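
A minimal sketch of the caching pattern being suggested, assuming the kernel object is expensive to construct and reusable across forward passes; the class, method, and helper names below are hypothetical, not the actual vLLM implementation:

```python
from typing import Any, Callable, Optional


def build_moe_kernel(moe_config: dict) -> Callable[..., Any]:
    """Hypothetical stand-in for the expensive FlashInfer CUTLASS MoE kernel
    construction; returns a no-op callable just to keep the sketch runnable."""
    def kernel(hidden_states: Any, router_logits: Any) -> Any:
        return hidden_states
    return kernel


class FusedMoEMethodSketch:
    """Cache the kernel on first use instead of rebuilding it every forward pass."""

    def __init__(self) -> None:
        self._cached_kernel: Optional[Callable[..., Any]] = None

    def _get_kernel(self, moe_config: dict) -> Callable[..., Any]:
        if self._cached_kernel is None:
            # Built once; subsequent forward passes reuse the cached object.
            self._cached_kernel = build_moe_kernel(moe_config)
        return self._cached_kernel

    def apply(self, moe_config: dict, hidden_states: Any, router_logits: Any) -> Any:
        kernel = self._get_kernel(moe_config)
        return kernel(hidden_states, router_logits)
```

One caveat for a real implementation: caching like this is only safe if the configuration used to build the kernel does not change between calls.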

@yewentao256 (Collaborator) left a comment


Thanks for the work! Could you also benchmark the e2e performance?

@bnellnm (Author) commented on Aug 18, 2025

> Thanks for the work! Could you also benchmark the e2e performance?

Added some benchmark results to the summary.

@bnellnm (Author) commented on Aug 18, 2025

@mgoin, @yewentao256 , @amirkl94 do you guys know if compressed_tensors_moe.py needs a similar fix?

@mgoin (Member) commented on Aug 18, 2025

@bnellnm yes, I'm sure it needs to be updated, but I'm not sure how closely it matches the current state in modelopt. It is supposed to be essentially the same, but we haven't refactored to share everything yet.

@bnellnm (Author) commented on Aug 18, 2025

> @bnellnm yes, I'm sure it needs to be updated, but I'm not sure how closely it matches the current state in modelopt. It is supposed to be essentially the same, but we haven't refactored to share everything yet.

Ok, I'll do a similar modification for compressed_tensors_moe.py. Do you know of a model I could use to verify that it's working properly?

bnellnm pushed commits (Signed-off-by: Bill Nell <[email protected]>)
@bnellnm (Author) commented on Aug 18, 2025

> @bnellnm yes, I'm sure it needs to be updated, but I'm not sure how closely it matches the current state in modelopt. It is supposed to be essentially the same, but we haven't refactored to share everything yet.
>
> Ok, I'll do a similar modification for compressed_tensors_moe.py. Do you know of a model I could use to verify that it's working properly?

Nevermind, I found one.

@mgoin changed the title from "[Bugfix] Fix accuracy issue when using flash infer cutlass moe, TP=1 and modelopt." to "[Bugfix] Fix accuracy issue when using flashinfer cutlass moe, TP=1 and modelopt." on Aug 19, 2025
@mgoin added the bug (Something isn't working) and ready (ONLY add when PR is ready to merge/full CI is needed) labels on Aug 19, 2025
@mgoin (Member) left a comment


Thank you

@mgoin merged commit b94faf9 into vllm-project:main on Aug 19, 2025 (55 checks passed).
@mgoin deleted the bugfix branch on August 19, 2025 at 18:00.
princepride pushed a commit to princepride/vllm that referenced this pull request Aug 20, 2025
divakar-amd pushed a commit to divakar-amd/vllm_upstream that referenced this pull request Aug 20, 2025
cyang49 pushed a commit to cyang49/vllm that referenced this pull request Aug 20, 2025
djmmoss pushed a commit to djmmoss/vllm that referenced this pull request Aug 21, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
mengxingkongzhouhan pushed a commit to mengxingkongzhouhan/vllm that referenced this pull request Aug 30, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Sep 3, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025