
Conversation

@nvpohanh (Contributor) commented Aug 28, 2025

Changes:

  • Enable EP for GPT-OSS with FlashInfer trtllm-gen MoE
  • Fix an issue where VLLM_USE_FLASHINFER_MOE_FP4 was checked even when the quant dtype is not nvfp4.
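The second fix can be sketched as follows. This is an illustrative reconstruction of the intended gating, not vLLM's actual code: the names `should_use_flashinfer_fp4` and `quant_dtype` are hypothetical, and only the env var name comes from this PR.

```python
# Hypothetical sketch of the fixed gating logic: the FP4 env flag is only
# consulted when the quantization dtype is actually nvfp4. The function and
# parameter names are illustrative, not vLLM's real API.
import os

def should_use_flashinfer_fp4(quant_dtype: str) -> bool:
    # Before the fix, the env var was checked unconditionally; after the
    # fix, a non-nvfp4 quant dtype short-circuits to False.
    if quant_dtype != "nvfp4":
        return False
    return os.environ.get("VLLM_USE_FLASHINFER_MOE_FP4", "0") == "1"
```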

Purpose

Test Plan

Run GPT-OSS-120b with DP+EP on B200x2

Server command:

export VLLM_SKIP_P2P_CHECK=1
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
export VLLM_FLASHINFER_ALLREDUCE_FUSION_THRESHOLDS_MB='{"2":32,"4":32,"8":8}'
# ASYNC_SCHEDULING_FLAG="--async-scheduling"
ASYNC_SCHEDULING_FLAG=""
FUSION_FLAG='{"pass_config":{"enable_fi_allreduce_fusion":true,"enable_attn_fusion":true,"enable_noop":true},"custom_ops":["+quant_fp8","+rms_norm"],"cudagraph_mode":"FULL_DECODE_ONLY","splitting_ops":[]}'

vllm serve ${MODEL_NAME} \
  --host 0.0.0.0 \
  --port 8000 \
  --kv-cache-dtype auto \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --compilation-config ${FUSION_FLAG} \
  ${ASYNC_SCHEDULING_FLAG} \
  --enable-chunked-prefill \
  --no-enable-prefix-caching \
  --pipeline-parallel-size 1 \
  --tensor-parallel-size 2 --enable-expert-parallel \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --max-model-len 2048 &
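The VLLM_FLASHINFER_ALLREDUCE_FUSION_THRESHOLDS_MB value above is a JSON map from world size to a message-size threshold in MB. A minimal sketch of how such a value could be parsed and looked up is shown below; this is an assumption for illustration, not vLLM's actual implementation, and `fusion_threshold_mb` is a hypothetical helper.

```python
# Illustrative parser for the thresholds env var used above, e.g.
# '{"2":32,"4":32,"8":8}'. Keys are world sizes (as strings, since JSON
# object keys must be strings); values are thresholds in MB.
import json
from typing import Optional

def fusion_threshold_mb(env_value: str, world_size: int) -> Optional[float]:
    thresholds = json.loads(env_value)
    value = thresholds.get(str(world_size))
    # Unlisted world sizes get no threshold (None) in this sketch.
    return float(value) if value is not None else None
```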

Accuracy command:

python3 -m gpt_oss.evals --sampler chat_completions \
    --model ${MODEL_NAME} \
    --reasoning-effort low \
    --n-threads 128 \
    --eval gpqa

Test Result

Writing report to gpt-oss-120b-low_temp1.0_20250828_095528.html
{'chars': np.float64(93.82828282828282), 'chars:std': np.float64(252.986833525888), 'score': np.float64(0.6376262626262627), 'score:std': np.float64(0.4806859804857294)}
Writing results to gpt-oss-120b-low_temp1.0_20250828_095528.json
Writing all results to gpt-oss-120b-low_temp1.0_20250828_095528_allresults.json
[{'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-low_temp1.0_20250828_095528', 'metric': 0.6376262626262627}]

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the gpt-oss Related to GPT-OSS models label Aug 28, 2025
@gemini-code-assist (bot) left a comment

Code Review

This pull request enables Expert Parallelism for GPT-OSS with FlashInfer trtllm-gen MoE and fixes an issue with checking the quantization data type. The changes are generally well-implemented, but I've identified a critical issue where a missing null check could lead to a runtime error. I have provided a code suggestion to address this potential crash.

@nvpohanh nvpohanh force-pushed the dev/nvpohanh/fix-gpt-oss-dep branch from 64451ad to fb0a767 Compare August 28, 2025 10:04
… MoE

Changes:

- Enable EP for GPT-OSS with FlashInfer trtllm-gen MoE
- Fix an issue that VLLM_USE_FLASHINFER_MOE_FP4 is checked even when the
  quant dtype is not nvfp4.

Signed-off-by: Po-Han Huang <[email protected]>
@nvpohanh nvpohanh force-pushed the dev/nvpohanh/fix-gpt-oss-dep branch from fb0a767 to 1c3d8a7 Compare August 28, 2025 10:15
@mgoin (Member) left a comment

Nice and clean!

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 28, 2025
@mgoin mgoin enabled auto-merge (squash) August 28, 2025 10:35
@vllm-bot vllm-bot merged commit 9508960 into vllm-project:main Aug 28, 2025
51 of 53 checks passed
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Sep 3, 2025
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025