
Conversation

@nvpohanh (Contributor) commented Aug 28, 2025

Changes:

  • Enable EP for GPT-OSS with FlashInfer trtllm-gen MoE
  • Fix an issue where VLLM_USE_FLASHINFER_MOE_FP4 was checked even when the quant dtype is not nvfp4.
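The second fix can be sketched as follows. This is an illustrative reconstruction of the intended gating, not vLLM's actual code: the names `should_use_flashinfer_fp4` and `quant_dtype` are hypothetical, and only the env var name comes from this PR.

```python
# Hypothetical sketch of the fixed gating logic: the FP4 env flag is only
# consulted when the quantization dtype is actually nvfp4. The function and
# parameter names are illustrative, not vLLM's real API.
import os

def should_use_flashinfer_fp4(quant_dtype: str) -> bool:
    # Before the fix, the env var was checked unconditionally; after the
    # fix, a non-nvfp4 quant dtype short-circuits to False.
    if quant_dtype != "nvfp4":
        return False
    return os.environ.get("VLLM_USE_FLASHINFER_MOE_FP4", "0") == "1"
```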

Purpose

Test Plan

Run GPT-OSS-120b with DP+EP on B200x2

Server command:

export VLLM_SKIP_P2P_CHECK=1
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
export VLLM_FLASHINFER_ALLREDUCE_FUSION_THRESHOLDS_MB='{"2":32,"4":32,"8":8}'
# ASYNC_SCHEDULING_FLAG="--async-scheduling"
ASYNC_SCHEDULING_FLAG=""
FUSION_FLAG='{"pass_config":{"enable_fi_allreduce_fusion":true,"enable_attn_fusion":true,"enable_noop":true},"custom_ops":["+quant_fp8","+rms_norm"],"cudagraph_mode":"FULL_DECODE_ONLY","splitting_ops":[]}'

vllm serve ${MODEL_NAME} \
  --host 0.0.0.0 \
  --port 8000 \
  --kv-cache-dtype auto \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --compilation-config ${FUSION_FLAG} \
  ${ASYNC_SCHEDULING_FLAG} \
  --enable-chunked-prefill \
  --no-enable-prefix-caching \
  --pipeline-parallel-size 1 \
  --tensor-parallel-size 2 --enable-expert-parallel \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --max-model-len 2048 &
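The VLLM_FLASHINFER_ALLREDUCE_FUSION_THRESHOLDS_MB value above is a JSON map from world size to a message-size threshold in MB. A minimal sketch of how such a value could be parsed and looked up is shown below; this is an assumption for illustration, not vLLM's actual implementation, and `fusion_threshold_mb` is a hypothetical helper.

```python
# Illustrative parser for the thresholds env var used above, e.g.
# '{"2":32,"4":32,"8":8}'. Keys are world sizes (as strings, since JSON
# object keys must be strings); values are thresholds in MB.
import json
from typing import Optional

def fusion_threshold_mb(env_value: str, world_size: int) -> Optional[float]:
    thresholds = json.loads(env_value)
    value = thresholds.get(str(world_size))
    # Unlisted world sizes get no threshold (None) in this sketch.
    return float(value) if value is not None else None
```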

Accuracy command:

python3 -m gpt_oss.evals --sampler chat_completions \
    --model ${MODEL_NAME} \
    --reasoning-effort low \
    --n-threads 128 \
    --eval gpqa

Test Result

Writing report to gpt-oss-120b-low_temp1.0_20250828_095528.html
{'chars': np.float64(93.82828282828282), 'chars:std': np.float64(252.986833525888), 'score': np.float64(0.6376262626262627), 'score:std': np.float64(0.4806859804857294)}
Writing results to gpt-oss-120b-low_temp1.0_20250828_095528.json
Writing all results to gpt-oss-120b-low_temp1.0_20250828_095528_allresults.json
[{'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-low_temp1.0_20250828_095528', 'metric': 0.6376262626262627}]

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the gpt-oss Related to GPT-OSS models label Aug 28, 2025
@gemini-code-assist (bot) left a comment

Code Review

This pull request enables Expert Parallelism for GPT-OSS with FlashInfer trtllm-gen MoE and fixes an issue with checking the quantization data type. The changes are generally well-implemented, but I've identified a critical issue where a missing null check could lead to a runtime error. I have provided a code suggestion to address this potential crash.

@nvpohanh nvpohanh force-pushed the dev/nvpohanh/fix-gpt-oss-dep branch from 64451ad to fb0a767 Compare August 28, 2025 10:04
… MoE

Changes:

- Enable EP for GPT-OSS with FlashInfer trtllm-gen MoE
- Fix an issue that VLLM_USE_FLASHINFER_MOE_FP4 is checked even when the
  quant dtype is not nvfp4.

Signed-off-by: Po-Han Huang <[email protected]>
@nvpohanh nvpohanh force-pushed the dev/nvpohanh/fix-gpt-oss-dep branch from fb0a767 to 1c3d8a7 Compare August 28, 2025 10:15
@mgoin (Member) left a comment

Nice and clean!

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 28, 2025
@mgoin mgoin enabled auto-merge (squash) August 28, 2025 10:35
@vllm-bot vllm-bot merged commit 9508960 into vllm-project:main Aug 28, 2025
51 of 53 checks passed
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Sep 3, 2025
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025