
@tjtanaavllm commented on Sep 11, 2025

Purpose

Sync with upstream to pick up the latest changes and to fix the compressed-tensors FP8 weight-loading accuracy issue.

Upgrade the vLLM version to 0.10.2rc2.dev+ge408272.
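
As a quick post-upgrade sanity check (not part of the original description), the installed build can be confirmed before serving. A minimal sketch, assuming a standard vLLM install in the serving environment:

# Illustrative only: confirm the environment is running the upgraded build.
python -c "import vllm; print(vllm.__version__)"  # expected to report 0.10.2rc2.dev+ge408272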

Test Plan

Validated on the models of interest. A sketch of the matching accuracy-evaluation command is included after the serve commands below.

  1. Qwen/Qwen3-235B-A22B-FP8
MODEL=Qwen/Qwen3-235B-A22B-FP8


VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
vllm serve $MODEL \
--tensor-parallel-size 8 \
--max-num-batched-tokens 32768 \
--disable-log-requests \
--kv-cache-dtype fp8 \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--trust-remote-code \
--enable_expert_parallel \
--port 6789 \
> server-Qwen_Qwen3-235B-A22B-FP8-aiter-v1-fp8-cudagraph_FULL.log 2>&1
  2. Qwen/Qwen2.5-VL-72B-Instruct
VLLM_RPC_TIMEOUT=1800000 \
VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MHA=1 \
VLLM_USE_TRITON_FLASH_ATTN=0 \
SAFETENSORS_FAST_GPU=1 \
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
 -tp 8 \
 --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
 --trust_remote_code \
 --mm-processor-kwargs '{"max_pixels":802816,"min_pixels":3136}' \
 --limit-mm-per-prompt='{"image": 64}' \
 --mm-encoder-tp-mode "data" \
> server_Qwen_Qwen2.5-VL-72B-Instruct-aiter-v1-tp8-dp8-cudagraph_FULL_AND_PIECEWISE.log 2>&1
  3. Qwen/Qwen2.5-VL-3B-Instruct
VLLM_RPC_TIMEOUT=1800000 \
VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MHA=1 \
VLLM_USE_TRITON_FLASH_ATTN=0 \
SAFETENSORS_FAST_GPU=1 \
vllm serve Qwen/Qwen2.5-VL-3B-Instruct \
 -tp 1 \
 --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
 --trust_remote_code \
 --mm-processor-kwargs '{"max_pixels":802816,"min_pixels":3136}' \
 --limit-mm-per-prompt='{"image": 64}' \
> server_Qwen_Qwen2.5-VL-3B-Instruct-aiter-v1-tp1-cudagraph_FULL_AND_PIECEWISE.log 2>&1
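
The GSM8K numbers in the results below were collected with lm-evaluation-harness against the OpenAI-compatible endpoint started above. The exact invocation is not part of the original description; a minimal sketch reconstructed from the result header (local-completions backend, base_url, 5-shot, batch size 100) could look like:

# Illustrative sketch: score the served Qwen3-235B-A22B-FP8 endpoint on GSM8K with lm-eval.
lm_eval --model local-completions \
  --model_args model=Qwen/Qwen3-235B-A22B-FP8,base_url=http://127.0.0.1:6789/v1/completions \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 100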

Test Result

  1. Qwen/Qwen3-235B-A22B-FP8
local-completions (model=Qwen/Qwen3-235B-A22B-FP8,base_url=http://127.0.0.1:6789/v1/completions), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 100
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8423|±  |0.0100|
|     |       |strict-match    |     5|exact_match|↑  |0.8150|±  |0.0107|

  2. Qwen/Qwen2.5-VL-72B-Instruct (ChartQA; see the reproduction sketch after the results list)
For detailed information on this command, run:
  run.py eval_vllm --model_name Qwen/Qwen2.5-VL-72B-Instruct --url http://0.0.0.0:8000 --output_dir ./chartqa --eval_name chartqa - --help
================================================================================
Metrics:
{
    "explicit_prompt_relaxed_correctness": 0.8684,
    "anywhere_in_answer_relaxed_correctness": 0.8852
}
================================================================================
  3. Qwen/Qwen2.5-VL-3B-Instruct
For detailed information on this command, run:
  run.py eval_vllm --model_name Qwen/Qwen2.5-VL-3B-Instruct --url http://0.0.0.0:8000 --output_dir ./chartqa --eval_name chartqa - --help
================================================================================
Metrics:
{
    "explicit_prompt_relaxed_correctness": 0.8104,
    "anywhere_in_answer_relaxed_correctness": 0.8144
}
================================================================================
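
The ChartQA runs above use an eval_vllm harness whose help banner is quoted in the output, but the invocation itself is not spelled out. A hedged sketch reconstructed from that banner, assuming run.py is the harness entry point (e.g. a mistral-evals style eval_vllm runner) and the server is listening on port 8000:

# Illustrative sketch reconstructed from the help banner above; swap in the 3B model name for the second run.
python run.py eval_vllm \
  --model_name Qwen/Qwen2.5-VL-72B-Instruct \
  --url http://0.0.0.0:8000 \
  --output_dir ./chartqa \
  --eval_name chartqa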


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test commands.
  • The test results, such as pasting a results comparison (before and after) or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

tdoublep and others added 30 commits August 30, 2025 00:16

tlrmchlsmth and others added 17 commits September 10, 2025 00:32

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@tjtanaavllm merged commit 84bf287 into llama_fp8_03122025 on Sep 12, 2025
7 checks passed