Add e2e test related to weight updates in RL scenarios. #2954
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces an end-to-end test for weight updates, which is a valuable addition. However, the implementation in the example script and test file has several areas for improvement. Key issues include fragile reliance on internal vLLM APIs, which poses a significant maintenance risk, as well as several instances of incorrect or misleading information in comments, logs, and documentation. Addressing these points will enhance the code's quality, robustness, and maintainability.
examples/offline_weight_load.py
Outdated
enable_sleep_mode=enable_sleep_mode,
)
model_path = model
model = llm.llm_engine.model_executor.driver_worker.worker.model_runner.model
Accessing llm.llm_engine.model_executor.driver_worker.worker.model_runner.model is a fragile pattern that deeply couples this script to vLLM's internal implementation and is prone to breaking with future updates. Please use a public, stable API to get the model object for weight loading. If one doesn't exist, this dependency on internal structure should be noted as significant technical debt and risk.
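If no public accessor exists, one mitigation is to confine the attribute walk to a single helper, so a future vLLM refactor fails in exactly one place with a clear error instead of scattering breakage across the script. A minimal sketch; the attribute path mirrors the script's current usage and is an assumption about vLLM internals, and `get_model_from_llm` is a hypothetical helper name:

```python
def get_model_from_llm(llm):
    """Resolve the underlying model object from a vLLM `LLM` instance.

    WARNING: this walks vLLM-internal attributes and may break across
    vLLM versions; keep every such access confined to this helper.
    """
    path = "llm_engine.model_executor.driver_worker.worker.model_runner.model"
    obj = llm
    for attr in path.split("."):
        if not hasattr(obj, attr):
            raise AttributeError(
                f"vLLM internals changed: {type(obj).__name__!r} has no "
                f"attribute {attr!r} while resolving {path!r}")
        obj = getattr(obj, attr)
    return obj
```

Callers then get a single, explicit failure point rather than an opaque chained-attribute error.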
Usage:
Single node:
    Dense models:
        python examples/offline_external_launcher.py \
            --model="Qwen/Qwen2.5-0.5B-Instruct" \
            --tp-size=1 \
            --proc-per-node=2
    MOE models:
        python examples/offline_external_launcher.py \
            --model="Qwen/Qwen3-30B-A3B" \
            --tp-size=2 \
            --proc-per-node=2 \
            --enable-expert-parallel

Multi-node:
    Node 0 (assume the node has ip of 10.99.48.128):
        python examples/offline_external_launcher.py \
            --model="Qwen/Qwen3-30B-A3B" \
            --tp-size=2 \
            --node-size=2 \
            --node-rank=0 \
            --proc-per-node=2 \
            --enable-expert-parallel \
            --master-addr=10.99.48.128 \
            --master-port=13345
    Node 1:
        python examples/offline_external_launcher.py \
            --model="Qwen/Qwen3-30B-A3B" \
            --tp-size=2 \
            --node-size=2 \
            --node-rank=1 \
            --enable-expert-parallel \
            --master-addr=10.99.48.128 \
            --master-port=13345
"""
examples/offline_weight_load.py
Outdated
# Define MLP attribute mapping for different model types
MLP_ATTR_MAPPING = {}
DEFAULT_MLP_ATTR = "mlp"
The comment "Define MLP attribute mapping for different model types" is misleading because the MLP_ATTR_MAPPING dictionary is empty, causing the function to always use the default "mlp". This makes the function less flexible than it appears. To improve clarity, either populate the mapping to support other models or simplify the code to remove the unused mapping and misleading comment.
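If the mapping is kept, an explicit fallback plus an honest comment keeps the code and its documentation aligned; a sketch of that direction (the `get_mlp_attr` helper is illustrative, not from the PR):

```python
# Maps a model's config.model_type to the attribute name of its MLP/FFN
# submodule. Empty today: every currently supported model uses the
# conventional "mlp" attribute; add entries only when a model deviates.
MLP_ATTR_MAPPING: dict = {}
DEFAULT_MLP_ATTR = "mlp"


def get_mlp_attr(model_type: str) -> str:
    """Return the MLP attribute name for a model type, defaulting to "mlp"."""
    return MLP_ATTR_MAPPING.get(model_type, DEFAULT_MLP_ATTR)
```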
examples/offline_weight_load.py
Outdated
merged_dict = {}

if not os.path.isdir(directory):
    raise ValueError(f"目录不存在: {directory}")
The error message is in Chinese. To maintain consistency with the rest of the codebase and ensure accessibility for all contributors, please use English for all user-facing strings, including error messages.
- raise ValueError(f"目录不存在: {directory}")
+ raise ValueError(f"Directory not found: {directory}")
def main(
    local_rank: int,
    rank: int,
    master_addr: str,
    master_port: int,
    model_weight_gib: float,
    model: str = "Qwen/Qwen3-30B-A3B",
    world_size: int = 4,
    tensor_parallel_size: int = 2,
    enable_expert_parallel: bool = False,
    enforce_eager: bool = True,
    trust_remote_code: bool = True,
    enable_sleep_mode: bool = False,
    temperature: float = 0.8,
):
The main function defines default arguments that differ from those in parse_args (e.g., --model). This creates two sources of truth and is confusing. The arguments are then passed individually to main. It would be cleaner to have parse_args as the single source for defaults and pass the args object to main, or at least remove the misleading defaults from main's signature.
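The single-source-of-truth direction could look like this: every default lives in parse_args, and main consumes the parsed namespace instead of a long list of individually passed parameters. A hedged sketch; only a few of the script's flags are shown, with names taken from the usage text above:

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    # The one and only place where defaults are declared.
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default="Qwen/Qwen3-30B-A3B")
    parser.add_argument("--tp-size", type=int, default=2)
    parser.add_argument("--temperature", type=float, default=0.8)
    parser.add_argument("--enable-expert-parallel", action="store_true")
    return parser.parse_args(argv)


def main(args: argparse.Namespace) -> None:
    # main re-declares nothing; it only consumes the parsed namespace.
    print(f"model={args.model} tp={args.tp_size}")


if __name__ == "__main__":
    main(parse_args())
```

With this shape, changing a default is a one-line edit that cannot silently disagree with a second copy in main's signature.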
proc.join(timeout=600)
if proc.exitcode is None:
    print(
        f"Killing process {proc.pid} that didn't stop within 30 minutes."
The process join timeout is set to 600 seconds (10 minutes), but the corresponding log message for a timeout event states '30 minutes'. This inconsistency can be misleading when debugging. Please align the log message with the actual timeout value.
- f"Killing process {proc.pid} that didn't stop within 30 minutes."
+ f"Killing process {proc.pid} that didn't stop within 10 minutes."
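Deriving the log message from the same constant as the timeout removes the possibility of drift entirely; a small sketch (the `reap` and `timeout_message` names are illustrative):

```python
JOIN_TIMEOUT_S = 600  # 10 minutes


def timeout_message(pid: int) -> str:
    # Built from JOIN_TIMEOUT_S, so the text can never disagree with the value.
    return (f"Killing process {pid} that didn't stop within "
            f"{JOIN_TIMEOUT_S // 60} minutes.")


def reap(proc) -> None:
    proc.join(timeout=JOIN_TIMEOUT_S)
    if proc.exitcode is None:
        print(timeout_message(proc.pid))
        proc.kill()
```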
"""
Compare the outputs of vLLM with and without aclgraph.

Run `pytest tests/multicard/test_external_launcher.py`.
The docstring provides an incorrect command to run the test, referencing test_external_launcher.py instead of the current file, test_weight_loader.py. This is likely a copy-paste error and should be corrected to avoid confusion.
- Run `pytest tests/multicard/test_external_launcher.py`.
+ Run `pytest tests/e2e/multicard/test_weight_loader.py`.
approved
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: wangxiaoxin-sherie <[email protected]> Co-authored-by: Shangwei-Li <[email protected]>
@pytest.mark.parametrize("model", MOE_MODELS)
def test_external_launcher_eager(model):
    script = "/usr/local/python3.11.13/bin/python3.11/__w/vllm-ascend/tests/examples/test_weight_loader.py"
I'm okay with this temporarily to quickly enable this test case. Maybe we could try to fix it in the next pr? @momo609
What this PR does / why we need it?
Add e2e test related to weight updates in RL scenarios.
Due to CI issues, the newly added Python test files cannot locate the correct path. As a temporary solution, use absolute paths to add test cases.
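Once the CI path issue is resolved, the script location could be derived from the test file itself rather than an environment-specific absolute path; a sketch under the assumption that the test lives three directory levels below the checkout root (e.g. tests/e2e/multicard/):

```python
from pathlib import Path


def resolve_script(test_file: str) -> Path:
    """Locate the example script relative to the test file.

    Assumes the test sits at <repo>/tests/e2e/multicard/, so the repo
    root is four path components up (parents[3]).
    """
    repo_root = Path(test_file).resolve().parents[3]
    return repo_root / "examples" / "offline_weight_load.py"
```

In the test itself this would be called as `resolve_script(__file__)`, making the test portable across CI runners and local checkouts.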
How was this patch tested?