
Conversation


@momo609 momo609 commented Sep 16, 2025

What this PR does / why we need it?

Add e2e test related to weight updates in RL scenarios.

Due to CI issues, the newly added Python test files cannot be located at the correct relative path. As a temporary workaround, the test cases are added using absolute paths.

How was this patch tested?


👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write a clear commit message and fill out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an end-to-end test for weight updates, which is a valuable addition. However, the implementation in the example script and test file has several areas for improvement. Key issues include fragile reliance on internal vLLM APIs, which poses a significant maintenance risk, as well as several instances of incorrect or misleading information in comments, logs, and documentation. Addressing these points will enhance the code's quality, robustness, and maintainability.

enable_sleep_mode=enable_sleep_mode,
)
model_path = model
model = llm.llm_engine.model_executor.driver_worker.worker.model_runner.model

critical

Accessing llm.llm_engine.model_executor.driver_worker.worker.model_runner.model is a fragile pattern that deeply couples this script to vLLM's internal implementation. This is prone to breaking with future updates. Please use a public, stable API to get the model object for weight loading. If one doesn't exist, this dependency on internal structure should be noted as a significant technical debt and risk.
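One way to contain that risk, as a minimal sketch (the attribute chain below mirrors the script's current access path and is a vLLM internal, not a stable API), is to isolate the access behind a single helper so a future vLLM refactor breaks exactly one well-documented function:

```python
def get_underlying_model(llm):
    """Return the underlying model object for weight loading.

    WARNING: walks vLLM internals (llm_engine -> model_executor ->
    driver_worker -> worker -> model_runner -> model). This is verified
    only against the vLLM version this repo pins and may break on upgrade.
    """
    try:
        return (llm.llm_engine.model_executor.driver_worker.worker
                .model_runner.model)
    except AttributeError as err:
        # Fail loudly with a pointer to the single place to fix.
        raise RuntimeError(
            "vLLM internal layout changed; update get_underlying_model()"
        ) from err
```

Callers then depend only on `get_underlying_model(llm)`, and the technical debt is documented in one place.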

Comment on lines 22 to 56
Usage:
Single node:
Dense models:
python examples/offline_external_launcher.py \
--model="Qwen/Qwen2.5-0.5B-Instruct" \
--tp-size=1 \
--proc-per-node=2
MOE models:
python examples/offline_external_launcher.py \
--model="Qwen/Qwen3-30B-A3B" \
--tp-size=2 \
--proc-per-node=2 \
--enable-expert-parallel

Multi-node:
Node 0 (assume the node has ip of 10.99.48.128):
python examples/offline_external_launcher.py \
--model="Qwen/Qwen3-30B-A3B" \
--tp-size=2 \
--node-size=2 \
--node-rank=0 \
--proc-per-node=2 \
--enable-expert-parallel \
--master-addr=10.99.48.128 \
--master-port=13345
Node 1:
python examples/offline_external_launcher.py \
--model="Qwen/Qwen3-30B-A3B" \
--tp-size=2 \
--node-size=2 \
--node-rank=1 \
--enable-expert-parallel \
--master-addr=10.99.48.128 \
--master-port=13345
"""

high

The usage examples in the docstring refer to offline_external_launcher.py, but the filename is offline_weight_load.py. This is incorrect and will cause users to fail when trying to run this example script. Please update all occurrences (lines 25, 30, 38, 48).

Comment on lines 76 to 78
# Define MLP attribute mapping for different model types
MLP_ATTR_MAPPING = {}
DEFAULT_MLP_ATTR = "mlp"

high

The comment Define MLP attribute mapping for different model types is misleading because the MLP_ATTR_MAPPING dictionary is empty, causing the function to always use the default 'mlp'. This makes the function less flexible than it appears. To improve clarity, either populate the mapping to support other models or simplify the code to remove the unused mapping and misleading comment.
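A minimal sketch of the populated alternative (the model-type keys below are illustrative assumptions, not taken from the PR):

```python
# Map HF model_type -> name of the per-layer MLP attribute.
# The entries below are hypothetical examples; any unlisted model type
# falls back to the default "mlp".
MLP_ATTR_MAPPING = {
    "qwen3_moe": "mlp",    # hypothetical entry
    "deepseek_v2": "mlp",  # hypothetical entry
}
DEFAULT_MLP_ATTR = "mlp"


def get_mlp_attr(model_type: str) -> str:
    """Resolve the per-layer MLP attribute name for a model type."""
    return MLP_ATTR_MAPPING.get(model_type, DEFAULT_MLP_ATTR)
```

If no second model type is ever expected, dropping the dict and returning the literal `"mlp"` is the simpler fix.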

merged_dict = {}

if not os.path.isdir(directory):
raise ValueError(f"目录不存在: {directory}")

high

The error message is in Chinese. To maintain consistency with the rest of the codebase and ensure accessibility for all contributors, please use English for all user-facing strings, including error messages.

Suggested change
raise ValueError(f"目录不存在: {directory}")
raise ValueError(f"Directory not found: {directory}")

Comment on lines +175 to +187
def main(
local_rank: int,
rank: int,
master_addr: str,
master_port: int,
model_weight_gib: float,
model: str = "Qwen/Qwen3-30B-A3B",
world_size: int = 4,
tensor_parallel_size: int = 2,
enable_expert_parallel: bool = False,
enforce_eager: bool = True,
trust_remote_code: bool = True,
enable_sleep_mode: bool = False,
temperature: float = 0.8,
):

high

The main function defines default arguments that differ from those in parse_args (e.g., --model). This creates two sources of truth and is confusing. The arguments are then passed individually to main. It would be cleaner to have parse_args as the single source for defaults and pass the args object to main, or at least remove the misleading defaults from main's signature.
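A hedged sketch of the single-source-of-truth approach (argument names are illustrative; the point is that only `parse_args` defines defaults, and `main` receives the parsed namespace instead of duplicating them):

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    # The ONLY place defaults live.
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default="Qwen/Qwen3-30B-A3B")
    parser.add_argument("--tp-size", type=int, default=2)
    parser.add_argument("--enable-expert-parallel", action="store_true")
    parser.add_argument("--temperature", type=float, default=0.8)
    return parser.parse_args(argv)


def main(local_rank: int, rank: int, args: argparse.Namespace) -> None:
    # No duplicated defaults here: every tunable comes from args.
    print(f"rank={rank} model={args.model} tp={args.tp_size}")
```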

proc.join(timeout=600)
if proc.exitcode is None:
print(
f"Killing process {proc.pid} that didn't stop within 30 minutes."

high

The process join timeout is set to 600 seconds (10 minutes), but the corresponding log message for a timeout event states '30 minutes'. This inconsistency can be misleading when debugging. Please align the log message with the actual timeout value.

Suggested change
f"Killing process {proc.pid} that didn't stop within 30 minutes."
f"Killing process {proc.pid} that didn't stop within 10 minutes."
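A more robust fix, sketched under the assumption that the launcher can keep the timeout in one module-level constant, derives the log message from that constant so the two can never drift apart again:

```python
PROC_JOIN_TIMEOUT_S = 600  # single source of truth for the join timeout


def timeout_message(pid: int) -> str:
    """Log line for a child process that outlived the join timeout."""
    return (f"Killing process {pid} that didn't stop within "
            f"{PROC_JOIN_TIMEOUT_S // 60} minutes.")


# Usage in the launcher:
#   proc.join(timeout=PROC_JOIN_TIMEOUT_S)
#   if proc.exitcode is None:
#       print(timeout_message(proc.pid))
#       proc.kill()
```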

"""
Compare the outputs of vLLM with and without aclgraph.

Run `pytest tests/multicard/test_external_launcher.py`.

high

The docstring provides an incorrect command to run the test, referencing test_external_launcher.py instead of the current file, test_weight_loader.py. This is likely a copy-paste error and should be corrected to avoid confusion.

Suggested change
Run `pytest tests/multicard/test_external_launcher.py`.
Run `pytest tests/e2e/multicard/test_weight_loader.py`.

@momo609 momo609 force-pushed the weightloader branch 5 times, most recently from 76b715c to f2a9a4c Compare September 17, 2025 01:44
@weijinqian0 (Collaborator) approved

@wangxiyuan wangxiyuan added the ready and ready-for-test labels Sep 17, 2025
@momo609 momo609 force-pushed the weightloader branch 5 times, most recently from 03e5684 to f128429 Compare September 18, 2025 11:52
@github-actions github-actions bot added the merge-conflicts label and removed the ready label Sep 20, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@momo609 momo609 force-pushed the weightloader branch 2 times, most recently from 33068b7 to 2fe0308 Compare September 25, 2025 07:17
@momo609 momo609 added the ready label Sep 25, 2025
Signed-off-by: wangxiaoxin-sherie <[email protected]>
Co-authored-by: Shangwei-Li <[email protected]>

@pytest.mark.parametrize("model", MOE_MODELS)
def test_external_launcher_eager(model):
    script = "/usr/local/python3.11.13/bin/python3.11/__w/vllm-ascend/tests/examples/test_weight_loader.py"

I'm okay with this temporarily to quickly enable this test case. Maybe we could try to fix it in the next pr? @momo609

@MengqingCao MengqingCao merged commit 8406aaf into vllm-project:main Sep 26, 2025
16 checks passed
Labels: module:tests, ready, ready-for-test