Add e2e test related to weight updates in RL scenarios. #2954
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces an end-to-end test for weight updates, which is a valuable addition. However, the implementation in the example script and test file has several areas for improvement. Key issues include fragile reliance on internal vLLM APIs, which poses a significant maintenance risk, as well as several instances of incorrect or misleading information in comments, logs, and documentation. Addressing these points will enhance the code's quality, robustness, and maintainability.
examples/offline_weight_load.py
Outdated
enable_sleep_mode=enable_sleep_mode,
)
model_path = model
model = llm.llm_engine.model_executor.driver_worker.worker.model_runner.model
Accessing llm.llm_engine.model_executor.driver_worker.worker.model_runner.model is a fragile pattern that deeply couples this script to vLLM's internal implementation and is prone to breaking with future updates. Please use a public, stable API to get the model object for weight loading. If one doesn't exist, this dependency on internal structure should be noted as significant technical debt and risk.
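If no public accessor exists, one mitigation is to confine the attribute walk to a single helper, so a future vLLM refactor fails in exactly one place with a clear error instead of scattering breakage across the script. A minimal sketch; the attribute path mirrors the script's current usage and is an assumption about vLLM internals, and `get_model_from_llm` is a hypothetical helper name:

```python
def get_model_from_llm(llm):
    """Resolve the underlying model object from a vLLM `LLM` instance.

    WARNING: this walks vLLM-internal attributes and may break across
    vLLM versions; keep every such access confined to this helper.
    """
    path = "llm_engine.model_executor.driver_worker.worker.model_runner.model"
    obj = llm
    for attr in path.split("."):
        if not hasattr(obj, attr):
            raise AttributeError(
                f"vLLM internals changed: {type(obj).__name__!r} has no "
                f"attribute {attr!r} while resolving {path!r}")
        obj = getattr(obj, attr)
    return obj
```

Callers then get a single, explicit failure point rather than an opaque chained-attribute error.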
Usage:
Single node:
    Dense models:
        python examples/offline_external_launcher.py \
            --model="Qwen/Qwen2.5-0.5B-Instruct" \
            --tp-size=1 \
            --proc-per-node=2
    MOE models:
        python examples/offline_external_launcher.py \
            --model="Qwen/Qwen3-30B-A3B" \
            --tp-size=2 \
            --proc-per-node=2 \
            --enable-expert-parallel

Multi-node:
    Node 0 (assume the node has ip of 10.99.48.128):
        python examples/offline_external_launcher.py \
            --model="Qwen/Qwen3-30B-A3B" \
            --tp-size=2 \
            --node-size=2 \
            --node-rank=0 \
            --proc-per-node=2 \
            --enable-expert-parallel \
            --master-addr=10.99.48.128 \
            --master-port=13345
    Node 1:
        python examples/offline_external_launcher.py \
            --model="Qwen/Qwen3-30B-A3B" \
            --tp-size=2 \
            --node-size=2 \
            --node-rank=1 \
            --enable-expert-parallel \
            --master-addr=10.99.48.128 \
            --master-port=13345
"""
examples/offline_weight_load.py
Outdated
# Define MLP attribute mapping for different model types
MLP_ATTR_MAPPING = {}
DEFAULT_MLP_ATTR = "mlp"
The comment "Define MLP attribute mapping for different model types" is misleading because the MLP_ATTR_MAPPING dictionary is empty, causing the function to always use the default "mlp". This makes the function less flexible than it appears. To improve clarity, either populate the mapping to support other models or simplify the code to remove the unused mapping and misleading comment.
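If the mapping is kept, an explicit fallback plus an honest comment keeps the code and its documentation aligned; a sketch of that direction (the `get_mlp_attr` helper is illustrative, not from the PR):

```python
# Maps a model's config.model_type to the attribute name of its MLP/FFN
# submodule. Empty today: every currently supported model uses the
# conventional "mlp" attribute; add entries only when a model deviates.
MLP_ATTR_MAPPING: dict = {}
DEFAULT_MLP_ATTR = "mlp"


def get_mlp_attr(model_type: str) -> str:
    """Return the MLP attribute name for a model type, defaulting to "mlp"."""
    return MLP_ATTR_MAPPING.get(model_type, DEFAULT_MLP_ATTR)
```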
examples/offline_weight_load.py
Outdated
merged_dict = {}

if not os.path.isdir(directory):
    raise ValueError(f"目录不存在: {directory}")
The error message is in Chinese. To maintain consistency with the rest of the codebase and ensure accessibility for all contributors, please use English for all user-facing strings, including error messages.
- raise ValueError(f"目录不存在: {directory}")
+ raise ValueError(f"Directory not found: {directory}")
def main(
    local_rank: int,
    rank: int,
    master_addr: str,
    master_port: int,
    model_weight_gib: float,
    model: str = "Qwen/Qwen3-30B-A3B",
    world_size: int = 4,
    tensor_parallel_size: int = 2,
    enable_expert_parallel: bool = False,
    enforce_eager: bool = True,
    trust_remote_code: bool = True,
    enable_sleep_mode: bool = False,
    temperature: float = 0.8,
):
The main function defines default arguments that differ from those in parse_args (e.g., --model). This creates two sources of truth and is confusing. The arguments are then passed individually to main. It would be cleaner to have parse_args as the single source for defaults and pass the args object to main, or at least remove the misleading defaults from main's signature.
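The single-source-of-truth direction could look like this: every default lives in parse_args, and main consumes the parsed namespace instead of a long list of individually passed parameters. A hedged sketch; only a few of the script's flags are shown, with names taken from the usage text above:

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    # The one and only place where defaults are declared.
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default="Qwen/Qwen3-30B-A3B")
    parser.add_argument("--tp-size", type=int, default=2)
    parser.add_argument("--temperature", type=float, default=0.8)
    parser.add_argument("--enable-expert-parallel", action="store_true")
    return parser.parse_args(argv)


def main(args: argparse.Namespace) -> None:
    # main re-declares nothing; it only consumes the parsed namespace.
    print(f"model={args.model} tp={args.tp_size}")


if __name__ == "__main__":
    main(parse_args())
```

With this shape, changing a default is a one-line edit that cannot silently disagree with a second copy in main's signature.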
proc.join(timeout=600)
if proc.exitcode is None:
    print(
        f"Killing process {proc.pid} that didn't stop within 30 minutes."
The process join timeout is set to 600 seconds (10 minutes), but the corresponding log message for a timeout event states '30 minutes'. This inconsistency can be misleading when debugging. Please align the log message with the actual timeout value.
- f"Killing process {proc.pid} that didn't stop within 30 minutes."
+ f"Killing process {proc.pid} that didn't stop within 10 minutes."
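Deriving the log message from the same constant as the timeout removes the possibility of drift entirely; a small sketch (the `reap` and `timeout_message` names are illustrative):

```python
JOIN_TIMEOUT_S = 600  # 10 minutes


def timeout_message(pid: int) -> str:
    # Built from JOIN_TIMEOUT_S, so the text can never disagree with the value.
    return (f"Killing process {pid} that didn't stop within "
            f"{JOIN_TIMEOUT_S // 60} minutes.")


def reap(proc) -> None:
    proc.join(timeout=JOIN_TIMEOUT_S)
    if proc.exitcode is None:
        print(timeout_message(proc.pid))
        proc.kill()
```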
"""
Compare the outputs of vLLM with and without aclgraph.

Run `pytest tests/multicard/test_external_launcher.py`.
The docstring provides an incorrect command to run the test, referencing test_external_launcher.py instead of the current file, test_weight_loader.py. This is likely a copy-paste error and should be corrected to avoid confusion.
- Run `pytest tests/multicard/test_external_launcher.py`.
+ Run `pytest tests/e2e/multicard/test_weight_loader.py`.
approved
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: wangxiaoxin-sherie <[email protected]> Co-authored-by: Shangwei-Li <[email protected]>
@pytest.mark.parametrize("model", MOE_MODELS)
def test_external_launcher_eager(model):
    script = "/usr/local/python3.11.13/bin/python3.11/__w/vllm-ascend/tests/examples/test_weight_loader.py"
I'm okay with this temporarily to quickly enable this test case. Maybe we could try to fix it in the next pr? @momo609
What this PR does / why we need it?
Add e2e test related to weight updates in RL scenarios.
Due to CI issues, the newly added Python test files cannot locate the correct path. As a temporary solution, use absolute paths to add test cases.
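Once the CI path issue is resolved, the script location could be derived from the test file itself rather than an environment-specific absolute path; a sketch under the assumption that the test lives three directory levels below the checkout root (e.g. tests/e2e/multicard/):

```python
from pathlib import Path


def resolve_script(test_file: str) -> Path:
    """Locate the example script relative to the test file.

    Assumes the test sits at <repo>/tests/e2e/multicard/, so the repo
    root is four path components up (parents[3]).
    """
    repo_root = Path(test_file).resolve().parents[3]
    return repo_root / "examples" / "offline_weight_load.py"
```

In the test itself this would be called as `resolve_script(__file__)`, making the test portable across CI runners and local checkouts.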
How was this patch tested?