support cp&sp #2961
base: main
Conversation
Signed-off-by: LookAround <[email protected]>
Signed-off-by: weiguihua2 <[email protected]>
Signed-off-by: tanwenqin <[email protected]>
Signed-off-by: tanwenqin <[email protected]>
Signed-off-by: zhaoyifan <[email protected]>
Signed-off-by: zhangsicheng5 <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
…nto long_seq_pr
# Conflicts:
#	vllm_ascend/attention/attention_v1.py
#	vllm_ascend/ops/fused_moe.py
#	vllm_ascend/worker/model_runner_v1.py
Signed-off-by: LookAround <[email protected]>
Signed-off-by: LookAround <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces support for context parallelism (CP) and sequence parallelism (SP) to enable efficient inference for models with long sequences on Ascend NPUs. The changes are extensive, touching upon the scheduler, model runner, attention implementations (vanilla and MLA), and various operational layers. New metadata structures and logic have been added to manage the distributed state across CP and SP ranks. The implementation appears to leverage hardware-specific features for ring attention and parallel computations. Overall, this is a significant feature addition that enhances long-context capabilities. My review has identified a critical issue related to state management when reordering requests, which needs to be addressed.
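To make the context-parallelism idea concrete for reviewers, here is a minimal, illustrative Python sketch of how a long prompt could be sharded across CP ranks. It is not the code added by this PR; the names shard_sequence_for_cp, cp_rank and cp_world_size are hypothetical.

import torch

def shard_sequence_for_cp(input_ids: torch.Tensor,
                          cp_rank: int,
                          cp_world_size: int) -> torch.Tensor:
    # Illustrative only: each CP rank owns one contiguous chunk of the
    # token sequence; attention over the full context is then assembled
    # by exchanging KV blocks between ranks (e.g. ring attention).
    seq_len = input_ids.shape[0]
    chunk = (seq_len + cp_world_size - 1) // cp_world_size
    pad = chunk * cp_world_size - seq_len
    padded = torch.nn.functional.pad(input_ids, (0, pad))
    return padded[cp_rank * chunk:(cp_rank + 1) * chunk]

# Example: a 10-token prompt split across 4 CP ranks (chunk size 3, last chunk padded).
ids = torch.arange(10)
shards = [shard_sequence_for_cp(ids, r, 4) for r in range(4)]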
…nto long_seq_pr
# Conflicts:
#	vllm_ascend/models/deepseek_v2.py
Force-pushed from 45e9cda to 1977acf
Signed-off-by: LookAround <[email protected]>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: Apocalypse990923-qshi <[email protected]>
fix lint (part) + long_sequence_enable
Signed-off-by: Apocalypse990923-qshi <[email protected]>
merge remote-track main
Signed-off-by: weiguihua2 <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: Apocalypse990923-qshi <[email protected]>
fix lint + mypy check
Signed-off-by: weiguihua2 <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: SunnyLee219 <[email protected]>
Signed-off-by: SunnyLee219 <[email protected]>
Signed-off-by: weiguihua2 <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: weiguihua2 <[email protected]>
Signed-off-by: SunnyLee219 <[email protected]>
Signed-off-by: weiguihua2 <[email protected]>
Signed-off-by: LookAround <[email protected]>
Signed-off-by: weiguihua2 <[email protected]>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: tanwenqin <[email protected]>
Please add more details to the PR message and link the related vLLM PR here.
Force-pushed from ae99fb6 to 566e57f
This pull request has conflicts, please resolve those before we can evaluate the pull request.
tensor_npu = _list_to_tensor(value, self.device)
self.kv_idx_names[key] = tensor_npu

# 处理序列长度相关张量 (handle sequence-length-related tensors)
It's better to use English comments.
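For context on the snippet above, the following is a minimal sketch of what a list-to-device-tensor helper could look like; the PR's actual _list_to_tensor is not included in this excerpt, and the dtype default here is an assumption.

import torch

def _list_to_tensor(value: list, device: torch.device,
                    dtype: torch.dtype = torch.int32) -> torch.Tensor:
    # Build the tensor in host memory first, then move it to the target
    # device (e.g. an Ascend NPU) in a single transfer.
    return torch.tensor(value, dtype=dtype).to(device, non_blocking=True)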
Signed-off-by: Delphine-Nic <[email protected]>
Merge main to 79a910e
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?