
Conversation

zhangsicheng5
Collaborator

Merge main to 79a910e

1Fire4 and others added 10 commits September 17, 2025 12:00
### What this PR does / why we need it?
Add an option to enable frozen parameters.

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@68dbde5

Signed-off-by: 1Fire4 <[email protected]>
### What this PR does / why we need it?
vllm-project#2849 moves the
implementation of `shared_expert_dp` to torchair deepseek_modeling.
However, calling `set_forward_context` with `enforce_eager` and
`shared_expert_dp` falls back to the implementation in
model_runner_v1.py, which sets the global attn_metadata as a dictionary.
This leads to a RuntimeError when attn_metadata is retrieved from the
forward context and used in torchair_deepseek_v2.py. This PR fixes the
problem by transforming attn_metadata in this file, as sketched below.
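A minimal sketch of the idea, assuming vLLM's public `get_forward_context` API; the helper name and the layer-name plumbing are illustrative, not the PR's exact code:

```python
from vllm.forward_context import get_forward_context

def _resolve_attn_metadata(layer_name: str):
    """Normalize attn_metadata before it is used in torchair_deepseek_v2.py."""
    attn_metadata = get_forward_context().attn_metadata
    # Under shared_expert_dp with enforce_eager, model_runner_v1.py stores a
    # dict keyed by attention layer name; pick out this layer's entry so the
    # torchair code sees a single metadata object instead of the dict.
    if isinstance(attn_metadata, dict):
        attn_metadata = attn_metadata.get(layer_name)
    return attn_metadata
```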

Note that the current E2E tests lack a case for DeepSeek with
`shared_expert_dp`. We need to add an ST with `shared_expert_dp` to the
testing workflow.

### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
e2e vllm serving with `enable_shared_expert_dp: true` passed.

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@de3e53a

Signed-off-by: linfeng-yuan <[email protected]>
When `enable_kv_nz` is true, the output of DeepSeek R1 is invalid.

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@2b85697

Signed-off-by: realliujiaxu <[email protected]>
### What this PR does / why we need it?
Added a new connector for Mooncake store integration to enable KV-cache
reuse in scenarios with system prompts or multi-turn dialogues.
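A conceptual sketch of the reuse idea, not the connector's actual code (block size and hashing scheme are assumptions): KV blocks are keyed by a rolling hash of the token prefix, so a request that shares a system prompt or earlier dialogue turns can look up cached blocks in the store instead of recomputing them.

```python
import hashlib

def prefix_block_keys(token_ids: list[int], block_size: int = 128) -> list[str]:
    """One key per full block; each key covers the entire prefix up to it."""
    keys: list[str] = []
    running = hashlib.sha256()
    for start in range(0, len(token_ids) - len(token_ids) % block_size, block_size):
        running.update(str(token_ids[start:start + block_size]).encode())
        keys.append(running.hexdigest())
    return keys

# Two dialogues sharing a 256-token system prompt produce identical leading
# keys, so the store can serve those blocks from cache.
sys_prompt = list(range(256))
a = prefix_block_keys(sys_prompt + [1, 2, 3] + [0] * 125)
b = prefix_block_keys(sys_prompt + [9, 9, 9] + [0] * 125)
assert a[:2] == b[:2] and a[2] != b[2]
```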

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@5963b98

---------

Signed-off-by: LCAIZJ <[email protected]>
Signed-off-by: fems14 <[email protected]>
Co-authored-by: fems14 <[email protected]>
Co-authored-by: Dreamerleader <[email protected]>
Co-authored-by: Pz1116 <[email protected]>
Co-authored-by: lizy124 <[email protected]>
Co-authored-by: zouyida2052 <[email protected]>
### What this PR does / why we need it?
This PR depends on the merge of vllm-project#2707 and adapts the aclgraph
functionality to support MTP.

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@2b85697

---------

Signed-off-by: xuyexiong <[email protected]>
(vllm-project#2901)

### What this PR does / why we need it?
[Bugfix]: replace npu_incre_flash_attention with
npu_fused_infer_attention_score so that the tiling can be updated.
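An illustrative before/after of the substitution, assuming an Ascend NPU environment with torch_npu available; the head count, scale, and layout below are placeholders rather than the PR's exact arguments:

```python
import torch
import torch_npu  # requires an Ascend NPU environment

q = torch.randn(4, 1, 1024, dtype=torch.float16, device="npu")    # BSH layout
k = torch.randn(4, 256, 1024, dtype=torch.float16, device="npu")
v = torch.randn(4, 256, 1024, dtype=torch.float16, device="npu")

# Before: torch_npu.npu_incre_flash_attention(q, k, v, ...), whose tiling
# cannot be refreshed between invocations.
# After: the fused kernel, which supports tiling updates.
out, _ = torch_npu.npu_fused_infer_attention_score(
    q, k, v, num_heads=8, input_layout="BSH", scale=1.0 / 128 ** 0.5)
```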

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@2b85697

Signed-off-by: p00465316 <[email protected]>
Co-authored-by: p00465316 <[email protected]>
### What this PR does / why we need it?
The current linear.py has the following issues:

- There is redundant conditional logic in the `comm_group` and `forward`
selection for classes such as `AscendMergedColumnParallelLinear`.

- Inconsistent comm_group selection logic exists among
`AscendMergedColumnParallelLinear`, `AscendColumnParallelLinear`, and
`AscendQKVParallelLinear`.

To address these two issues, this PR encapsulates `comm_group` and
`forward` into classes and extracts the class-selection logic into
common functions. For future additions of custom communication groups or
forward methods, it will only be necessary to extend
`CustomColumnParallelOp` or `CustomRowParallelOp` and add the new
selection logic, as in the sketch below.
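A minimal sketch of the pattern under stated assumptions (every name except `CustomColumnParallelOp` is hypothetical): each op class bundles a comm group with its matching forward, and one shared selector replaces the per-class conditionals.

```python
from abc import ABC, abstractmethod

class CustomColumnParallelOp(ABC):
    """Pairs a communication group with the forward method that uses it."""
    comm_group: str

    @abstractmethod
    def forward(self, x):
        ...

class TPColumnParallelOp(CustomColumnParallelOp):
    comm_group = "tp"
    def forward(self, x):
        return x  # placeholder: a real op gathers output across the TP group

class DPColumnParallelOp(CustomColumnParallelOp):
    comm_group = "dp"
    def forward(self, x):
        return x  # placeholder

def select_column_parallel_op(prefix: str) -> CustomColumnParallelOp:
    # One selection function shared by AscendMergedColumnParallelLinear,
    # AscendColumnParallelLinear, and AscendQKVParallelLinear; a new op
    # class only needs a new branch here.
    if "shared_experts" in prefix:
        return DPColumnParallelOp()
    return TPColumnParallelOp()
```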

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@dd39baf

---------

Signed-off-by: realliujiaxu <[email protected]>
Co-authored-by: weijinqian0 <[email protected]>
### What this PR does / why we need it?
Add multi-node ray backend tutorial for Qwen235B-A3B

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@f4cd80f

---------

Signed-off-by: wangli <[email protected]>
(vllm-project#2681)

This PR fixes two problems that occur when `multistream_moe` is enabled
in torchair graph mode:
1. check for `TorchairAscendW8A8DynamicFusedMoEMethod` instead of the
incorrect `AscendW8A8DynamicFusedMoEMethod`
2. `mc2_mask` should be chunked in the forward function of
`TorchairAscendFusedMoE` regardless of whether `replace_allreduce` is
True or False (see the sketch below)
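A runnable sketch of the second fix (the `tp_size`/`tp_rank` plumbing is assumed): the chunking is hoisted out of the `replace_allreduce` branch so that every path hands each TP rank its own slice of `mc2_mask`.

```python
import torch

def chunk_mc2_mask(mc2_mask: torch.Tensor, tp_size: int, tp_rank: int) -> torch.Tensor:
    # Applied whether or not replace_allreduce is set.
    return torch.tensor_split(mc2_mask, tp_size, dim=0)[tp_rank]

mask = torch.ones(16, dtype=torch.bool)
local = chunk_mc2_mask(mask, tp_size=4, tp_rank=1)
assert local.shape[0] == 4  # this rank's 4-token slice
```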

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@0fb2551

Signed-off-by: linfeng-yuan <[email protected]>

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@github-actions bot added the documentation (Improvements or additions to documentation), module:core, module:tests, and module:ops labels on Sep 18, 2025
@LookAround0301 merged commit 2f5102f into LookAround0301:long_seq_pr on Sep 19, 2025
2 checks passed