
Conversation

@linfeng-yuan (Collaborator) commented on Sep 16, 2025

What this PR does / why we need it?

#2849 moved the implementation of shared_expert_dp into the torchair DeepSeek modeling code. However, when enforce_eager and shared_expert_dp are both enabled, the call to set_forward_context falls back to the implementation in model_runner_v1.py, which sets the global attn_metadata as a dictionary keyed by layer name. This leads to a RuntimeError when attn_metadata is retrieved from the forward context and used in torchair_deepseek_v2.py. This PR fixes the problem by unwrapping the per-layer attn_metadata from that dictionary in this file.
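The fix amounts to unwrapping the per-layer dict before use. A minimal sketch of the idea (the helper name here is hypothetical and the key format is inferred from the PR diff quoted further down; the authoritative code is in torchair_deepseek_v2.py):

    from vllm.forward_context import get_forward_context

    def _resolve_attn_metadata(layer_name: str):
        """Hypothetical helper: return this layer's attention metadata.

        With enforce_eager + shared_expert_dp, model_runner_v1.py stores
        attn_metadata in the forward context as a dict keyed by attention
        layer name, so it must be unwrapped before the torchair model
        code can use it.
        """
        attn_metadata = get_forward_context().attn_metadata
        if attn_metadata is not None and isinstance(attn_metadata, dict):
            # e.g. layer_name == "model.layers.0.self_attn.attn"
            attn_metadata = attn_metadata[layer_name]
        return attn_metadata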

Note that the current E2E tests lack a case covering DeepSeek with shared_expert_dp; we need to add an ST with shared_expert_dp to the testing workflow.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

E2E vLLM serving with enable_shared_expert_dp: true passed.
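For reference, a rough sketch of how such a run could be reproduced offline (the model name is a placeholder, and passing additional_config through the LLM entry point is an assumption; the actual verification was done via online serving):

    from vllm import LLM, SamplingParams

    # Placeholder model; the PR was verified with a DeepSeek model.
    llm = LLM(
        model="deepseek-ai/DeepSeek-V2-Lite",
        enforce_eager=True,  # part of the combination that triggered the bug
        additional_config={"enable_shared_expert_dp": True},  # assumed knob
    )
    print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))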


👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Fill in the PR description when writing the commit message, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request addresses a bug in handling attention metadata for shared expert data parallelism with hybrid KV cache. The proposed change correctly retrieves attention metadata from a dictionary. However, the implementation uses a hardcoded key for layer 0, which is incorrect for other layers and can lead to critical errors in multi-layer models. I've suggested a fix to dynamically construct the key using the current layer's index, ensuring the correct metadata is always used.


    attn_metadata = get_forward_context().attn_metadata
    if attn_metadata is not None and isinstance(attn_metadata, dict):
        attn_metadata = attn_metadata['model.layers.0.self_attn.attn']

critical

Using a hardcoded key 'model.layers.0.self_attn.attn' to access attention metadata is incorrect. This will fetch metadata for layer 0 regardless of the current layer being processed, which can lead to erroneous behavior, especially in multi-layer models. The key should be constructed dynamically using the current layer's index (self.layer_idx) to ensure the correct metadata is used.

Suggested change:
-    attn_metadata = attn_metadata['model.layers.0.self_attn.attn']
+    attn_metadata = attn_metadata[f"model.layers.{self.layer_idx}.self_attn.attn"]
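As a design note, an alternative that avoids hardcoding the prefix format is to derive the key from the prefix vLLM passes into each module at construction; a rough sketch (the class and attribute names here are hypothetical, for illustration only):

    class AttentionWrapper:  # hypothetical, not the PR's actual class
        def __init__(self, prefix: str):
            # e.g. prefix == "model.layers.3.self_attn"
            self.attn_key = f"{prefix}.attn"

        def unwrap(self, attn_metadata):
            # Index the per-layer dict with this layer's own registered
            # name, so the lookup survives module-path changes.
            if isinstance(attn_metadata, dict):
                attn_metadata = attn_metadata[self.attn_key]
            return attn_metadata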

@wangxiyuan (Collaborator) commented:
Please update the commit message to explain why the e2e test passed, and should we update the e2e tests as well?

@linfeng-yuan (Collaborator, Author) replied:
> Please update the commit message to explain why the e2e test passed, and should we update the e2e tests as well?

I've updated the commit message and plan to add this ST before this weekend~

@wangxiyuan added the ready (ready for review) and ready-for-test (start test by label for PR) labels on Sep 17, 2025
@wangxiyuan merged commit 8bcc0cc into vllm-project:main on Sep 17, 2025
44 checks passed