[bugfix] fix shared expert dp with hybrid kvcache #2964
Conversation
Signed-off-by: linfeng-yuan <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request addresses a bug in handling attention metadata for shared expert data parallelism with hybrid KV cache. The proposed change correctly retrieves attention metadata from a dictionary. However, the implementation uses a hardcoded key for layer 0, which is incorrect for other layers and can lead to critical errors in multi-layer models. I've suggested a fix to dynamically construct the key using the current layer's index, ensuring the correct metadata is always used.
attn_metadata = get_forward_context().attn_metadata
if attn_metadata is not None and isinstance(attn_metadata, dict):
    attn_metadata = attn_metadata['model.layers.0.self_attn.attn']
Using a hardcoded key 'model.layers.0.self_attn.attn' to access attention metadata is incorrect. This will fetch metadata for layer 0 regardless of the current layer being processed, which can lead to erroneous behavior, especially in multi-layer models. The key should be constructed dynamically using the current layer's index (self.layer_idx) to ensure the correct metadata is used.
Suggested change:
- attn_metadata = attn_metadata['model.layers.0.self_attn.attn']
+ attn_metadata = attn_metadata[f"model.layers.{self.layer_idx}.self_attn.attn"]
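A minimal, self-contained sketch of the per-layer lookup this suggestion describes (not vLLM-Ascend code; the helper name, the toy metadata dict, and the layer count are illustrative assumptions):

# Sketch: the forward context stores one attn_metadata entry per decoder layer,
# keyed by that layer's attention module name, so the lookup key must include
# the current layer index instead of a hardcoded layer 0.
def select_layer_metadata(attn_metadata, layer_idx):
    # Mirrors the suggested fix: index the dict with the current layer's key.
    if attn_metadata is not None and isinstance(attn_metadata, dict):
        return attn_metadata[f"model.layers.{layer_idx}.self_attn.attn"]
    return attn_metadata

# Toy forward-context dict with one (fake) metadata entry per decoder layer.
toy_metadata = {f"model.layers.{i}.self_attn.attn": {"layer": i} for i in range(3)}

assert select_layer_metadata(toy_metadata, 2) == {"layer": 2}
# A hardcoded 'model.layers.0.self_attn.attn' lookup would return {"layer": 0} here.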
Please update the commit message to explain why the e2e test passed. Should we update the e2e test as well?
I've updated the commit message and plan to add this ST before this weekend.
What this PR does / why we need it?
#2849 moves the implementation of shared_expert_dp to torchair deepseek_modeling. However, the call to set_forward_context with enforce_eager and shared_expert_dp falls back to the implementation in model_runner_v1.py, which sets the global attn_metadata as a dictionary. This leads to a RuntimeError when attn_metadata is retrieved from the forward context and used in torchair_deepseek_v2.py; a minimal sketch of this mismatch follows below. This PR fixes the problem by transforming attn_metadata in this file.
Note that the current E2E tests lack a deepseek case with shared_expert_dp. We need to add an ST with shared_expert_dp to the testing workflow.
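A minimal sketch of the mismatch described above (illustrative only, not the actual vLLM-Ascend code; the class, field, and error type here are assumptions, and the real failure surfaces as a RuntimeError in torchair_deepseek_v2.py):

from dataclasses import dataclass

# Stand-in for a single-layer attention metadata object with a field the
# model code reads directly.
@dataclass
class FakeAttnMetadata:
    num_actual_tokens: int

def read_metadata(attn_metadata):
    # Mirrors model code written against a single metadata object.
    return attn_metadata.num_actual_tokens

single = FakeAttnMetadata(num_actual_tokens=8)
per_layer_dict = {"model.layers.0.self_attn.attn": single}

print(read_metadata(single))       # works: 8
try:
    read_metadata(per_layer_dict)  # fails: the dict has no such attribute
except AttributeError as err:
    print(f"dict-shaped attn_metadata breaks attribute access: {err}")

Selecting the current layer's entry from the dict before use (as in the review suggestion above) restores the single-object shape the model code expects.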
Does this PR introduce any user-facing change?
No.
How was this patch tested?
E2E vLLM serving with enable_shared_expert_dp: true passed.