[Model] Allow the use of sliding window in Qwen2 #17772
Conversation
Signed-off-by: inkcherry <[email protected]>
Signed-off-by: inkcherry <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: add 🚀 …
Signed-off-by: inkcherry <[email protected]>
@@ -273,18 +287,6 @@ def __init__(self,
        cache_config = vllm_config.cache_config
        quant_config = vllm_config.quant_config

        # TODO (@robertgshaw2): see if this can be moved out
This comment seems to come from @robertgshaw2-redhat; do you have any ideas?
@inkcherry Thanks for the PR! You also need to update the code here to select the proper attention backend when interleaved attention is enabled.
Line 511 in 85b72cb:
interleaved_attn_models = ["gemma2", "gemma3_text", "cohere2"]
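A minimal sketch of the kind of change being requested here, assuming the list above gates which model types get an attention backend that supports per-layer (interleaved) sliding windows; the helper name and surrounding code are illustrative, not the actual vLLM source:

# Models whose layers mix sliding-window and full attention need an attention
# backend that supports per-layer sliding windows. Adding "qwen2" is the
# hypothetical change implied by the review comment.
interleaved_attn_models = ["gemma2", "gemma3_text", "cohere2", "qwen2"]

def uses_interleaved_attention(hf_config) -> bool:
    # model_type comes from the Hugging Face config, e.g. "qwen2" for Qwen2/2.5.
    return getattr(hf_config, "model_type", "") in interleaved_attn_models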
vllm/attention/layer.py (outdated)
assert per_layer_sliding_window > 0, (
    f"per_layer_sliding_window must be positive or "
    f"{NOT_USE_SLIDING_WINDOW} (to force disable)")
sliding_window = per_layer_sliding_window
Why do you need to change this file? Can you just pass per_layer_sliding_window=None to not use the sliding window?
Thanks for the review. If per_layer_sliding_window is None, the value falls back to cache_config instead of setting sliding_window to None.
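A minimal sketch of the fallback behavior being described, with hypothetical names (the sentinel value and the exact signature in vllm/attention/layer.py are assumptions for illustration):

# Hypothetical sentinel distinguishing "explicitly disable sliding window"
# from "no per-layer override given" (None), which falls back to cache_config.
NOT_USE_SLIDING_WINDOW = -1

def resolve_sliding_window(per_layer_sliding_window, cache_config):
    if per_layer_sliding_window == NOT_USE_SLIDING_WINDOW:
        # The layer explicitly opts out of sliding-window attention.
        return None
    if per_layer_sliding_window is not None:
        assert per_layer_sliding_window > 0, (
            "per_layer_sliding_window must be positive or "
            f"{NOT_USE_SLIDING_WINDOW} (to force disable)")
        return per_layer_sliding_window
    # No per-layer value: fall back to the global cache_config setting.
    return cache_config.sliding_window if cache_config is not None else None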
vllm/model_executor/models/utils.py (outdated)
if max_window_layers is None or layer_idx >= max_window_layers:
    return sliding_window

return NOT_USE_SLIDING_WINDOW
I think max_window_layers is only used in Qwen-series models now. Can you put this function in qwen.py?
thanks, moved.
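A sketch of what the moved per-layer helper could look like inside the Qwen2 model file; the function name, signature, and sentinel value are assumptions, not the merged code:

from typing import Optional

# Hypothetical sentinel matching the attention-layer change discussed above.
NOT_USE_SLIDING_WINDOW = -1

def get_layer_sliding_window(layer_idx: int, sliding_window: int,
                             max_window_layers: Optional[int]) -> int:
    # In Qwen2-style configs, layers with index >= max_window_layers are the
    # ones that use sliding-window attention when it is enabled.
    if max_window_layers is None or layer_idx >= max_window_layers:
        return sliding_window
    # Earlier layers keep full attention.
    return NOT_USE_SLIDING_WINDOW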
@inkcherry Do we really need to support interleaved sliding window? Both QWQ32B and Qwen2.5-7B-Instruct in the issue you linked have the same …
Signed-off-by: inkcherry <[email protected]>
@heheda12345 Yes, this assertion is triggered by the default setting: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/blob/main/config.json#L13.
I think enabling the sliding window is a common need, but adjusting max_window_layers is not. What about adding an assertion for …
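The suggested assertion is cut off above; one plausible reading, consistent with the later comment that a non-default max_window_layers is out of scope, is sketched below as an assumption rather than the actual merged check:

def check_uniform_sliding_window(config) -> None:
    # Hypothetical check: when the sliding window is force-enabled, only
    # support configs where the window applies uniformly to every layer.
    assert config.max_window_layers == config.num_hidden_layers, (
        "Sliding window for only some layers is not supported "
        f"(max_window_layers={config.max_window_layers}, "
        f"num_hidden_layers={config.num_hidden_layers}).")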
Signed-off-by: inkcherry <[email protected]>
Signed-off-by: inkcherry <[email protected]>
Sure, updated.
LGTM! Thanks for the improvement.
Signed-off-by: inkcherry <[email protected]> Signed-off-by: Yuqi Zhang <[email protected]>
Signed-off-by: inkcherry <[email protected]> Signed-off-by: minpeter <[email protected]>
Wait... in transformers' implementation: if self.layer_types is None: …
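The quoted snippet is cut off; roughly, recent transformers versions derive a per-layer attention type for Qwen2 when layer_types is not set explicitly. The standalone paraphrase below is an assumption and may not match the exact condition in transformers:

def default_layer_types(sliding_window, max_window_layers, num_hidden_layers):
    # Paraphrased sketch: mark layers at or beyond max_window_layers as
    # sliding-window layers, and the rest as full attention.
    return [
        "sliding_attention"
        if sliding_window is not None and i >= max_window_layers
        else "full_attention"
        for i in range(num_hidden_layers)
    ]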
Qwen2 doesn't use a sliding window by default. What we want to support is using cache_config.window_size to force running the model with a sliding window; a non-default max_window_layers is out of scope for this PR.
Does Qwen3 support sliding window? How can I enable it? I was surprised that vLLM doesn't produce … Thanks :)
Qwen3 doesn't use a sliding window by default. You can try to force-enable it by setting …
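The exact setting is cut off above; one way to experiment, shown purely as an assumption, is to override the Hugging Face config flags through vLLM's hf_overrides argument (the model may not have been trained with this window, so output quality is not guaranteed):

from vllm import LLM

# Hypothetical override: flip the HF config flags that gate sliding window.
# Whether this is the setting the maintainer meant is an assumption.
llm = LLM(
    model="Qwen/Qwen3-8B",
    hf_overrides={"use_sliding_window": True, "sliding_window": 4096},
)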
Somehow pass …

For sliding window, is it possible to terminate generation when the number of response tokens hits some limit? Otherwise, it's possible that a long prompt still exhausts … Basically, I'm trying to understand the semantics of …
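Not from the thread, but for the response-length part of the question: sliding-window attention bounds how far back each token can attend; it does not stop generation. Capping the number of generated tokens is done with SamplingParams. A minimal example (the model choice is illustrative):

from vllm import LLM, SamplingParams

# max_tokens caps the number of generated (response) tokens, independent of
# whether the model uses sliding-window attention.
params = SamplingParams(max_tokens=512)
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
outputs = llm.generate(["Explain sliding-window attention."], params)
print(outputs[0].outputs[0].text)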
Support the use of sliding window in certain layers of Qwen
FIX #17306
FIX #15705