Conversation

@inkcherry (Contributor) commented May 7, 2025

Support the use of a sliding window in certain layers of Qwen models.

FIX #17306
FIX #15705

inkcherry added 2 commits May 7, 2025 07:44
Signed-off-by: inkcherry <[email protected]>
Signed-off-by: inkcherry <[email protected]>

github-actions bot commented May 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Signed-off-by: inkcherry <[email protected]>
@@ -273,18 +287,6 @@ def __init__(self,
        cache_config = vllm_config.cache_config
        quant_config = vllm_config.quant_config

        # TODO (@robertgshaw2): see if this can be moved out

Member:

This comment seems to come from @robertgshaw2-redhat; do you have any ideas?

@heheda12345 (Collaborator) left a comment:

@inkcherry Thanks for the PR! You also need to update the code here to select the proper attention backend when interleaved attention is enabled.

interleaved_attn_models = ["gemma2", "gemma3_text", "cohere2"]

assert per_layer_sliding_window > 0, (
    f"per_layer_sliding_window must be positive or "
    f"{NOT_USE_SLIDING_WINDOW} (to force disable)")
sliding_window = per_layer_sliding_window

Collaborator:

Why do you need to change this file? Can you just pass per_layer_sliding_window=None to not use the sliding window?

@inkcherry (Contributor, Author) commented May 11, 2025:

Thanks for the review. If per_layer_sliding_window is None, the code falls back to cache_config to determine the value, rather than setting sliding_window to None.
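
A minimal sketch of the fallback being described (names and the sentinel value are illustrative, not the exact vLLM code): with None, the layer silently inherits the cache config's window, which is why a separate sentinel is needed to force-disable it.

NOT_USE_SLIDING_WINDOW = -1  # assumed sentinel value, for illustration only

def resolve_sliding_window(per_layer_sliding_window, cache_config):
    # None -> fall back to whatever the cache config specifies.
    if per_layer_sliding_window is None:
        return cache_config.sliding_window
    # Sentinel -> explicitly force-disable the sliding window for this layer.
    if per_layer_sliding_window == NOT_USE_SLIDING_WINDOW:
        return None
    return per_layer_sliding_window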

if max_window_layers is None or layer_idx >= max_window_layers:
    return sliding_window

return NOT_USE_SLIDING_WINDOW

Collaborator:

I think max_window_layers is only used in Qwen-series models now. Can you put this function in qwen.py?

@inkcherry (Contributor, Author):

Thanks, moved.

@heheda12345 (Collaborator):

@inkcherry Do we really need to support interleaved sliding window? Both QWQ32B and Qwen2.5-7B-Instruct in the issue you linked have the same num_hidden_layers and max_window_layers, meaning that max_window_layers has no effect.

Signed-off-by: inkcherry <[email protected]>

@inkcherry (Contributor, Author) commented May 11, 2025:

@heheda12345, yes, this assertion is triggered by the default setting: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/blob/main/config.json#L13.
But it's a configurable variable, defined in the model config: https://github.com/huggingface/transformers/blob/716819b8309324302e00a3488a3c3d6faa427f79/src/transformers/models/qwen2/modeling_qwen2.py#L175
This assertion crash prevents users from experimenting with this feature any further.

@heheda12345 (Collaborator):

I think enabling the sliding window is a common need, but adjusting max_window_layers is not.
Implementing max_window_layers needs more discussion, since people may want the following three modes and we need to design the API to differentiate them:

  1. all layers as full attention
  2. first max_window_layers as sliding window and the following layers as full attention
  3. all layers as sliding window attention.

What about adding an assertion for num_hidden_layers == max_window_layers when the sliding window is enabled, and removing the logic for interleaved sliding window? I think that is enough to fix the two issues you mentioned.
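
A minimal sketch of the proposed check (hypothetical config field names and message; the merged code may read differently):

# Sketch of the suggested guard: only uniform SWA (or none) is accepted,
# so the interleaved case is rejected up front.
if getattr(config, "use_sliding_window", False):
    assert config.max_window_layers == config.num_hidden_layers, (
        "Interleaved sliding window is not supported: "
        "max_window_layers must equal num_hidden_layers")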

inkcherry added 2 commits May 12, 2025 02:44
Signed-off-by: inkcherry <[email protected]>
Signed-off-by: inkcherry <[email protected]>

@inkcherry (Contributor, Author) commented May 12, 2025:

> I think enabling the sliding window is a common need, but adjusting max_window_layers is not. Implementing max_window_layers needs more discussion, since people may want the following three modes and we need to design the API to differentiate them:
>
>   1. all layers as full attention
>   2. first max_window_layers as sliding window and the following layers as full attention
>   3. all layers as sliding window attention.
>
> What about adding an assertion for num_hidden_layers == max_window_layers when the sliding window is enabled, and removing the logic for interleaved sliding window? I think that is enough to fix the two issues you mentioned.

Sure, updated.
By the way, the bottom layers were using SWA, but we ignored this setting.
Currently, only None or full SWA is supported, via use_sliding_window.

@heheda12345 heheda12345 changed the title Support the use of sliding window in certain layers [Model] Allow the use of sliding window in Qwen2 May 14, 2025

@heheda12345 (Collaborator) left a comment:

LGTM! Thanks for the improvement.

@heheda12345 heheda12345 enabled auto-merge (squash) May 14, 2025 12:04
@heheda12345 heheda12345 added the ready ONLY add when PR is ready to merge/full CI is needed label May 14, 2025
@vllm-bot vllm-bot merged commit dd2a945 into vllm-project:main May 15, 2025
45 of 47 checks passed
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025

@Yufei-Z commented Aug 13, 2025:

> LGTM! Thanks for the improvement.

> I think enabling the sliding window is a common need, but adjusting max_window_layers is not. Implementing max_window_layers needs more discussion, since people may want the following three modes and we need to design the API to differentiate them:
>
>   1. all layers as full attention
>   2. first max_window_layers as sliding window and the following layers as full attention
>   3. all layers as sliding window attention.
>
> What about adding an assertion for num_hidden_layers == max_window_layers when the sliding window is enabled, and removing the logic for interleaved sliding window? I think that is enough to fix the two issues you mentioned.

Wait... in transformers' implementation:
https://github.com/huggingface/transformers/blob/89c46b648d82b670cc7286a25fa64d2d92770418/src/transformers/models/qwen3/configuration_qwen3.py#L210C9-L216C14

if self.layer_types is None:
    self.layer_types = [
        "sliding_attention"
        if self.sliding_window is not None and i >= self.max_window_layers
        else "full_attention"
        for i in range(self.num_hidden_layers)
    ]
num_hidden_layers == max_window_layers means the sliding window takes no effect.
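
A quick standalone check of that formula (a sketch that mirrors the quoted logic; not vLLM or transformers code) makes the point concrete:

def layer_types(num_hidden_layers, sliding_window, max_window_layers):
    # Mirrors the transformers formula quoted above: layer i slides
    # only when a window is set and i >= max_window_layers.
    return [
        "sliding_attention"
        if sliding_window is not None and i >= max_window_layers
        else "full_attention"
        for i in range(num_hidden_layers)
    ]

# num_hidden_layers == max_window_layers: no layer qualifies, SWA is a no-op.
print(layer_types(4, 4096, 4))  # ['full_attention'] * 4
# Smaller max_window_layers: only the later layers switch to sliding attention.
print(layer_types(4, 4096, 2))  # last two entries become 'sliding_attention'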

@heheda12345 (Collaborator):

Qwen2 doesn't have a sliding window by default. What we want to support is using cache_config.window_size to force running the model with a sliding window. A non-default max_window_layers is out of scope for this PR.

@vadimkantorov:

> Qwen2 doesn't have a sliding window by default. What we want to support is using cache_config.window_size to force running the model with a sliding window.

Does Qwen3 support sliding window? How could I enable it?

I was surprised that vLLM doesn't produce max_tokens output tokens when max_model_len is small, or when the prompt is large (and consumes much of, or even more than, max_model_len tokens).

Thanks :)

@heheda12345 (Collaborator):

Qwen3 doesn't have a sliding window by default. You can try whether you can force-enable it by setting sliding_window. A request ends either when the total length reaches max-model-len or when the output length reaches max-tokens.
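
In code terms, a sketch of that termination behavior using the public vLLM API (model name and limits are illustrative):

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", max_model_len=2048)
params = SamplingParams(max_tokens=256)

# Generation stops at whichever limit is hit first: the total length
# (prompt + output) reaching max_model_len, or the output length
# reaching max_tokens (or an EOS token being sampled).
outputs = llm.generate(["Explain sliding window attention."], params)
print(outputs[0].outputs[0].text)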

@vadimkantorov commented Aug 18, 2025:

> force-enable it by setting sliding_window

Somehow pass sliding_window in vllm.LLM(...) args?
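
One plausible route (a hedged sketch: hf_overrides is an existing vllm.LLM argument for overriding Hugging Face config fields, but whether overriding sliding_window this way actually force-enables SWA for Qwen3 is exactly the untested question here):

from vllm import LLM

# Untested sketch: override the HF config's sliding_window field at load time.
llm = LLM(
    model="Qwen/Qwen3-8B",
    hf_overrides={"sliding_window": 4096},
)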

> A request ends either when the total length reaches max-model-len or when the output length reaches max-tokens.

For a sliding window, is it possible to terminate when the number of response tokens hits some number? Otherwise, it's possible that a long prompt exhausts max-model-len and no response tokens are generated, which is strange given that this should be technically possible with a sliding window.

Basically, I'm trying to understand the semantics of sliding_window in vLLM.
