Conversation

@inkcherry (Contributor) commented May 7, 2025

Support the use of a sliding window in certain layers of Qwen models.

FIX #17306
FIX #15705

inkcherry added 2 commits May 7, 2025 07:44
Signed-off-by: inkcherry <[email protected]>
Signed-off-by: inkcherry <[email protected]>

github-actions bot commented May 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Signed-off-by: inkcherry <[email protected]>
@@ -273,18 +287,6 @@ def __init__(self,
        cache_config = vllm_config.cache_config
        quant_config = vllm_config.quant_config

        # TODO (@robertgshaw2): see if this can be moved out

Member:

This comment seems to come from @robertgshaw2-redhat; do you have any ideas?

@heheda12345 (Collaborator) left a comment:

@inkcherry Thanks for the PR! You also need to update the code here to select the proper attention backend when interleaved attention is enabled.

interleaved_attn_models = ["gemma2", "gemma3_text", "cohere2"]

assert per_layer_sliding_window > 0, (
    f"per_layer_sliding_window must be positive or "
    f"{NOT_USE_SLIDING_WINDOW} (to force disable)")
sliding_window = per_layer_sliding_window

Collaborator:

Why do you need to change this file? Can you just pass per_layer_sliding_window=None to not use the sliding window?

@inkcherry (Contributor, Author) commented May 11, 2025:

Thanks for the review. If per_layer_sliding_window is None, the code falls back to cache_config to determine the value, rather than setting sliding_window to None.
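
A minimal sketch of the fallback being described (names and the sentinel value are illustrative, not the exact vLLM code): with None, the layer silently inherits the cache config's window, which is why a separate sentinel is needed to force-disable it.

NOT_USE_SLIDING_WINDOW = -1  # assumed sentinel value, for illustration only

def resolve_sliding_window(per_layer_sliding_window, cache_config):
    # None -> fall back to whatever the cache config specifies.
    if per_layer_sliding_window is None:
        return cache_config.sliding_window
    # Sentinel -> explicitly force-disable the sliding window for this layer.
    if per_layer_sliding_window == NOT_USE_SLIDING_WINDOW:
        return None
    return per_layer_sliding_window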

if max_window_layers is None or layer_idx >= max_window_layers:
    return sliding_window

return NOT_USE_SLIDING_WINDOW

Collaborator:

I think max_window_layers is only used in Qwen-series models now. Can you put this function in qwen.py?

@inkcherry (Contributor, Author):

Thanks, moved.

@heheda12345 (Collaborator):

@inkcherry Do we really need to support interleaved sliding window? Both QWQ32B and Qwen2.5-7B-Instruct in the issue you linked have the same num_hidden_layers and max_window_layers, meaning that max_window_layers has no effect.

Signed-off-by: inkcherry <[email protected]>

@inkcherry (Contributor, Author) commented May 11, 2025:

@heheda12345, yes, this assertion is triggered by the default setting: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/blob/main/config.json#L13.
But it's a configurable variable, defined in the model config: https://github.com/huggingface/transformers/blob/716819b8309324302e00a3488a3c3d6faa427f79/src/transformers/models/qwen2/modeling_qwen2.py#L175
This assertion crash prevents users from experimenting with this feature any further.

@heheda12345 (Collaborator):

I think enabling the sliding window is a common need, but adjusting max_window_layers is not.
Implementing max_window_layers needs more discussion, since people may want the following three modes and we need to design the API to differentiate them:

  1. all layers as full attention
  2. first max_window_layers as sliding window and the following layers as full attention
  3. all layers as sliding window attention.

What about adding an assertion for num_hidden_layers == max_window_layers when the sliding window is enabled, and removing the logic for interleaved sliding window? I think that is enough to fix the two issues you mentioned.
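
A minimal sketch of the proposed check (hypothetical config field names and message; the merged code may read differently):

# Sketch of the suggested guard: only uniform SWA (or none) is accepted,
# so the interleaved case is rejected up front.
if getattr(config, "use_sliding_window", False):
    assert config.max_window_layers == config.num_hidden_layers, (
        "Interleaved sliding window is not supported: "
        "max_window_layers must equal num_hidden_layers")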

inkcherry added 2 commits May 12, 2025 02:44
Signed-off-by: inkcherry <[email protected]>
Signed-off-by: inkcherry <[email protected]>

@inkcherry (Contributor, Author) commented May 12, 2025:

> I think enabling the sliding window is a common need, but adjusting max_window_layers is not. Implementing max_window_layers needs more discussion, since people may want the following three modes and we need to design the API to differentiate them:
>
>   1. all layers as full attention
>   2. first max_window_layers as sliding window and the following layers as full attention
>   3. all layers as sliding window attention.
>
> What about adding an assertion for num_hidden_layers == max_window_layers when the sliding window is enabled, and removing the logic for interleaved sliding window? I think that is enough to fix the two issues you mentioned.

Sure, updated.
By the way, the bottom layers were using SWA, but we ignored this setting.
Currently, only None or full SWA is supported, via use_sliding_window.

@heheda12345 heheda12345 changed the title Support the use of sliding window in certain layers [Model] Allow the use of sliding window in Qwen2 May 14, 2025

@heheda12345 (Collaborator) left a comment:

LGTM! Thanks for the improvement.

@heheda12345 heheda12345 enabled auto-merge (squash) May 14, 2025 12:04
@heheda12345 heheda12345 added the ready ONLY add when PR is ready to merge/full CI is needed label May 14, 2025
@vllm-bot vllm-bot merged commit dd2a945 into vllm-project:main May 15, 2025
45 of 47 checks passed
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025

@Yufei-Z commented Aug 13, 2025:

> LGTM! Thanks for the improvement.

> I think enabling the sliding window is a common need, but adjusting max_window_layers is not. Implementing max_window_layers needs more discussion, since people may want the following three modes and we need to design the API to differentiate them:
>
>   1. all layers as full attention
>   2. first max_window_layers as sliding window and the following layers as full attention
>   3. all layers as sliding window attention.
>
> What about adding an assertion for num_hidden_layers == max_window_layers when the sliding window is enabled, and removing the logic for interleaved sliding window? I think that is enough to fix the two issues you mentioned.

Wait... in transformers' implementation:
https://github.com/huggingface/transformers/blob/89c46b648d82b670cc7286a25fa64d2d92770418/src/transformers/models/qwen3/configuration_qwen3.py#L210C9-L216C14

if self.layer_types is None:
    self.layer_types = [
        "sliding_attention"
        if self.sliding_window is not None and i >= self.max_window_layers
        else "full_attention"
        for i in range(self.num_hidden_layers)
    ]
num_hidden_layers == max_window_layers means the sliding window takes no effect.
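
A quick standalone check of that formula (a sketch that mirrors the quoted logic; not vLLM or transformers code) makes the point concrete:

def layer_types(num_hidden_layers, sliding_window, max_window_layers):
    # Mirrors the transformers formula quoted above: layer i slides
    # only when a window is set and i >= max_window_layers.
    return [
        "sliding_attention"
        if sliding_window is not None and i >= max_window_layers
        else "full_attention"
        for i in range(num_hidden_layers)
    ]

# num_hidden_layers == max_window_layers: no layer qualifies, SWA is a no-op.
print(layer_types(4, 4096, 4))  # ['full_attention'] * 4
# Smaller max_window_layers: only the later layers switch to sliding attention.
print(layer_types(4, 4096, 2))  # last two entries become 'sliding_attention'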

@heheda12345 (Collaborator):

Qwen2 doesn't have a sliding window by default. What we want to support is using cache_config.window_size to force running the model with a sliding window. A non-default max_window_layers is out of scope for this PR.

@vadimkantorov:

> Qwen2 doesn't have a sliding window by default. What we want to support is using cache_config.window_size to force running the model with a sliding window.

Does Qwen3 support sliding window? How could I enable it?

I was surprised that vLLM doesn't produce max_tokens output tokens when max_model_len is small, or when the prompt is large (and consumes much of, or even more than, max_model_len tokens).

Thanks :)

@heheda12345 (Collaborator):

Qwen3 doesn't have a sliding window by default. You can try whether you can force-enable it by setting sliding_window. A request ends either when the total length reaches max-model-len or when the output length reaches max-tokens.
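
In code terms, a sketch of that termination behavior using the public vLLM API (model name and limits are illustrative):

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", max_model_len=2048)
params = SamplingParams(max_tokens=256)

# Generation stops at whichever limit is hit first: the total length
# (prompt + output) reaching max_model_len, or the output length
# reaching max_tokens (or an EOS token being sampled).
outputs = llm.generate(["Explain sliding window attention."], params)
print(outputs[0].outputs[0].text)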

@vadimkantorov commented Aug 18, 2025:

> force-enable it by setting sliding_window

Somehow pass sliding_window in vllm.LLM(...) args?
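
One plausible route (a hedged sketch: hf_overrides is an existing vllm.LLM argument for overriding Hugging Face config fields, but whether overriding sliding_window this way actually force-enables SWA for Qwen3 is exactly the untested question here):

from vllm import LLM

# Untested sketch: override the HF config's sliding_window field at load time.
llm = LLM(
    model="Qwen/Qwen3-8B",
    hf_overrides={"sliding_window": 4096},
)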

> A request ends either when the total length reaches max-model-len or when the output length reaches max-tokens.

For a sliding window, is it possible to terminate when the number of response tokens hits some number? Otherwise, it's possible that a long prompt exhausts max-model-len and no response tokens are generated, which is strange given that this should be technically possible with a sliding window.

Basically, I'm trying to understand the semantics of sliding_window in vLLM.
