[GPU][QWen2-VL][QWen2.5-VL] improve SDPA performance with cu_seqlens and cu_window_seqlens #2330
Conversation
```cpp
std::vector<int32_t> cu_seqlens;
cu_seqlens.push_back(0);
int32_t cumsum = 0;
for (const auto& grid_thw : reordered_images_grid_thw) {
    size_t slice_len = grid_thw.at(1) * grid_thw.at(2);
    for (size_t t = 0; t < grid_thw.at(0); ++t) {
        cumsum += slice_len;
        cu_seqlens.push_back(cumsum);
    }
}

ov::Tensor t_cu_seqlens = ov::Tensor(ov::element::i32, {cu_seqlens.size()});
auto* ptr = static_cast<int32_t*>(t_cu_seqlens.data());
for (size_t n = 0; n < cu_seqlens.size(); n++) {
    ptr[n] = cu_seqlens[n];
}
```
I think we can create an OV sub-graph that computes cu_seqlens from reordered_images_grid_thw and attach it to the main graph. I don't want us to add more computation in C++; I'd rather offload this work to the plugin side.
Hmm, sounds good. However, to convert attention_mask to cu_seqlens, we would still have to compute attention_mask in C++ code, and that costs more memory.
I'm not sure about the memory increase. If it exists, then you have it in your C++ implementation too. Please elaborate on the concern.
Still not resolved?
No, I don't have it in the C++ implementation. The memory of attention_mask is non-negligible in long contexts.
Let's compare the following two choices:
- C++ code get_attention_mask(reordered_images_grid_thw) -> map attention_mask to cu_seqlens as part of the main graph -> use cu_seqlens in the plugin
- C++ code get_cu_seqlens(reordered_images_grid_thw) -> use cu_seqlens in the plugin

Then you see the second choice is preferable. Do you agree?
You don't need to compute attention_mask. You should compute cu_seqlens directly from reordered_images_grid_thw. For this you need to create an ov::Model, compile it, and execute it in the OV runtime. You can also stitch this graph into the main one.
I got it. So it is better to make either get_cu_seqlens or get_attention_mask part of the graph. How about we handle this piece of preprocessing work in another ticket? This PR is mainly for the VLSDPA optimization.
CVS-170675 created to follow up this issue.
```cpp
    return attention_mask;
}

ov::Tensor get_cu_seqlens(const std::vector<std::array<size_t, 3>>& reordered_images_grid_thw) {
```
Still not resolved?
Whether to use attention_mask or cu_seqlens depends on the plugin's capability to support VLSDPA. cu_seqlens is preferable to attention_mask in terms of memory efficiency. Why should we keep a redundant attention_mask and then map it to cu_seqlens, which may negatively impact performance too?
My ask is to do this computation in the OV runtime, not in C++. I do not propose replacing cu_seqlens with attention_mask. You can replace the cu_seqlens input with a reordered_images_grid_thw ov::Tensor.
CVS-170675 created to follow up this issue.
Please check accuracy with https://github.com/openvinotoolkit/openvino.genai/tree/master/tools/who_what_benchmark, triggering the new implementation, to ensure it doesn't break accuracy. Otherwise we may end up in a situation like ticket 170624.
I have i9-12900K. Do I have a chance to trigger the new implementation?
Will do
Pull Request Overview
This PR enhances GPU performance for QWen2-VL and QWen2.5-VL models by implementing SDPA (Scaled Dot-Product Attention) optimizations using cumulative sequence lengths instead of attention masks. The optimization is automatically applied when the target device supports it.
- Introduces VLSDPA transformations that replace attention_mask inputs with cu_seqlens and cu_window_seqlens for better GPU performance
- Adds runtime detection to check if the compiled model supports the optimization
- Updates both QWen2-VL and QWen2.5-VL vision embeddings mergers to conditionally use the new input format
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| vl_sdpa_transformations.hpp | Declares functions for requesting and checking VLSDPA transformations |
| vl_sdpa_transformations.cpp | Implements transformation request and detection logic |
| qwen2vl/classes.hpp | Adds cu_seqlens support flag and function declaration |
| qwen2vl/classes.cpp | Implements cu_seqlens generation and conditional input handling for QWen2-VL |
| qwen2_5_vl/classes.cpp | Implements cu_window_seqlens generation and conditional input handling for QWen2.5-VL |
Comments suppressed due to low confidence (1)
src/cpp/src/visual_language/qwen2vl/classes.cpp:65
- [nitpick] The variable name 'm_with_cu_seqlens_input' could be more descriptive. Consider renaming to 'm_vlsdpa_optimization_enabled' or 'm_supports_cu_seqlens' to better convey its purpose.
…indow_seqlens (#30909)

Details:
The process is as follows:
1. GenAI provides a special RT_INFO entry to QWen-VL models during compile_model.
2. The plugin detects this entry and the target device capability.
3. The plugin transforms the model input, replacing attention_mask with cu_seqlens.
4. GenAI then performs inference after validating the final model inputs.

Tickets:
- [168519](https://jira.devtools.intel.com/browse/CVS-168519)

Should work along with openvinotoolkit/openvino.genai#2330

Co-authored-by: River.Li <[email protected]>
Co-authored-by: Pawel Raasz <[email protected]>
Co-authored-by: Chen Peter <[email protected]>
Should work along with openvinotoolkit/openvino#30909