Conversation

ceciliapeng2011
Contributor

@ceciliapeng2011 ceciliapeng2011 commented Jun 10, 2025

Details:

The process works as follows:

  1. GenAI provides a special RT_INFO entry to QWen-VL models during compile_model.
  2. The plugin detects this entry and the target device capability.
  3. The plugin transforms the model input, replacing attention_mask with cu_seqlens.
  4. GenAI then performs inference after validating the final model inputs.

Tickets:

Should work along with openvinotoolkit/openvino#30909


@Wovchena Wovchena requested a review from rkazants July 9, 2025 13:07
Comment on lines +325 to +340
std::vector<int32_t> cu_seqlens;
cu_seqlens.push_back(0);
int32_t cumsum = 0;
for (const auto& grid_thw : reordered_images_grid_thw) {
    // Tokens per temporal slice: height * width of the patch grid.
    int32_t slice_len = static_cast<int32_t>(grid_thw.at(1) * grid_thw.at(2));
    for (size_t t = 0; t < grid_thw.at(0); ++t) {
        cumsum += slice_len;
        cu_seqlens.push_back(cumsum);
    }
}

ov::Tensor t_cu_seqlens(ov::element::i32, {cu_seqlens.size()});
auto* ptr = t_cu_seqlens.data<int32_t>();
for (size_t n = 0; n < cu_seqlens.size(); ++n) {
    ptr[n] = cu_seqlens[n];
}
Collaborator

I think we can create an OV sub-graph for the computation of cu_seqlens from reordered_images_grid_thw and attach it to the main graph. I don't want us to have more computation in C++; rather, we should offload this work to the plugin side.

Contributor Author

@ceciliapeng2011 ceciliapeng2011 Jul 14, 2025


Hmm, sounds good. However, to convert attention_mask to cu_seqlens, we still have to compute attention_mask in C++ code, which costs more memory.

Collaborator


I'm not sure about the memory increase. If you have it, then you have it in your C++ implementation too.
Please elaborate on the concern.

Collaborator


Still not resolved?

Contributor Author


No, I don't have it in the C++ implementation. The memory of "attention_mask" is non-negligible for long contexts.

Contributor Author

@ceciliapeng2011 ceciliapeng2011 Jul 14, 2025


Let's compare the following two choices:

  1. C++ code for "get_attention_mask(reordered_images_grid_thw)" -> map attention_mask to cu_seqlens as part of the main graph -> use cu_seqlens in the plugin
  2. C++ code for "get_cu_seqlens(reordered_images_grid_thw)" -> use cu_seqlens in the plugin

Then you can see that the second choice is preferable. Do you agree?

Collaborator


You don't need to compute attention_mask. You should compute cu_seqlens directly from reordered_images_grid_thw.
For this you need to create an ov::Model, compile it, and execute it in the OV runtime.
You can also stitch this graph into the main one.

Contributor Author

@ceciliapeng2011 ceciliapeng2011 Jul 15, 2025


I got it. So it is better to make either get_cu_seqlens or get_attention_mask part of the graph. How about we handle this piece of preprocessing work in another ticket? This PR is mainly for the VLSDPA optimization.

Contributor Author


CVS-170675 created to follow up this issue.

@ceciliapeng2011 ceciliapeng2011 requested a review from rkazants July 14, 2025 08:41
@ceciliapeng2011 ceciliapeng2011 requested a review from Wovchena July 14, 2025 08:41
return attention_mask;
}

ov::Tensor get_cu_seqlens(const std::vector<std::array<size_t, 3>>& reordered_images_grid_thw) {
Collaborator


Still not resolved?

Contributor Author

@ceciliapeng2011 ceciliapeng2011 Jul 14, 2025


Whether we use attention_mask or cu_seqlens depends on the plugin's capability to support VLSDPA. "cu_seqlens" is preferable to "attention_mask" in terms of memory efficiency. Why keep a redundant "attention_mask" and then map it to "cu_seqlens", which may hurt performance too?

Collaborator

@rkazants rkazants Jul 14, 2025


My ask is to do this computation in the OV runtime rather than in C++. I do not propose replacing cu_seqlens with attention_mask. You can replace cu_seqlens with a reordered_images_grid_thw ov::Tensor.

Contributor Author


CVS-170675 created to follow up this issue.

@ceciliapeng2011 ceciliapeng2011 requested a review from rkazants July 15, 2025 09:04
Collaborator

@Wovchena Wovchena left a comment


Please check accuracy with https://github.com/openvinotoolkit/openvino.genai/tree/master/tools/who_what_benchmark, triggering the new implementation to ensure it doesn't break accuracy. Otherwise we may end up in a situation like ticket 170624.

I have i9-12900K. Do I have a chance to trigger the new implementation?

@ceciliapeng2011
Contributor Author

Please check accuracy with https://github.com/openvinotoolkit/openvino.genai/tree/master/tools/who_what_benchmark, triggering the new implementation to ensure it doesn't break accuracy. Otherwise we may end up in a situation like ticket 170624.

I have i9-12900K. Do I have a chance to trigger the new implementation?

Will do

@ceciliapeng2011 ceciliapeng2011 requested a review from Wovchena July 31, 2025 10:45
@Wovchena Wovchena requested a review from Copilot July 31, 2025 10:54
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR enhances GPU performance for QWen2-VL and QWen2.5-VL models by implementing SDPA (Scaled Dot-Product Attention) optimizations using cumulative sequence lengths instead of attention masks. The optimization is automatically applied when the target device supports it.

  • Introduces VLSDPA transformations that replace attention_mask inputs with cu_seqlens and cu_window_seqlens for better GPU performance
  • Adds runtime detection to check if the compiled model supports the optimization
  • Updates both QWen2-VL and QWen2.5-VL vision embeddings mergers to conditionally use the new input format

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
vl_sdpa_transformations.hpp Declares functions for requesting and checking VLSDPA transformations
vl_sdpa_transformations.cpp Implements transformation request and detection logic
qwen2vl/classes.hpp Adds cu_seqlens support flag and function declaration
qwen2vl/classes.cpp Implements cu_seqlens generation and conditional input handling for QWen2-VL
qwen2_5_vl/classes.cpp Implements cu_window_seqlens generation and conditional input handling for QWen2.5-VL
Comments suppressed due to low confidence (1)

src/cpp/src/visual_language/qwen2vl/classes.cpp:65

  • [nitpick] The variable name 'm_with_cu_seqlens_input' could be more descriptive. Consider renaming to 'm_vlsdpa_optimization_enabled' or 'm_supports_cu_seqlens' to better convey its purpose.
        spatial_merge_size,          

@Wovchena Wovchena added this pull request to the merge queue Aug 1, 2025
Merged via the queue into openvinotoolkit:master with commit 759fa08 Aug 1, 2025
86 checks passed
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this pull request Aug 12, 2025
…indow_seqlens (#30909)

### Details:
The process works as follows:

1. GenAI provides a special RT_INFO entry to QWen-VL models during compile_model.
2. The plugin detects this entry and the target device capability.
3. The plugin transforms the model input, replacing attention_mask with cu_seqlens.
4. GenAI then performs inference after validating the final model inputs.


### Tickets:
 - *[168519](https://jira.devtools.intel.com/browse/CVS-168519)*

Should work along with -
openvinotoolkit/openvino.genai#2330

---------

Co-authored-by: River.Li <[email protected]>
Co-authored-by: Pawel Raasz <[email protected]>
Co-authored-by: Chen Peter <[email protected]>
Labels
category: visual language Visual language pipeline