[GPU][QWen2-VL][QWen2.5-VL] improve SDPA performance with cu_seqlens and cu_window_seqlens #2330
Conversation
```cpp
std::vector<int32_t> cu_seqlens;
cu_seqlens.push_back(0);
int32_t cumsum = 0;
for (const auto& grid_thw : reordered_images_grid_thw) {
    size_t slice_len = grid_thw.at(1) * grid_thw.at(2);
    for (size_t t = 0; t < grid_thw.at(0); ++t) {
        cumsum += slice_len;
        cu_seqlens.push_back(cumsum);
    }
}

ov::Tensor t_cu_seqlens = ov::Tensor(ov::element::i32, {cu_seqlens.size()});
auto* ptr = static_cast<int32_t*>(t_cu_seqlens.data());
for (size_t n = 0; n < cu_seqlens.size(); n++) {
    ptr[n] = cu_seqlens[n];
}
```
I think we can create an OV sub-graph that computes cu_seqlens from reordered_images_grid_thw and attach it to the main graph. I don't want us to add more computation in C++; I'd rather offload this work to the plugin side.
Hmm, sounds good. However, to convert attention_mask to cu_seqlens, we would still have to compute attention_mask in C++ code, and that costs more memory.
I'm not sure about the memory increase. If it exists, then you have it in your C++ implementation too. Please elaborate on the concern.
Still not resolved?
No, I don't have it in the C++ implementation. The memory of attention_mask is non-negligible in long contexts.
Let's compare the following two choices:
- C++ code get_attention_mask(reordered_images_grid_thw) -> map attention_mask to cu_seqlens as part of the main graph -> use cu_seqlens in the plugin
- C++ code get_cu_seqlens(reordered_images_grid_thw) -> use cu_seqlens in the plugin

Then you see the second choice is preferable. Do you agree?
You don't need to compute attention_mask. You should compute cu_seqlens directly from reordered_images_grid_thw. For this you need to create an ov::Model, compile it, and execute it in the OV runtime. You can also stitch this graph into the main one.
I got it. So it is better to make either get_cu_seqlens or get_attention_mask part of the graph. How about we handle this piece of preprocessing work in another ticket? This PR is mainly for the VLSDPA optimization.
CVS-170675 created to follow up this issue.
```cpp
    return attention_mask;
}

ov::Tensor get_cu_seqlens(const std::vector<std::array<size_t, 3>>& reordered_images_grid_thw) {
```
Still not resolved?
Whether to use attention_mask or cu_seqlens depends on the plugin's capability to support VLSDPA. cu_seqlens is preferable to attention_mask in terms of memory efficiency. Why should we keep a redundant attention_mask and then map it to cu_seqlens, which may negatively impact performance too?
My ask is to do this computation in the OV runtime, not in C++. I do not propose replacing cu_seqlens with attention_mask. You can replace the cu_seqlens input with a reordered_images_grid_thw ov::Tensor.
CVS-170675 created to follow up this issue.
Please check accuracy with https://github.com/openvinotoolkit/openvino.genai/tree/master/tools/who_what_benchmark, triggering the new implementation, to ensure it doesn't break accuracy. Otherwise we may end up in a situation like ticket 170624.
I have i9-12900K. Do I have a chance to trigger the new implementation?
Will do
Pull Request Overview
This PR enhances GPU performance for QWen2-VL and QWen2.5-VL models by implementing SDPA (Scaled Dot-Product Attention) optimizations using cumulative sequence lengths instead of attention masks. The optimization is automatically applied when the target device supports it.
- Introduces VLSDPA transformations that replace attention_mask inputs with cu_seqlens and cu_window_seqlens for better GPU performance
- Adds runtime detection to check if the compiled model supports the optimization
- Updates both QWen2-VL and QWen2.5-VL vision embeddings mergers to conditionally use the new input format
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| vl_sdpa_transformations.hpp | Declares functions for requesting and checking VLSDPA transformations |
| vl_sdpa_transformations.cpp | Implements transformation request and detection logic |
| qwen2vl/classes.hpp | Adds cu_seqlens support flag and function declaration |
| qwen2vl/classes.cpp | Implements cu_seqlens generation and conditional input handling for QWen2-VL |
| qwen2_5_vl/classes.cpp | Implements cu_window_seqlens generation and conditional input handling for QWen2.5-VL |
Comments suppressed due to low confidence (1)
src/cpp/src/visual_language/qwen2vl/classes.cpp:65
- [nitpick] The variable name 'm_with_cu_seqlens_input' could be more descriptive. Consider renaming to 'm_vlsdpa_optimization_enabled' or 'm_supports_cu_seqlens' to better convey its purpose.
…indow_seqlens (#30909)

Details:
The process is as follows:
1. GenAI provides a special RT_INFO entry to QWen-VL models during compile_model.
2. The plugin detects this entry and the target device capability.
3. The plugin transforms the model input, replacing attention_mask with cu_seqlens.
4. GenAI then performs inference after validating the final model inputs.

Tickets:
- [168519](https://jira.devtools.intel.com/browse/CVS-168519)

Should work along with openvinotoolkit/openvino.genai#2330

Co-authored-by: River.Li <[email protected]>
Co-authored-by: Pawel Raasz <[email protected]>
Co-authored-by: Chen Peter <[email protected]>
Should work along with openvinotoolkit/openvino#30909