Draft
Changes from all commits
55 commits
3afb35b
[POC] implement cdpruner for qwen2.5-vl
liangali Aug 1, 2025
38879b5
Enhance CDPruner and RelevanceCalculator to support negative relevanc…
yangwang201911 Aug 5, 2025
5bedef4
Update CDPruner configuration to enable negative relevance for CLIP-b…
yangwang201911 Aug 5, 2025
c81af98
Add support for subgraph in CDPruner and ConditionalKernelBuilder
yangwang201911 Aug 6, 2025
4c7e1c0
Update L2 normalization function
yangwang201911 Aug 6, 2025
5c1b678
Skip updating marginal gains for already selected tokens in FastGreed…
yangwang201911 Aug 6, 2025
4ac2a1c
Enhance ConditionalKernelBuilder to precompile models and create infe…
yangwang201911 Aug 7, 2025
1d2ff66
Enhance CDPruner and RelevanceCalculator to support negative relevanc…
yangwang201911 Aug 5, 2025
95c243f
Add CDPruner configuration parameters to GenerationConfig
yangwang201911 Aug 13, 2025
221456b
Implement GPU model compilation in constructor.
yangwang201911 Aug 13, 2025
79d7955
Refactor CDPruner configuration: rename debug_mode to pruning_debug_m…
yangwang201911 Aug 14, 2025
79529fa
Enhance CDPruner configuration: add pruning parameters to command-lin…
yangwang201911 Aug 14, 2025
99f55b0
Merge remote-tracking branch 'upstream' into ywang2/enable_cdpruner_c…
yangwang201911 Aug 14, 2025
6a4a332
Refactor CDPruner configuration: remove unused settings and streamlin…
yangwang201911 Aug 14, 2025
5ba4d7d
update format
yangwang201911 Aug 15, 2025
b618463
Merge branch 'master' of https://github.com/openvinotoolkit/openvino.…
yangwang201911 Aug 15, 2025
e63d071
Merge branch 'ywang2/enable_cdpruner_config' into ywang2/vlm-cdpruner
yangwang201911 Aug 15, 2025
ebf1a18
Refactor pruning debug mode checks and enable ops model by default
yangwang201911 Aug 15, 2025
087d1c8
Add logging for CDPruner configuration
yangwang201911 Aug 18, 2025
81fcf68
Add logging for CDPruner configuration settings
yangwang201911 Aug 18, 2025
cc89a26
Rename visual_tokens_percentage to viusal_tokens_retain_percentage ac…
yangwang201911 Aug 18, 2025
05e7e65
Initialize CDPruner with default configuration in VisionEncoder const…
yangwang201911 Aug 19, 2025
0429022
1. Corrected "viusal_tokens_retain_percentage" to "visual_tokens_reta…
yangwang201911 Aug 19, 2025
c1e1f45
Add debug logging for conditional kernel matrix and marginal gains in…
yangwang201911 Aug 19, 2025
2cb1e8f
update.
yangwang201911 Aug 19, 2025
572b251
Merge branch 'master' of https://github.com/openvinotoolkit/openvino.…
yangwang201911 Aug 21, 2025
26b29f7
[visual_language_chat] Add CDPruner options and update usage instruct…
yangwang201911 Aug 22, 2025
c0280eb
Enhance CDPruner functionality with new ops model option and update r…
yangwang201911 Aug 22, 2025
b2f2601
Refactor CDPruner debug output for consistency and clarity in logging
yangwang201911 Aug 22, 2025
9452f2f
Optimize orthogonal vector computation: reduce memory access and impr…
yangwang201911 Aug 26, 2025
4d5b4d3
Refactor update_marginal_gains method: remove unused selected_idx par…
yangwang201911 Aug 28, 2025
96039d0
optimize projection calculations and improve performance
yangwang201911 Aug 28, 2025
acbec3d
Optimize orthogonal vector update: implement manual loop unrolling fo…
yangwang201911 Aug 29, 2025
1777f71
Optimize orthogonal vector update: improve zero-check precision and e…
yangwang201911 Sep 1, 2025
9627723
Optimize orthogonal vector update: remove manual loop unrolling for i…
yangwang201911 Sep 1, 2025
ad90d91
Optimize token selection in CDPruner: split visual tokens into two ha…
yangwang201911 Sep 1, 2025
c85c1a6
Fix NaN handling in update_marginal_gains: improve stability by skipp…
yangwang201911 Sep 3, 2025
f59b04f
Add SIMD optimizations for vector operations in FastGreedyDPP: implem…
yangwang201911 Sep 4, 2025
ffd8dfd
Optimize token selection in CDPruner: implement parallel DPP selectio…
yangwang201911 Sep 5, 2025
abcda0a
Enhance DPP timing in CDPruner: initialize and report DPP selection d…
yangwang201911 Sep 5, 2025
1b6cc0a
Refactor CDPruner and KernelBuilder: streamline device selection for …
yangwang201911 Sep 8, 2025
dd6d9cf
Add parallel DPP selection functionality in CDPruner: implement helpe…
yangwang201911 Sep 9, 2025
867adaf
update.
yangwang201911 Sep 9, 2025
4bd5489
apply pruner for multi images
zhaixuejun1993 Sep 18, 2025
4e3f5ea
Add OpenCL support for CDPruner DPP acceleration
yangwang201911 Sep 11, 2025
e824a43
Refactor CDPruner and FastGreedyDPP: remove create_pruning_mask and r…
yangwang201911 Sep 18, 2025
9f60871
Refactor CDPruner integration: remove enable_pruning argument and str…
yangwang201911 Sep 18, 2025
187ddbc
refactor cl kernel codes
xipingyan Sep 19, 2025
ff515e5
reduce one param
xipingyan Sep 19, 2025
aa09366
Merge branch 'ywang2/vlm-cdpruner' into xuejun/pruner_multi_inputs
yangwang201911 Sep 19, 2025
f5811a7
Merge pull request #1 from zhaixuejun1993/xuejun/pruner_multi_inputs
yangwang201911 Sep 19, 2025
14e09b8
Enhance CDPruner functionality: add support for multi-frame pruning, …
yangwang201911 Sep 19, 2025
93310c4
Merge branch 'master' into ywang2/vlm-cdpruner
yangwang201911 Sep 19, 2025
e31689d
Update default pruning_ratio to 0 in CDPruner configuration and bench…
yangwang201911 Sep 19, 2025
aa46bc0
Merge pull request #2 from xipingyan/xp/cdpruner_refactor_kernel
yangwang201911 Sep 19, 2025
22 changes: 20 additions & 2 deletions samples/cpp/visual_language_chat/benchmark_vlm.cpp
@@ -20,6 +20,8 @@ int main(int argc, char* argv[]) try {
("n,num_iter", "Number of iterations", cxxopts::value<size_t>()->default_value(std::to_string(3)))
("mt,max_new_tokens", "Maximal number of new tokens", cxxopts::value<size_t>()->default_value(std::to_string(20)))
("d,device", "device", cxxopts::value<std::string>()->default_value("CPU"))
("pr,pruning_ratio", "Percentage of visual tokens to prune when CDPruner is enabled", cxxopts::value<size_t>()->default_value("50"))
("pdm,pruning_debug_mode", "Enable pruning debug mode", cxxopts::value<bool>()->default_value("false"))
("h,help", "Print usage");

cxxopts::ParseResult result;
@@ -57,18 +59,34 @@
std::string device = result["device"].as<std::string>();
size_t num_warmup = result["num_warmup"].as<size_t>();
size_t num_iter = result["num_iter"].as<size_t>();
size_t pruning_ratio = result["pruning_ratio"].as<size_t>();
bool pruning_debug_mode = result["pruning_debug_mode"].as<bool>();
std::vector<ov::Tensor> images = utils::load_images(image_path);

ov::genai::GenerationConfig config;
config.max_new_tokens = result["max_new_tokens"].as<size_t>();
config.ignore_eos = true;

config.pruning_ratio = pruning_ratio;
// Configure CDPruner if requested
if (pruning_ratio > 0 && pruning_ratio < 100) {
std::cout << "[CDPruner] Enabling CDPruner with pruning ratio " << pruning_ratio << "% visual tokens" << std::endl;
config.pruning_debug_mode = pruning_debug_mode;
}

std::cout << ov::get_openvino_version() << std::endl;

// Setup cache configuration for CDPruner if needed
ov::AnyMap properties = {};
if (pruning_ratio > 0 && pruning_ratio < 100) {
properties.insert({"ATTENTION_BACKEND", "PA"});
std::cout << "[CDPruner] Setting ATTENTION_BACKEND to PA for CDPruner" << std::endl;
}

std::unique_ptr<ov::genai::VLMPipeline> pipe;
if (device == "NPU")
if (device == "NPU") {
pipe = std::make_unique<ov::genai::VLMPipeline>(models_path, device);
else {
} else {
// Setting of Scheduler config will trigger usage of ContinuousBatching pipeline, which is not default for Qwen2VL, Qwen2.5VL, Gemma3 due to accuracy issues.
ov::genai::SchedulerConfig scheduler_config;
scheduler_config.enable_prefix_caching = false;
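Taken together, the benchmark changes boil down to a small configuration flow: choose a pruning ratio, request the PagedAttention backend, and pass the pruning fields through `GenerationConfig`. Below is a minimal standalone sketch of that flow; note that `pruning_ratio` and `pruning_debug_mode` only exist with this PR applied, and the model path, image tensor and prompt are placeholders rather than values from the diff.

```cpp
// Minimal sketch of the configuration flow added to benchmark_vlm.cpp.
// Assumes this PR's GenerationConfig fields are available; the model path,
// image tensor and prompt are placeholders.
#include "openvino/genai/visual_language/pipeline.hpp"

#include <iostream>
#include <string>

int main() {
    const std::string models_path = "./Qwen2.5-VL-3B-Instruct";  // placeholder
    const std::string device = "GPU";
    const size_t pruning_ratio = 50;  // prune 50% of visual tokens

    // CDPruner relies on the PagedAttention backend, mirroring the sample.
    ov::AnyMap properties;
    if (pruning_ratio > 0 && pruning_ratio < 100) {
        properties.insert({"ATTENTION_BACKEND", "PA"});
    }

    ov::genai::VLMPipeline pipe(models_path, device, properties);

    ov::genai::GenerationConfig config;
    config.max_new_tokens = 20;
    config.pruning_ratio = pruning_ratio;     // new field from this PR
    config.pruning_debug_mode = false;        // new field from this PR

    // Placeholder image; the real sample loads images with its load_image utility.
    ov::Tensor image(ov::element::u8, {1, 448, 448, 3});

    auto result = pipe.generate("Describe the image.",
                                ov::genai::image(image),
                                ov::genai::generation_config(config));
    std::cout << result.texts[0] << std::endl;
    return 0;
}
```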
32 changes: 24 additions & 8 deletions samples/cpp/visual_language_chat/visual_language_chat.cpp
@@ -1,4 +1,4 @@
// Copyright (C) 2024 Intel Corporation
// Copyright (C) 2024 Intel Corporation
// SPDX-License-Identifier: Apache-2.0

#include "load_image.hpp"
@@ -10,25 +10,41 @@ bool print_subword(std::string&& subword) {
}

int main(int argc, char* argv[]) try {
if (argc < 3 || argc > 4) {
throw std::runtime_error(std::string{"Usage "} + argv[0] + " <MODEL_DIR> <IMAGE_FILE OR DIR_WITH_IMAGES> <DEVICE>");
if (3 > argc || argc > 6) {
throw std::runtime_error(std::string{"Usage: "} + argv[0] + " <MODEL_DIR> <IMAGE_FILE> [<DEVICE>] [<PRUNING_RATIO>] [<PRUNING_DEBUG_MODE>]");
}

std::vector<ov::Tensor> rgbs = utils::load_images(argv[2]);
std::string model_dir = argv[1];
std::string image_file = argv[2];
std::string device = argc > 3 ? argv[3] : "CPU";
size_t pruning_ratio = argc > 4 ? std::stoul(argv[4]) : 0; // 0 means disabled
bool pruning_debug_mode = argc > 5 ? (std::string(argv[5]) == "true" || std::string(argv[5]) == "1") : false;

std::vector<ov::Tensor> rgbs = utils::load_images(image_file);

// GPU and NPU can be used as well.
// Note: If NPU is selected, only language model will be run on NPU
std::string device = (argc == 4) ? argv[3] : "CPU";
ov::AnyMap enable_compile_cache;
if (device == "GPU") {
// Cache compiled models on disk for GPU to save time on the
// next run. It's not beneficial for CPU.
enable_compile_cache.insert({ov::cache_dir("vlm_cache")});
}
ov::genai::VLMPipeline pipe(argv[1], device, enable_compile_cache);

if (pruning_ratio > 0) {
enable_compile_cache.insert({"ATTENTION_BACKEND", "PA"});
std::cout << "[CDPruner] Setting ATTENTION_BACKEND to PA" << std::endl;
}

// Initialize VLMPipeline with cache configuration if needed
ov::genai::VLMPipeline pipe(model_dir, device, enable_compile_cache);

ov::genai::GenerationConfig generation_config;
generation_config.max_new_tokens = 100;
generation_config.pruning_ratio = pruning_ratio;
// Configure CDPruner if requested
if (pruning_ratio > 0) {
std::cout << "[CDPruner] Enabling CDPruner with " << pruning_ratio << "% visual token pruning" << std::endl;
generation_config.pruning_debug_mode = pruning_debug_mode;
}

std::string prompt;

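One detail worth spelling out from the new command-line argument: `pruning_ratio` is the share of visual tokens to drop, not to keep, and 0 disables pruning entirely. A tiny illustration of that mapping follows; the exact rounding inside CDPruner may differ.

```cpp
#include <cstddef>
#include <iostream>

// Illustrative only: maps a pruning ratio (percentage of visual tokens to drop)
// to the number of tokens kept. CDPruner's internal rounding may differ.
std::size_t retained_visual_tokens(std::size_t total_tokens, std::size_t pruning_ratio) {
    return total_tokens * (100 - pruning_ratio) / 100;
}

int main() {
    std::cout << retained_visual_tokens(1024, 50) << "\n";  // 512 tokens kept
    std::cout << retained_visual_tokens(1024, 0)  << "\n";  // 1024: pruning disabled
    return 0;
}
```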
11 changes: 11 additions & 0 deletions samples/python/visual_language_chat/benchmark_vlm.py
@@ -42,6 +42,9 @@ def main():
parser.add_argument("-n", "--num_iter", type=int, default=2, help="Number of iterations")
parser.add_argument("-mt", "--max_new_tokens", type=int, default=20, help="Maximal number of new tokens")
parser.add_argument("-d", "--device", type=str, default="CPU", help="Device")
parser.add_argument("--pruning_ratio", type=int, default=0, help="Percentage of visual tokens to prune (0 to disable)")
parser.add_argument("--pruning_debug_mode", action="store_true", help="Enable debugging mode for pruning")
parser.add_argument("--relevance_weight", type=float, help="Relevance weight for the model")

args = parser.parse_args()

@@ -68,6 +71,14 @@

config = ov_genai.GenerationConfig()
config.max_new_tokens = args.max_new_tokens
config.pruning_ratio = args.pruning_ratio if args.pruning_ratio is not None else 0
print(f'CDPruner config: Pruning ratio - {config.pruning_ratio}% (0 means disabled)')
if config.pruning_ratio > 0:
if args.relevance_weight is not None:
config.relevance_weight = args.relevance_weight
if args.pruning_debug_mode:
config.pruning_debug_mode = args.pruning_debug_mode
print(f'CDPruner config: Pruning debug mode - {config.pruning_debug_mode}')

if device == "NPU":
pipe = ov_genai.VLMPipeline(models_path, device)
34 changes: 34 additions & 0 deletions src/cpp/CMakeLists.txt
@@ -9,6 +9,19 @@ list(APPEND SOURCE_FILES "${CMAKE_CURRENT_BINARY_DIR}/version.cpp")

include(FetchContent)

# OpenCL support for CDPruner DPP acceleration - reuse OpenVINO's ENABLE_SYSTEM_OPENCL
if(ENABLE_SYSTEM_OPENCL)
# Try to find OpenCL since ENABLE_SYSTEM_OPENCL is ON
find_package(OpenCL QUIET)
if(TARGET OpenCL::OpenCL)
message(STATUS "OpenCL found via OpenVINO configuration - enabling CDPruner DPP acceleration")
else()
message(STATUS "ENABLE_SYSTEM_OPENCL is ON but OpenCL::OpenCL target not found - CDPruner will use CPU-only DPP")
endif()
else()
message(STATUS "ENABLE_SYSTEM_OPENCL is OFF - CDPruner will use CPU-only DPP implementation")
endif()

if(NOT TARGET nlohmann_json)
FetchContent_Declare(nlohmann_json
URL https://github.com/nlohmann/json/archive/refs/tags/v3.11.3.tar.gz
@@ -143,6 +156,12 @@ target_include_directories(${TARGET_NAME_OBJ} SYSTEM PRIVATE "${safetensors.h_SO

target_link_libraries(${TARGET_NAME_OBJ} PRIVATE openvino::runtime openvino::threading nlohmann_json::nlohmann_json minja)

# Add OpenCL support if enabled via OpenVINO configuration
if(ENABLE_SYSTEM_OPENCL AND TARGET OpenCL::OpenCL)
target_compile_definitions(${TARGET_NAME_OBJ} PRIVATE ENABLE_OPENCL_DPP)
target_link_libraries(${TARGET_NAME_OBJ} PRIVATE OpenCL::OpenCL)
endif()

target_compile_features(${TARGET_NAME_OBJ} PRIVATE cxx_std_17)

target_compile_definitions(${TARGET_NAME_OBJ} PRIVATE openvino_genai_EXPORTS)
@@ -152,6 +171,16 @@ if(MSVC)
target_compile_options(${TARGET_NAME_OBJ} PRIVATE "/bigobj")
endif()

# Add native CPU optimization for SIMD instructions
if(CMAKE_CXX_COMPILER_ID MATCHES "GNU|Clang")
# Force AVX2 only (disable AVX512)
target_compile_options(${TARGET_NAME_OBJ} PRIVATE "-mavx2" "-mno-avx512f")
elseif(MSVC)
if(CMAKE_SYSTEM_PROCESSOR MATCHES "AMD64|x86_64")
target_compile_options(${TARGET_NAME_OBJ} PRIVATE "/arch:AVX2")
endif()
endif()

set_target_properties(${TARGET_NAME_OBJ} PROPERTIES POSITION_INDEPENDENT_CODE ON)

# Shared library
@@ -169,6 +198,11 @@ target_include_directories(${TARGET_NAME} INTERFACE "$<INSTALL_INTERFACE:runtime

target_link_libraries(${TARGET_NAME} PUBLIC openvino::runtime PRIVATE openvino::threading nlohmann_json::nlohmann_json minja ${CMAKE_DL_LIBS})

# Add OpenCL support if enabled via OpenVINO configuration
if(ENABLE_SYSTEM_OPENCL AND TARGET OpenCL::OpenCL)
target_link_libraries(${TARGET_NAME} PRIVATE OpenCL::OpenCL)
endif()

if(ENABLE_XGRAMMAR)
target_link_libraries(${TARGET_NAME} PRIVATE xgrammar)
endif()
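The build logic above only defines `ENABLE_OPENCL_DPP` when OpenCL is found through OpenVINO's `ENABLE_SYSTEM_OPENCL` switch, so the runtime code has to keep a CPU-only fallback. The sketch below shows how such a compile-time guard is typically consumed; the function name is hypothetical and the PR's actual dispatch logic in the CDPruner sources may differ.

```cpp
// Hypothetical helper showing how the ENABLE_OPENCL_DPP definition from the
// CMake changes above could gate an OpenCL DPP path with a CPU fallback.
// Not the PR's actual dispatcher.
#ifdef ENABLE_OPENCL_DPP
#include <CL/cl.h>
#endif

bool opencl_dpp_available() {
#ifdef ENABLE_OPENCL_DPP
    // Built with OpenCL support: check that at least one platform is present.
    cl_uint num_platforms = 0;
    return clGetPlatformIDs(0, nullptr, &num_platforms) == CL_SUCCESS && num_platforms > 0;
#else
    // Built without OpenCL: always take the CPU-only DPP implementation.
    return false;
#endif
}
```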
15 changes: 15 additions & 0 deletions src/cpp/include/openvino/genai/generation_config.hpp
@@ -271,6 +271,11 @@ operator|(const StructuredOutputConfig::CompoundGrammar& lhs,
* @param top_k the number of highest probability vocabulary tokens to keep for top-k-filtering.
* @param rng_seed initializes random generator.
*
* CDPruner configuration:
* @param pruning_ratio the percentage of visual tokens to prune (0-100). Set to 0 to disable pruning.
* @param relevance_weight the weight of relevance for visual tokens.
* @param pruning_debug_mode whether to enable pruning debug mode.
*
* Assisting generation parameters:
* @param assistant_confidence_threshold the lower token probability of candidate to be validated by main model in case of dynamic strategy candidates number update.
* @param num_assistant_tokens the defined candidates number to be generated by draft model/prompt lookup in case of static strategy candidates number update.
@@ -321,6 +326,11 @@ class OPENVINO_GENAI_EXPORTS GenerationConfig {
bool do_sample = false;
size_t rng_seed = 0;

// CDPruner config
size_t pruning_ratio = 0; // 0 means disabled, 1-100 means percentage to prune
float relevance_weight = 0.5f;
bool pruning_debug_mode = false;

// Assisting generation parameters
float assistant_confidence_threshold = 0.f;
size_t num_assistant_tokens = 0;
@@ -392,6 +402,11 @@ static constexpr ov::Property<float> repetition_penalty{"repetition_penalty"};
static constexpr ov::Property<int64_t> eos_token_id{"eos_token_id"};
static constexpr ov::Property<float> presence_penalty{"presence_penalty"};
static constexpr ov::Property<float> frequency_penalty{"frequency_penalty"};

static constexpr ov::Property<size_t> pruning_ratio{"pruning_ratio"};
static constexpr ov::Property<float> relevance_weight{"relevance_weight"};
static constexpr ov::Property<bool> pruning_debug_mode{"pruning_debug_mode"};

extern OPENVINO_GENAI_EXPORTS ov::Property<size_t> rng_seed;

static constexpr ov::Property<float> assistant_confidence_threshold{"assistant_confidence_threshold"};
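Besides setting the fields on `GenerationConfig` directly (as the samples above do), the new `ov::Property` constants suggest the usual property-style configuration should also work, assuming they are wired into `GenerationConfig`'s property update path the same way the existing properties are; the hunks shown here do not include that wiring, so treat the following as a sketch.

```cpp
// Sketch only: assumes the new pruning_ratio / relevance_weight /
// pruning_debug_mode properties are handled by GenerationConfig's property
// parsing, like max_new_tokens and the other existing properties.
#include "openvino/genai/visual_language/pipeline.hpp"

void generate_with_pruning(ov::genai::VLMPipeline& pipe, const ov::Tensor& image) {
    auto result = pipe.generate("Describe the image.",
                                ov::genai::image(image),
                                ov::genai::max_new_tokens(100),
                                ov::genai::pruning_ratio(static_cast<size_t>(30)),  // drop 30% of visual tokens
                                ov::genai::relevance_weight(0.5f),
                                ov::genai::pruning_debug_mode(false));
    (void)result;
}
```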
6 changes: 6 additions & 0 deletions src/cpp/src/continuous_batching/pipeline_base.cpp
@@ -164,6 +164,12 @@ ContinuousBatchingPipeline::IContinuousBatchingPipeline::generate(
std::vector<VLMPerfMetrics> vlm_perf_metrics(prompts.size());
std::vector<EncodedImage> encoded_images = {};

const auto& generation_config = sampling_params[0];
// Set visual token pruning configuration
m_inputs_embedder->set_visual_token_pruning_config(generation_config.pruning_ratio,
generation_config.relevance_weight,
generation_config.pruning_debug_mode);

if (m_is_chat_conversation) {
OPENVINO_ASSERT(1 == prompts.size(), "Can't chat with multiple prompts");
const auto& rgbs = rgbs_vector[0];
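The pruning configuration forwarded here ultimately drives the token selection the commit history calls FastGreedyDPP: build a conditional kernel over the visual tokens, then greedily pick the subset with the largest determinant. Below is a compact, single-threaded sketch of that greedy MAP step (the fast greedy DPP algorithm of Chen et al., 2018, which CDPruner builds on); the PR's implementation adds NaN handling, SIMD, split/parallel selection and an OpenCL path that are omitted here.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Sketch of fast greedy MAP inference for a DPP over an n x n kernel
// (row-major), the selection step the FastGreedyDPP commits refer to.
// Returns indices of k tokens chosen to maximize diversity under the kernel.
// Illustrative only; not the PR's optimized implementation.
std::vector<std::size_t> fast_greedy_dpp(const std::vector<double>& kernel,
                                         std::size_t n, std::size_t k) {
    std::vector<double> gains(n);               // d_i: current marginal gain of token i
    std::vector<std::vector<double>> ortho(n);  // c_i: orthogonalized components per token
    for (std::size_t i = 0; i < n; ++i)
        gains[i] = kernel[i * n + i];

    std::vector<std::size_t> selected;
    std::vector<char> used(n, 0);
    while (selected.size() < k) {
        // Pick the unselected token with the largest marginal gain.
        std::size_t best = n;
        double best_gain = -std::numeric_limits<double>::infinity();
        for (std::size_t i = 0; i < n; ++i)
            if (!used[i] && gains[i] > best_gain) { best = i; best_gain = gains[i]; }
        if (best == n || best_gain <= 0.0)
            break;  // gains numerically exhausted
        used[best] = 1;
        selected.push_back(best);

        // Update the remaining tokens' orthogonal components and marginal gains.
        const double denom = std::sqrt(best_gain);
        for (std::size_t i = 0; i < n; ++i) {
            if (used[i]) continue;
            double dot = 0.0;
            for (std::size_t t = 0; t < ortho[best].size(); ++t)
                dot += ortho[best][t] * ortho[i][t];
            const double e = (kernel[best * n + i] - dot) / denom;
            ortho[i].push_back(e);
            gains[i] -= e * e;
        }
    }
    return selected;
}
```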