What's Changed
- Bump product version to 2025.3 by @akladiev in #2255
- Implement SnapKV by @vshampor in #2067
- [WWB] Additional processing of native phi4mm by @nikita-savelyevv in #2276
- Update ov genai version in samples by @as-suvorov in #2275
- use chat templates in vlm by @eaidova in #2279
- Fix 'Unsupported property' failure when prompt_lookup is set to False by @sbalandi in #2240
- Force the PA implementation in the llm-bench by default by @sbalandi in #2271
- Update Whisper README.md as "--disable-stateful" is no longer required to export models for NPU by @luke-lin-vmc in #2249
- Removed 'slices' from EncodedImage by @popovaan in #2258
- support text embeddings in llm_bench by @eaidova in #2269
- [wwb]: load transformers model first, and only then use trust_remote_code by @eaidova in #2270
- [GHA] Coverity pipeline fixes by @mryzhov in #2283
- [GHA][DEV] Fixed coverity path creation by @mryzhov in #2285
- [GHA][DEV] Save Coverity tool to cache by @mryzhov in #2286
- [GHA][DEV] Set cache key for coverity tool by @mryzhov in #2288
- Image generation multiconcurrency (#2190) by @dkalinowski in #2284
- [GGUF] Support GGUF format for tokenizers and detokenizers by @rkazants in #2263
- Unskip whisper tests & update optimum-intel by @as-suvorov in #2247
- Update README.md with text-to-speech by @rkazants in #2294
- add new chat template for qwen3 by @eaidova in #2297
- [DOCS] Correct cmd-line for TTS conversion by @rkazants in #2303
- [GHA] Enabled product manifest.yml by @mryzhov in #2281
- Bump the npm_and_yarn group across 3 directories with 2 updates by @dependabot[bot] in #2309
- [GHA] Save artifacts to cloud share by @akladiev in #1943
- [GHA][COVERITY] Added manual trigger by @mryzhov in #2289
- [GHA] Fix missing condition for Extract Artifacts step by @akladiev in #2313
- [llm bench] Turn off PA backend for VLM by @sbalandi in #2312
- [llm_bench] Add setting of max_num_batched_tokens for SchedulerConfig by @sbalandi in #2316
- [GHA] Fix missing condition for LLM & VLM test by @sammysun0711 in #2326
- [Test] Skip gguf test on MacOS due to sporadic failure by @sammysun0711 in #2328
- [GGUF] support Qwen3 architecture by @TianmengChen in #2273
- [llm_bench] Increase max_num_batched_tokens to the largest positive integer by @sbalandi in #2327
- Bump aquasecurity/trivy-action from 0.30.0 to 0.31.0 by @dependabot[bot] in #2310
- Bump actions/download-artifact from 4.1.9 to 4.3.0 by @dependabot[bot] in #2315
- Fix system_message forwarding by @Wovchena in #2325
- Disabled cropping of the prompt for minicpmv by @andreyanufr in #2320
- [llm bench] Fix setting ATTENTION_BACKEND to plugin config in case of fallback to Optimum by @sbalandi in #2332
- Bump brace-expansion from 2.0.1 to 2.0.2 in /.github/actions/install_wheel in the npm_and_yarn group across 1 directory by @dependabot[bot] in #2338
- Fix multinomial sampling for PromptLookupDecoding by @sbalandi in #2331
- [llm bench] Avoid using parameters not supported by beam_search in the beam_search case by @sbalandi in #2336
- Update Export Requirements by @apaniukov in #2342
- [GGUF] Serialize Generated OV Model for Faster LLMPipeline Init by @sammysun0711 in #2218
- Fixed system message in chat mode. by @popovaan in #2343
- Bump librosa from 0.10.2.post1 to 0.11.0 in /samples by @dependabot[bot] in #2346
- [Test][GGUF] Add DeepSeek-R1-Distill-Qwen GGUF in CI by @sammysun0711 in #2329
- [llm_bench] Remove default scheduler config by @sbalandi in #2341
- master: add Phi-4-multimodal-instruct by @Wovchena in #2264
- Fix paths with unicode for tokenizers by @yatarkan in #2337
- [WWB] Add try-except block for processor loading by @nikita-savelyevv in #2352
- [WWB] Bring back eager attention implementation by default by @nikita-savelyevv in #2353
- fix supported models link in TTS samples by @eaidova in #2300
- StaticLLMPipeline: Add tests on caching by @smirnov-alexey in #1905
- [WWB] Fix loading the tokenizer for VLMs by @l-bat in #2351
- Pass Scheduler Config for VLM Pipeline in WhoWhatBenchmark. by @popovaan in #2318
- Remove misspelled CMAKE_CURRENT_SOUCE_DIR by @Wovchena in #2362
- Increase timeout for LLM & VLM by @Wovchena in #2359
- [llm bench] Fix setting ATTENTION_BACKEND to plugin config in case of fallback to Optimum for VLM by @sbalandi in #2361
- Support multiple images for VLM benchmarking in samples and llm_bench by @wgzintel in #2197
- CB: Hetero pipeline parallel support by @WeldonWangwang in #2227
- Update conversion instructions by @adrianboguszewski in #2287
- Merge stderr from failed samples by @Wovchena in #2156
- Revert cache folder by @Wovchena in #2372
- Update README in Node.js API by @almilosz in #2374
- [Docs] Rework home page by @yatarkan in #2368
- Align PromptLookupDecoding with greedy when dynamic_split_fuse works by @sbalandi in #2360
- Support collecting latency for transformers v4.52.0 by @wgzintel in #2373
- Bump diffusers from 0.33.1 to 0.34.0 in /samples by @dependabot[bot] in #2381
- Bump diffusers from 0.33.1 to 0.34.0 in /tests/python_tests by @dependabot[bot] in #2380
- Structured Output generation with XGrammar by @pavel-esir in #2295 (see the usage sketch after this list)
- Disable XGrammar on Android by @apaniukov in #2389
- [wwb] Take prompts from different categories for the default dataset for VLM by @sbalandi in #2349
- Fix for cloning NPU Image Generation pipelines (#2376) by @dkalinowski in #2393
- Set add_special_tokens=false for image tags in MiniCPM. by @popovaan in #2404
- Fix missing use cases for inpainting models and defining use case with relative path by @sbalandi in #2387
- Temporarily skip failing whisper tests by @pavel-esir in #2396
- Fix test_vlm_npu_no_exception by @AlexanderKalistratov in #2388
- Bump timm from 1.0.15 to 1.0.16 by @dependabot[bot] in #2390
- Optimize VisionEncoderQwen2VL::encode by @usstq in #2205
- Update calculation of TPOT for cases with unequal batch sizes by @sbalandi in #2377
- Bump pillow from 11.2.1 to 11.3.0 in /samples in the pip group by @dependabot[bot] in #2399
- Bump aquasecurity/trivy-action from 0.31.0 to 0.32.0 by @dependabot[bot] in #2405
- Fixed VLM failure on NPU by @intelgaoxiong in #2407
- Fix debug build for XGrammar on Windows by @pavel-esir in #2413
- Add extended perf metrics for speculative decoding by @sbalandi in #2321
- [JS] Implement TextEmbeddingPipeline in JS API by @Retribution98 in #2385
- Remove nightly test label by @Wovchena in #2408
- Remove NPU from lora by @Wovchena in #2418
- Add performance metrics for StructuredOutput generation by @pavel-esir in #2398
- Bump timm from 1.0.16 to 1.0.17 by @dependabot[bot] in #2420
- Install torchcodec by @Wovchena in #2422
- [CI] [GHA] Add other Python versions to build wheels and Python API by @akashchi in #2384
- [CI][GHA] Increase Python (Cacheopt E2E) timeout for manylinux_2_28 by @sammysun0711 in #2437
- Add attention mask for decoder whisper model by @eshiryae in #2018
- [Docs] Align supported models, remove `SUPPORTED_MODELS.md` by @yatarkan in #2435
- Implement sparse attention prefill by @vshampor in #2299
- Add C++ sample for Structured output generation by @pavel-esir in #2423
- Increase timeout by @Wovchena in #2443
- [llm_bench] Fix typos for 1st token info and names of memory monitoring phase by @sbalandi in #2444
- Cache get_pil_image_by_link() by @Wovchena in #2445
- LoRA lm_head and embed_tokens constants support by @likholat in #2395
- Fixed OV version compatibility by @mryzhov in #2442
- Increase whisper timeout by @Wovchena in #2448
- Temporarily disable failed ImageGeneration and LoRA tests by @likholat in #2450
- Fix SDL tests by @sammysun0711 in #2451
- benchmark_vlm: Fixed VLM failure on NPU by @JeevakaPrabu in #2446
- Add Structural Tags with XGrammar Backend by @apaniukov in #2411
- [Docs] Resolve dependabot issues by @yatarkan in #2447
- Bump the npm_and_yarn group across 1 directory with 2 updates by @dependabot[bot] in #2452
- [JS] Fix error when LLMPipeline.stream() is broken. by @Retribution98 in #2454
- [TTS] Correct perf metrics for the second and subsequent generations by @rkazants in #2463
- Revert "Optimize VisionEncoderQwen2VL::encode (#2205)" by @wgzintel in #2468
- [DOCS] Fix README.md for text generation by @rkazants in #2469
- Update pybind11 to 3.0.0 by @Wovchena in #2476
- LoRA for text2image by @andreyanufr in #2440
- Fix pybind warning by @vshampor in #2456
- Add lock mutex to `StructuredOutputController::get_times()` by @pavel-esir in #2459
- [llm_bench] Remove unused processor for genAI VLM pipeline by @sbalandi in #2460
- Skip NPU failing tests by @Wovchena in #2473
- [llm_bench] Add apply_chat_template for LLM by @sbalandi in #2475
- Fix Coverity by @Wovchena in #2458
- [GGUF] Revert GGUF workaround for GPU by @sammysun0711 in #2392
- Bump timm from 1.0.17 to 1.0.19 by @dependabot[bot] in #2482
- Suppress C4146 warning for xgrammar by @olpipi in #2478
- [VLM] Fix disabling `apply_chat_template` in generation config by @yatarkan in #2484
- [CVS-170914] Disable GHA pipelines in merge queue by @akladiev in #2485
- image_generation: fix argument order by @Wovchena in #2462
- [WIN] Set binaries details by @mryzhov in #2433
- Run finish_chat() before start_chat() in VLM pipeline. by @popovaan in #2470
- Increase VLM timeout by @Wovchena in #2472
- Support non string chat templates by @yatarkan in #2426
- Don't run SDL in merge queue by @Wovchena in #2487
- Add C API for WhisperPipeline by @BrandonWeng in #2414
- Test optimized generation on LongBench with trust_remote_code=true by @as-suvorov in #2492
- xfail image generation-samples by @Wovchena in #2493
- [GHA][CI] Separate GGUF Reader Test by @sammysun0711 in #2488
- wwb: no chat template for VLMs by @Wovchena in #2494
- Add text rerank pipeline by @as-suvorov in #2436 (see the embedding/rerank sketch after this list)
- Add TextEmbeddingPipeline documentation by @as-suvorov in #2474
- Add TTS to llm_bench by @sbalandi in #2438
- Fix generated tokens count in speculative decoding debug metrics by @sbalandi in #2498
- [llm_bench] Add possibility to find stablelm model by model_type by @sbalandi in #2504
- [CI][GHA] Regenerate ov cache in windows by @sammysun0711 in #2509
- [llm_bench] Update logging info for stable-diffusion pipeline by @sbalandi in #2517
- Added clarification regarding the use of memory consumption mode to the help message by @sbalandi in #2499
- Allow building GenAI with free-threaded Python by @p-wysocki in #2515
- Bump langchain-core from 0.3.69 to 0.3.72 in /tests/python_tests by @dependabot[bot] in #2510
- Log why optimum-cli fails by @Wovchena in #2511
- [CI][GHA] Remove nightly tag in test by @sammysun0711 in #2506
- [CI][GHA] Fix RAG sample pytest import error by @sammysun0711 in #2505
- [GPU][QWen2-VL][QWen2.5-VL] improve SDPA performance with cu_seqlens and cu_window_seqlens by @ceciliapeng2011 in #2330
- Enable back NPU import tests by @smirnov-alexey in #2503
- Override load time in wrapper by @Wovchena in #2527
- Add stderr check for retry_request by @as-suvorov in #2528
- LoRA bug fixes by @likholat in #2525
- Enable image generation tests by @likholat in #2531
- WWB Readme fix by @likholat in #2536
- Update tokenizers by @Wovchena in #2524
- Fix creating an image generation pipeline of one type from another type by @dkalinowski in #2530
- [NPU] Fix for whisper fp8 models by @eshiryae in #2539
- Switch to SDPA Qwen2VL and Qwen2.5VL by @popovaan in #2534
- Freeze dependencies by @Wovchena in #2535
- Log wwb errors by @Wovchena in #2546
- update xgrammar and openvino_tokenizers by @pavel-esir in #2465
- Bump actions/download-artifact from 4 to 5 by @dependabot[bot] in #2543
- Add long prompts to wwb for LLMPipeline by @sbalandi in #2547
- Update optimum-intel by @Wovchena in #2556
- Add link to BUILD.md by @Wovchena in #2559
- retry pip install by @Wovchena in #2560
- xfail image-inpainting by @Wovchena in #2562
- Utilize XAttention during prefill by @vshampor in #2489
- Add TextReranker documentation by @as-suvorov in #2500
- benchmark_genai: fix LLM failure on NPU by @Wovchena in #2542
- Set encoding by @Wovchena in #2564
- Bump actions/cache from 4.2.3 to 4.2.4 by @dependabot[bot] in #2574
- Fix AWQ by @Wovchena in #2577
- [OV JS] Implement PerfMetrics for the LLMPipeline by @almilosz in #2545
- Bump langchain-core from 0.3.72 to 0.3.74 in /tests/python_tests by @dependabot[bot] in #2567
- Add minicpmo by @Wovchena in #2579
- Include type into error by @Wovchena in #2580
- Fix GGUF labeling by @Wovchena in #2583
- Test C++ first by @Wovchena in #2584
- [NPU] Support dynamic LoRA by @intelgaoxiong in #2477
- optimum-intel 1.25.1 was just released by @Wovchena in #2582
- Enable test_sample_lora_text2image by @Wovchena in #2578
- LLMPipeline(NPU): Configuration section to fine-tune LM head model by @AsyaPronina in #2317
- Log only if there is error by @Wovchena in #2585
- Enable minja as chat template engine by @yatarkan in #2439
- Extend padding by @pavel-esir in #2512
- [NPUW] Disable chunking on NPU for VLM pipeline by @AlexanderKalistratov in #2595
- Enable gemma3-4b-it in VLM Pipeline by @yangsu2022 in #2340
- [JS] Implement getTokenizer in LLMPipeline by @Retribution98 in #2586
- [JS] Upgrade the js package versions to the upcoming releases by @Retribution98 in #2596
- Add GHA component for tests and samples dependencies by @as-suvorov in #2594
- Add jinja2 to requirements for chat template by @isanghao in #2597
- [CMAKE] Added an option to disable tests, samples and tools build by @mryzhov in #2561
- Bump safetensors from 0.5.3 to 0.6.2 in /samples by @as-suvorov in #2598
- Bump pybind11-stubgen from 2.5.4 to 2.5.5 by @as-suvorov in #2599
- Add 2nd input to Tokenizers transformation by @pavel-esir in #2457
- Bump torchvision from 0.17.2 to 0.23.0+cpu by @dependabot[bot] in #2570
- Fix SnapKV scoring by @vshampor in #2591
- Fix get_max_new_tokens() by @michalkulakowski in #2416
- Bump torchvision from 0.17.2 to 0.23.0+cpu in /tests/python_tests by @dependabot[bot] in #2566
- Bump actions/checkout from 4 to 5 by @dependabot[bot] in #2602
- preserve properties from allowlist in Tokenizer by @pavel-esir in #2604
- KVCrush method for cache eviction [Updated] by @gopikrishnajha in #2523
- Bump json5 from 0.12.0 to 0.12.1 by @dependabot[bot] in #2606
- Structured Output with Compound Grammar by @apaniukov in #2587
- Expose StructuredOutputConfig validation method by @mzegla in #2576
- [CMAKE][MERGE] Fix samples installation by @mryzhov in #2650
- [release] Upgrade transformers to 4.53.3 by @Wovchena in #2651
- [release] downgrade xgrammar version by @pavel-esir in #2664
- [release] Fix coverity for release by @as-suvorov in #2661
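
For illustration, a minimal sketch of the structured output feature listed above (#2295): it assumes the Python API exposes `StructuredOutputConfig` as in the release's samples, and the model directory is hypothetical.

```python
# Minimal sketch: constrain LLM output to a JSON schema via the XGrammar backend.
# The model path is hypothetical; export a chat model with optimum-cli first.
import json
import openvino_genai

pipe = openvino_genai.LLMPipeline("path/to/exported_model", "CPU")

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 128
# The grammar backend enforces the schema token by token during decoding.
config.structured_output_config = openvino_genai.StructuredOutputConfig(
    json_schema=json.dumps({
        "type": "object",
        "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
        "required": ["name", "age"],
    })
)

print(pipe.generate("Describe a person as JSON.", config))
```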
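
Likewise, a hedged sketch of the new text embedding and text rerank pipelines (#2385, #2436, #2474); the model directories are hypothetical and the method signatures are assumptions based on the documentation added in this release.

```python
# Sketch of TextEmbeddingPipeline / TextRerankPipeline usage; paths are
# hypothetical and the exact signatures may differ from this release.
import openvino_genai

embedder = openvino_genai.TextEmbeddingPipeline("path/to/embedding_model", "CPU")
doc_vectors = embedder.embed_documents(["OpenVINO GenAI runs GenAI models locally."])
query_vector = embedder.embed_query("What runs GenAI models locally?")

reranker = openvino_genai.TextRerankPipeline("path/to/rerank_model", "CPU")
# rerank() is assumed to return (index, score) pairs ordered by relevance.
results = reranker.rerank(
    "What runs GenAI models locally?",
    ["OpenVINO GenAI runs GenAI models locally.", "Unrelated text."],
)
```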
New Contributors
- @adrianboguszewski made their first contribution in #2287
- @JeevakaPrabu made their first contribution in #2446
- @BrandonWeng made their first contribution in #2414
- @ceciliapeng2011 made their first contribution in #2330
Full Changelog: 2025.2.0.0...2025.3.0.0