Conversation

@sammysun0711 commented May 15, 2025

Details:
This PR aims to cache the OV model generated from a GGUF model on disk, for faster subsequent LLMPipeline initialization with the OpenVINO model cache.

  • Serialize the OV model generated from the GGUF model with the GGUF Reader, controlled by the property ov::genai::enable_save_ov_model (default value is false); see the usage sketch after this list.
  • The user can check whether an OV model already exists in the same folder as the GGUF model and load the OV model directly instead of re-creating it from the GGUF model with the GGUF Reader.
  • If the GGUF model is updated, the user takes responsibility for cache invalidation and must re-generate the OV model with the GGUF Reader.
  • Use the OPENVINO_LOG_LEVEL environment variable to control the verbosity of GGUF-related debug information; for details, please refer to DEBUG_LOG.md.
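
For illustration, enabling the serialization from application code might look like the following minimal sketch. It assumes enable_save_ov_model follows the usual ov::Property call syntax and borrows the paths and prompt from the example runs below:

```cpp
#include <iostream>

#include "openvino/genai/llm_pipeline.hpp"

int main() {
    // Opt in to serializing the generated OV model to disk (default: false).
    // The call syntax is assumed to follow the standard ov::Property pattern.
    ov::genai::LLMPipeline pipe(
        "gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf", "CPU",
        ov::genai::enable_save_ov_model(true));
    // Subsequent runs can point at the folder containing the serialized
    // openvino_model.xml instead of the GGUF file.
    std::cout << pipe.generate("Who are you?", ov::genai::max_new_tokens(100)) << '\n';
}
```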

Expected behavior:

  • Set environment variable: export OPENVINO_LOG_LEVEL=3

  • First run w/ GGUF model:

    • build/samples/cpp/text_generation/greedy_causal_lm gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf "Who are you?"

    [GGUF Reader]: Loading and unpacking model from: gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf
    [GGUF Reader]: Loading and unpacking model done. Time: 196ms
    [GGUF Reader]: Start generating OpenVINO model...
    [GGUF Reader]: Save generated OpenVINO model to: gguf_models/openvino_model.xml done. Time: 466 ms
    [GGUF Reader]: Model generation done. Time: 757ms
    I am Qwen, a large language model created by Alibaba Cloud. I am a language model designed to assist users in generating human-like text, such as writing articles, stories, and even writing books. I am trained on a vast corpus of text data, including books, articles, and other written works. I am also trained on a large corpus of human language data, including written and spoken language. I am designed to provide information and insights to users, and to assist them in their tasks and

  • 2nd run w/ OV model:

    • build/samples/cpp/text_generation/greedy_causal_lm gguf_models "Who are you?"

    I am Qwen, a large language model created by Alibaba Cloud. I am a language model designed to assist users in generating human-like text, such as writing articles, stories, and even writing books. I am trained on a vast corpus of text data, including books, articles, and other written works. I am also trained on a large corpus of human language data, including written and spoken language. I am designed to provide information and insights to users, and to assist them in their tasks and

@sammysun0711 changed the title from "[GGUF] Cache Generated OV Model for Faster Initialziation" to "[GGUF] Serialize Generated OV Model for Faster Pipeline Initialization" May 15, 2025
@sammysun0711 changed the title from "[GGUF] Serialize Generated OV Model for Faster Pipeline Initialization" to "[GGUF] Serialize Generated OV Model for Faster Pipeline Init" May 15, 2025
@sammysun0711 changed the title from "[GGUF] Serialize Generated OV Model for Faster Pipeline Init" to "[GGUF] Serialize Generated OV Model for Faster LLMPipeline Init" May 15, 2025
@sammysun0711 requested review from Copilot and Wovchena May 15, 2025 14:30
@Copilot (Contributor) left a comment

Pull Request Overview

This PR enhances the model initialization performance by caching generated OpenVINO models on disk and reusing them for subsequent runs. Key changes include:

  • In src/cpp/src/utils.cpp, adding a check for an existing cached OpenVINO model based on the GGUF model location.
  • In src/cpp/src/gguf_utils/gguf_modeling.cpp, introducing a new function to serialize and save the generated OpenVINO model, and invoking it during model creation.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

  • src/cpp/src/utils.cpp: Added logic to reuse a cached OpenVINO model if it exists.
  • src/cpp/src/gguf_utils/gguf_modeling.cpp: Added serialization function and integrated it into the model creation flow.
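
Conceptually, the reuse check in src/cpp/src/utils.cpp boils down to probing for a serialized model next to the GGUF file. The helper below is an illustrative sketch with a hypothetical name, not the actual code:

```cpp
#include <filesystem>

// Hypothetical helper: if an openvino_model.xml already sits next to the
// GGUF file, return its path so the caller can load it directly; otherwise
// return an empty path and fall back to GGUF -> OV conversion.
std::filesystem::path find_cached_ov_model(const std::filesystem::path& gguf_path) {
    auto candidate = gguf_path.parent_path() / "openvino_model.xml";
    return std::filesystem::exists(candidate) ? candidate : std::filesystem::path{};
}
```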

@as-suvorov (Collaborator) commented
@sammysun0711
The proposal looks like implicit model caching: the OpenVINO model is implicitly serialized to disk, and if the GGUF model has changed, it seems the outdated serialized OpenVINO model will still be loaded.
Can we reuse the OpenVINO cache_dir property for this scenario?

From the logs I see the GGUF model load time is 245ms. Could you please also provide the loading time for the serialized OpenVINO model?

@sammysun0711 (Author) commented

Hi @as-suvorov, thanks for your suggestion.

> Can we reuse OpenVINO cache_dir property for such scenario?

Do you mean that if the cache_dir property is set, we can save the generated OV model into cache_dir for reuse? I think it is a good proposal: the user can explicitly control from the application whether to serialize the OV model to disk.

> Could you please also provide loading time for OpenVINO serialized model?

Sure, I will add the loading time for the serialized OpenVINO model.
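
For illustration, the application-side control being discussed might look like this sketch, assuming ov::cache_dir is forwarded to the pipeline constructor like other properties (this is the proposal under discussion, not the merged behavior):

```cpp
#include "openvino/genai/llm_pipeline.hpp"
#include "openvino/runtime/properties.hpp"

int main() {
    // Assumption: passing the standard ov::cache_dir property makes the
    // pipeline serialize the OV model generated from the GGUF file into
    // that directory and reuse it on later runs.
    ov::genai::LLMPipeline pipe(
        "gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf", "CPU",
        ov::cache_dir("model_cache"));
}
```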

@as-suvorov (Collaborator) commented

@sammysun0711 yes, correct. I guess we also need to check if and how OpenVINO invalidates cached models and implement the same approach for the GGUF format.

@sammysun0711 (Author) commented

  • Added support to save the OV model based on the ov::cache_dir property explicitly.
  • Added time measurement for loading the OV model.

1st Run w/o model cache:

Loading and unpacking model from: gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf
Loading and unpacking model done. Time: 202ms
Start generating OpenVINO model...
Model generation done. Time: 292ms

2nd Run w/ model cache + serialize OV model:

Loading and unpacking model from: gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf
Loading and unpacking model done. Time: 189ms
Start generating OpenVINO model...
Save generated OpenVINO model to: model_cache/qwen2.5-0.5b-instruct-q4_0_openvino_model.xml done. Time: 379 ms
Model generation done. Time: 647ms

3rd Run w/ model cache, using the generated OV model:

Found generated OpenVINO model: model_cache/qwen2.5-0.5b-instruct-q4_0_openvino_model.xml, skip creating from GGUF model.
Loading OpenVINO model done. Time: 71ms

> check if and how OpenVINO invalidates cached model and implement same approach for gguf format

The current naive method uses the GGUF file name to check whether a new OV model needs to be regenerated. OpenVINO invalidates an outdated model cache via a hash calculated by compute_hash.hpp, but it seems compute_hash.hpp belongs to the dev API, which is not accessible unless openvino.genai links statically to openvino. @as-suvorov, may I know if you have any suggestions?
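
For illustration, the naive file-name-based scheme might be as simple as the sketch below (the helper name is hypothetical, not the PR's actual code); it reproduces the cache path seen in the logs above:

```cpp
#include <filesystem>

// Hypothetical helper: derive the cached model path from the GGUF file stem,
// matching "model_cache/qwen2.5-0.5b-instruct-q4_0_openvino_model.xml".
std::filesystem::path cached_model_path(const std::filesystem::path& cache_dir,
                                        const std::filesystem::path& gguf_path) {
    return cache_dir / (gguf_path.stem().string() + "_openvino_model.xml");
}
```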

@sammysun0711 (Author) commented May 19, 2025

I added an initial hash function in a test branch, but found that the hash function overhead is higher than the model construction itself.

Loading and unpacking model from: gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf
Loading and unpacking model done. Time: 185ms
Compute GGUF hash with SHA1 done. Time: 856ms
Start generating OpenVINO model...
Save generated OpenVINO model to: model_cache/qwen2.5-0.5b-instruct-q4_0_openvino_model.xml done. Time: 567 ms
Model generation done. Time: 1702ms

Although we could try to optimize the hash method with parallelism, the hash function would be called on every GGUF load, which is not ideal if we want to skip the GGUF model load & unpack process when an OV model already exists in the model cache.

I would suggest serializing the OV model into cache_dir when the user sets the cache_dir property explicitly, and adding a verbose log that an existing cached OV model will skip the GGUF->OV conversion without cache invalidation. The user may need to clean up the model cache and re-generate the OV model if the GGUF model is updated.

We can add the hash-based check once compute_hash.hpp is exposed as a public API for reuse.
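
As a side note, a common low-overhead alternative to content hashing is a timestamp-based staleness check. The sketch below is illustrative only and is not what this PR implements (the PR leaves invalidation to the user):

```cpp
#include <filesystem>

// Illustrative only: treat the cache as stale when it is missing or older
// than the GGUF file. Much cheaper than hashing the whole file with SHA1,
// but weaker: it misses content changes that preserve the timestamp.
bool cache_is_stale(const std::filesystem::path& gguf_path,
                    const std::filesystem::path& cached_xml) {
    namespace fs = std::filesystem;
    return !fs::exists(cached_xml) ||
           fs::last_write_time(gguf_path) > fs::last_write_time(cached_xml);
}
```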

@as-suvorov (Collaborator) commented

> I would suggest serializing the OV model into cache_dir when the user sets the cache_dir property explicitly [...] We can add the hash-based check once compute_hash.hpp is exposed as a public API for reuse.

@Wovchena What do you think?

@sammysun0711 requested a review from Wovchena May 20, 2025 05:09
@sammysun0711 added this to the 2025.2 milestone May 20, 2025
@rkazants self-requested a review June 4, 2025 06:39
@rkazants (Collaborator) left a comment

need review again

@github-actions bot added the "category: GHA CI" label Jun 4, 2025
@github-actions bot removed the "category: GHA CI" label Jun 6, 2025
@sammysun0711 requested a review from rkazants June 10, 2025 14:45
@sammysun0711 (Author) commented Jun 10, 2025
> need review again

Refactored the tests and all tests passed; please kindly review again, thanks!

@rkazants added this pull request to the merge queue Jun 12, 2025
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 12, 2025
@rkazants added this pull request to the merge queue Jun 13, 2025
Merged via the queue into openvinotoolkit:master with commit 035b591 Jun 13, 2025
89 of 91 checks passed
sammysun0711 pushed a commit to sammysun0711/openvino.genai that referenced this pull request Jun 20, 2025
…vinotoolkit#2218)


Co-authored-by: Andrei Kochin <[email protected]>
Co-authored-by: Copilot <[email protected]>