Conversation

@sammysun0711 commented May 15, 2025

Details:
This PR aims to cache the OV model generated from a GGUF model on disk, for faster subsequent LLMPipeline initialization with the OpenVINO model cache.

  • Serialize the OV model generated from the GGUF model with the GGUF Reader, controlled by the property ov::genai::enable_save_ov_model (default value is false); see the usage sketch after this list.
  • The user can check whether an OV model already exists in the same folder as the GGUF model and load the OV model directly instead of re-creating it from the GGUF model with the GGUF Reader.
  • If the GGUF model is updated, the user takes responsibility for cache invalidation and must re-generate the OV model with the GGUF Reader.
  • Use the OPENVINO_LOG_LEVEL environment variable to control the verbosity of GGUF-related debug information; for details, please refer to DEBUG_LOG.md.
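
For illustration, enabling the serialization from application code might look like the following minimal sketch. It assumes enable_save_ov_model follows the usual ov::Property call syntax and borrows the paths and prompt from the example runs below:

```cpp
#include <iostream>

#include "openvino/genai/llm_pipeline.hpp"

int main() {
    // Opt in to serializing the generated OV model to disk (default: false).
    // The call syntax is assumed to follow the standard ov::Property pattern.
    ov::genai::LLMPipeline pipe(
        "gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf", "CPU",
        ov::genai::enable_save_ov_model(true));
    // Subsequent runs can point at the folder containing the serialized
    // openvino_model.xml instead of the GGUF file.
    std::cout << pipe.generate("Who are you?", ov::genai::max_new_tokens(100)) << '\n';
}
```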

Expected behavior:

  • Set environment variable: export OPENVINO_LOG_LEVEL=3

  • First run w/ GGUF model:

    • build/samples/cpp/text_generation/greedy_causal_lm gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf "Who are you?"

    [GGUF Reader]: Loading and unpacking model from: gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf
    [GGUF Reader]: Loading and unpacking model done. Time: 196ms
    [GGUF Reader]: Start generating OpenVINO model...
    [GGUF Reader]: Save generated OpenVINO model to: gguf_models/openvino_model.xml done. Time: 466 ms
    [GGUF Reader]: Model generation done. Time: 757ms
    I am Qwen, a large language model created by Alibaba Cloud. I am a language model designed to assist users in generating human-like text, such as writing articles, stories, and even writing books. I am trained on a vast corpus of text data, including books, articles, and other written works. I am also trained on a large corpus of human language data, including written and spoken language. I am designed to provide information and insights to users, and to assist them in their tasks and

  • 2nd run w/ OV model:

    • build/samples/cpp/text_generation/greedy_causal_lm gguf_models "Who are you?"

    I am Qwen, a large language model created by Alibaba Cloud. I am a language model designed to assist users in generating human-like text, such as writing articles, stories, and even writing books. I am trained on a vast corpus of text data, including books, articles, and other written works. I am also trained on a large corpus of human language data, including written and spoken language. I am designed to provide information and insights to users, and to assist them in their tasks and

@sammysun0711 changed the title from "[GGUF] Cache Generated OV Model for Faster Initialziation" to "[GGUF] Serialize Generated OV Model for Faster Pipeline Initialization" May 15, 2025
@sammysun0711 changed the title from "[GGUF] Serialize Generated OV Model for Faster Pipeline Initialization" to "[GGUF] Serialize Generated OV Model for Faster Pipeline Init" May 15, 2025
@sammysun0711 changed the title from "[GGUF] Serialize Generated OV Model for Faster Pipeline Init" to "[GGUF] Serialize Generated OV Model for Faster LLMPipeline Init" May 15, 2025
@sammysun0711 requested review from Copilot and Wovchena May 15, 2025 14:30
@Copilot (Contributor) left a comment

Pull Request Overview

This PR enhances the model initialization performance by caching generated OpenVINO models on disk and reusing them for subsequent runs. Key changes include:

  • In src/cpp/src/utils.cpp, adding a check for an existing cached OpenVINO model based on the GGUF model location.
  • In src/cpp/src/gguf_utils/gguf_modeling.cpp, introducing a new function to serialize and save the generated OpenVINO model, and invoking it during model creation.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

  • src/cpp/src/utils.cpp: Added logic to reuse a cached OpenVINO model if it exists.
  • src/cpp/src/gguf_utils/gguf_modeling.cpp: Added serialization function and integrated it into the model creation flow.
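
Conceptually, the reuse check in src/cpp/src/utils.cpp boils down to probing for a serialized model next to the GGUF file. The helper below is an illustrative sketch with a hypothetical name, not the actual code:

```cpp
#include <filesystem>

// Hypothetical helper: if an openvino_model.xml already sits next to the
// GGUF file, return its path so the caller can load it directly; otherwise
// return an empty path and fall back to GGUF -> OV conversion.
std::filesystem::path find_cached_ov_model(const std::filesystem::path& gguf_path) {
    auto candidate = gguf_path.parent_path() / "openvino_model.xml";
    return std::filesystem::exists(candidate) ? candidate : std::filesystem::path{};
}
```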

@as-suvorov (Collaborator) commented
@sammysun0711
The proposal looks like implicit model caching: the OpenVINO model is implicitly serialized to disk, and if the GGUF model has changed, it seems the outdated serialized OpenVINO model will still be loaded.
Can we reuse the OpenVINO cache_dir property for this scenario?

From the logs I see the GGUF model load time is 245ms. Could you please also provide the loading time for the serialized OpenVINO model?

@sammysun0711 (Author) commented

Hi @as-suvorov, thanks for your suggestion.

> Can we reuse OpenVINO cache_dir property for such scenario?

Do you mean that if the cache_dir property is set, we can save the generated OV model into cache_dir for reuse? I think it is a good proposal: the user can explicitly control from the application whether to serialize the OV model to disk.

> Could you please also provide loading time for OpenVINO serialized model?

Sure, I will add the loading time for the serialized OpenVINO model.
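
For illustration, the application-side control being discussed might look like this sketch, assuming ov::cache_dir is forwarded to the pipeline constructor like other properties (this is the proposal under discussion, not the merged behavior):

```cpp
#include "openvino/genai/llm_pipeline.hpp"
#include "openvino/runtime/properties.hpp"

int main() {
    // Assumption: passing the standard ov::cache_dir property makes the
    // pipeline serialize the OV model generated from the GGUF file into
    // that directory and reuse it on later runs.
    ov::genai::LLMPipeline pipe(
        "gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf", "CPU",
        ov::cache_dir("model_cache"));
}
```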

@as-suvorov (Collaborator) commented

@sammysun0711 yes, correct. I guess we also need to check if and how OpenVINO invalidates cached models and implement the same approach for the GGUF format.

@sammysun0711 (Author) commented

  • Added support to save the OV model based on the ov::cache_dir property explicitly.
  • Added time measurement for loading the OV model.

1st Run w/o model cache:

Loading and unpacking model from: gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf
Loading and unpacking model done. Time: 202ms
Start generating OpenVINO model...
Model generation done. Time: 292ms

2nd Run w/ model cache + serialize OV model:

Loading and unpacking model from: gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf
Loading and unpacking model done. Time: 189ms
Start generating OpenVINO model...
Save generated OpenVINO model to: model_cache/qwen2.5-0.5b-instruct-q4_0_openvino_model.xml done. Time: 379 ms
Model generation done. Time: 647ms

3rd Run w/ model cache, using the generated OV model:

Found generated OpenVINO model: model_cache/qwen2.5-0.5b-instruct-q4_0_openvino_model.xml, skip creating from GGUF model.
Loading OpenVINO model done. Time: 71ms

> check if and how OpenVINO invalidates cached model and implement same approach for gguf format

The current naive method uses the GGUF file name to check whether a new OV model needs to be regenerated. OpenVINO invalidates an outdated model cache via a hash calculated by compute_hash.hpp, but it seems compute_hash.hpp belongs to the dev API, which is not accessible unless openvino.genai links statically to openvino. @as-suvorov, may I know if you have any suggestions?
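
For illustration, the naive file-name-based scheme might be as simple as the sketch below (the helper name is hypothetical, not the PR's actual code); it reproduces the cache path seen in the logs above:

```cpp
#include <filesystem>

// Hypothetical helper: derive the cached model path from the GGUF file stem,
// matching "model_cache/qwen2.5-0.5b-instruct-q4_0_openvino_model.xml".
std::filesystem::path cached_model_path(const std::filesystem::path& cache_dir,
                                        const std::filesystem::path& gguf_path) {
    return cache_dir / (gguf_path.stem().string() + "_openvino_model.xml");
}
```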

@sammysun0711 (Author) commented May 19, 2025

I added an initial hash function in a test branch, but found that the hash function overhead is higher than the model construction itself.

Loading and unpacking model from: gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf
Loading and unpacking model done. Time: 185ms
Compute GGUF hash with SHA1 done. Time: 856ms
Start generating OpenVINO model...
Save generated OpenVINO model to: model_cache/qwen2.5-0.5b-instruct-q4_0_openvino_model.xml done. Time: 567 ms
Model generation done. Time: 1702ms

Although we could try to optimize the hash method with parallelism, the hash function would be called on every GGUF load, which is not ideal if we want to skip the GGUF model load & unpack process when an OV model already exists in the model cache.

I would suggest serializing the OV model into cache_dir when the user sets the cache_dir property explicitly, and adding a verbose log that an existing cached OV model will skip the GGUF->OV conversion without cache invalidation. The user may need to clean up the model cache and re-generate the OV model if the GGUF model is updated.

We can add the hash-based check once compute_hash.hpp is exposed as a public API for reuse.
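
As a side note, a common low-overhead alternative to content hashing is a timestamp-based staleness check. The sketch below is illustrative only and is not what this PR implements (the PR leaves invalidation to the user):

```cpp
#include <filesystem>

// Illustrative only: treat the cache as stale when it is missing or older
// than the GGUF file. Much cheaper than hashing the whole file with SHA1,
// but weaker: it misses content changes that preserve the timestamp.
bool cache_is_stale(const std::filesystem::path& gguf_path,
                    const std::filesystem::path& cached_xml) {
    namespace fs = std::filesystem;
    return !fs::exists(cached_xml) ||
           fs::last_write_time(gguf_path) > fs::last_write_time(cached_xml);
}
```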

@as-suvorov (Collaborator) commented

> I would suggest serializing the OV model into cache_dir when the user sets the cache_dir property explicitly [...] We can add the hash-based check once compute_hash.hpp is exposed as a public API for reuse.

@Wovchena What do you think?

@sammysun0711 requested a review from Wovchena May 20, 2025 05:09
@sammysun0711 added this to the 2025.2 milestone May 20, 2025
@rkazants self-requested a review June 4, 2025 06:39
@rkazants (Collaborator) left a comment

need review again

@github-actions bot added the "category: GHA CI" label Jun 4, 2025
@github-actions bot removed the "category: GHA CI" label Jun 6, 2025
@sammysun0711 requested a review from rkazants June 10, 2025 14:45
@sammysun0711 (Author) commented Jun 10, 2025
> need review again

Refactored the tests and all tests passed; please kindly review again, thanks!

@rkazants added this pull request to the merge queue Jun 12, 2025
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 12, 2025
@rkazants added this pull request to the merge queue Jun 13, 2025
Merged via the queue into openvinotoolkit:master with commit 035b591 Jun 13, 2025
89 of 91 checks passed
sammysun0711 pushed a commit to sammysun0711/openvino.genai that referenced this pull request Jun 20, 2025
…vinotoolkit#2218)


Co-authored-by: Andrei Kochin <[email protected]>
Co-authored-by: Copilot <[email protected]>