Skip to content

Conversation

jesusmb1995
Copy link

@jesusmb1995 jesusmb1995 commented Aug 28, 2025

Builds changes from #1 on top of Tether synced fork. After #4 was merged to sync with upstream and add Tether fork changes.

Git diff can be used to check the new re-based branch includes same changes of temp-load-from-buffer, changes shown should be those from the re-base:

➜  llamacpp_tether git:(temp-load-from-buffer-rebased-QVAC4552) ✗ git diff --name-only temp-load-from-buffer-rebased-QVAC4552 tetherto/temp-
load-from-buffer | tee
CMakeLists.txt
cmake/llama-config.cmake.in
common/CMakeLists.txt
ggml/CMakeLists.txt
ggml/cmake/ggml-config.cmake.in
ggml/src/CMakeLists.txt
ggml/src/ggml-vulkan/CMakeLists.txt
src/CMakeLists.txt
tools/mtmd/CMakeLists.txt

First commit of this PR is based on:

commit 88d711fad2ac0a1392eb916663448cbf71d64b0a
Author: Jesús <[email protected]>
Date:   Wed Jul 16 11:07:22 2025 +0200

    [common] Pure interface for files
    
    Convert llama_file to a pure virtual class that can be overriden by multiple implementations (disk, single memory buffer, ...)

commit ab269c4c47b19117b956d65539b733c0fe136d33 (tetherto/master, tetherto/HEAD)
Merge: 4fb255655 ce648804b
Author: Yury Samarin <[email protected]>
Date:   Thu Aug 28 08:59:08 2025 +0300

    Merge pull request #4 from jpgaribotti/QVAC-4552
    
    QVAC-4552: Sync port with upstream version b5932

commit ce648804b2881100bd83bd488c8e38a715e7431c (tag: b5932.0.0, jpgaribotti/QVAC-4552)
Author: Juan Pablo Garibotti Arias <[email protected]>
Date:   Wed Aug 13 12:42:39 2025 +0200

    Export mtmd target

Convert llama_file to a pure virtual class that can be overriden by multiple implementations (disk, single memory buffer, ...)
Define a new macro LLAMA_LOG_CMAKE_DEBUG that results in no-op when a release build is activated. This will allow to have a good trace and debugging capabilities that will be specially useful for the async loading of multiple model shards.
This change adds an additional automated test loading from disk, to ensure the existing functionallity does not break.
The gguf-split utility now generates a `.txt` listing all tensors. Useful both for manual inspection/debugging and for incremental tensor loading where its not possible to know tensors present in other split files (the information is critical to handle optional tensors).
Add a flag to the tool to ensure some tensor names are always followed by another tensor and not at the end of a shard. This ensures the shard will not be released when the tensor is processed, and avoid missing-file failures of duplicate tensors that are re-referenced a few tensors later (typically token_embd.weight / output).
Show to which shards belongs each tensor
- Ensures a char trait implementation for uint8 exists, that can be used with std::basic_streambuff.
- Adds an implementation of std::basic_streambuff for a single vector. Will be used by llama.cpp and tests when loading from a single memory buffer.
Override the pure virtual interface with a class that can operate on a single memory buffer.
Auxiliary function to convert a list of C strings to a vector of C++ strings.
Add new GGUF reader implementation that can read metadata from a memory buffer.
- Add code to be able to load a gguf file from a variant (memory or disk).
- Some structs simplify how to load a file and keep track of the pointers (which are now in the same struct).
Move the loader code, that process a file after it has been loaded into memory and populate its own attributes, to a reusable method.
Add new C++ function to Llama main header to load from a single memory buffer, and propagate changes to internal calls/constructors.
A file buffer that can be fulfilled using string keys. The extract method waits until the file is provided.
Handles the logic for incrementally loading files and tensors is model shards.
Refactor backend buffer creation (for model loading) into functions.
- The function now takes size_data instead of the member attribute.
- Sanity checks of file pointer handles

These two changes will be useful when calling `load_all_data` multiple times during incremental shard load.
Adapt the loader and model load to incrementally load files and upload tensors.
Add functions to Llama.cpp public headers to asynchronously load shards.
Split some common loading functionallity. This will help in the memory loading tests.
Add a submodule with re-usable code for tests.
Adapt embedding example to showcase how to load from memory. Can be configured through environment variables.
Adapt simple example to showcase how to load from memory. Can be configured with environment variables.

Qwen3, for example, can be used with the simple example.
Add some automatic tests that load from memory (single buffer or multiple async splits)
@jesusmb1995 jesusmb1995 force-pushed the temp-load-from-buffer-rebased-QVAC4552 branch from bbd1b71 to b6d441b Compare August 28, 2025 08:34
@olek-tether olek-tether merged commit e394035 into tetherto:master Aug 28, 2025
9 of 47 checks passed
jpgaribotti pushed a commit that referenced this pull request Sep 10, 2025
* oai moe

* compat with new checkpoint

* add attn sink impl

* add rope scaling yarn

* logits match with latest transformers code

* wip chat template

* rm trailing space

* use ggml_scale_bias

* rm redundant is_swa_all

* convert interleaved gate_up

* graph : fix activation function to match reference (#7)

* vocab : handle o200k_harmony special tokens

* ggml : add attention sinks support (#1)

* llama : add attn sinks

* ggml : add attn sinks

* cuda : add attn sinks

* vulkan : add support for sinks in softmax

remove unnecessary return

* ggml : add fused swiglu_oai op (#11)

* ggml : add fused swiglu_oai op

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* update CUDA impl

* cont : metal impl

* add vulkan impl

* test-backend-ops : more test cases, clean up

* llama : remove unfused impl

* remove extra lines

---------

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: slaren <[email protected]>

* repack mxfp4 upon conversion

* clean up a bit

* enable thinking

* add quick hack to render only some special tokens

* fix bf16 conversion

* remove vocab hack

* webui ok

* support chat parsing for gpt-oss

* fix webui

* direct mapping mxfp4, FINALLY

* force using mxfp4

* properly use lazy tensor

* ggml : add mxfp4

ggml : use e8m0 conversion instead of powf

Co-authored-by: Diego Devesa <[email protected]>

change kvalues_mxfp4 table to match e2m1 (#6)

metal : remove quantization for now (not used)

cuda : fix disabled CUDA graphs due to ffn moe bias

vulkan : add support for mxfp4

cont : add cm2 dequant

* ggml : add ggml_add_id (#13)

* ggml : add ggml_add_id

* add cuda impl

* llama : add weight support check for add_id

* perf opt

* add vulkan impl

* rename cuda files

* add metal impl

* allow in-place ggml_add_id

* llama : keep biases on CPU with --cpu-moe

* llama : fix compile error

ggml-ci

* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw

ggml-ci

* cleanup

ggml-ci

* sycl : fix supports_op for MXFP4

ggml-ci

* fix Unknown reasoning format

* ggml-cpu : fix AVX build

ggml-ci

* fix hip build

ggml-ci

* cuda : add mxfp4 dequantization support for cuBLAS

ggml-ci

* ggml-cpu : fix mxfp4 fallback definitions for some architectures

ggml-ci

* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: slaren <[email protected]>
jpgaribotti pushed a commit that referenced this pull request Sep 10, 2025
* [common] Pure interface for files

Convert llama_file to a pure virtual class that can be overriden by multiple implementations (disk, single memory buffer, ...)

* [common] Compile time debug logs

Define a new macro LLAMA_LOG_CMAKE_DEBUG that results in no-op when a release build is activated. This will allow to have a good trace and debugging capabilities that will be specially useful for the async loading of multiple model shards.

* [aux] Test full load from disk

This change adds an additional automated test loading from disk, to ensure the existing functionallity does not break.

* [aux] GGUF split summary

The gguf-split utility now generates a `.txt` listing all tensors. Useful both for manual inspection/debugging and for incremental tensor loading where its not possible to know tensors present in other split files (the information is critical to handle optional tensors).

* [aux] gguf tensor must be followed

Add a flag to the tool to ensure some tensor names are always followed by another tensor and not at the end of a shard. This ensures the shard will not be released when the tensor is processed, and avoid missing-file failures of duplicate tensors that are re-referenced a few tensors later (typically token_embd.weight / output).

* [aux] verbose gguf split

Show to which shards belongs each tensor

* [common] Stream buffer for uint8 data

- Ensures a char trait implementation for uint8 exists, that can be used with std::basic_streambuff.
- Adds an implementation of std::basic_streambuff for a single vector. Will be used by llama.cpp and tests when loading from a single memory buffer.

* [mbuffer] Llama file buffer implementation

Override the pure virtual interface with a class that can operate on a single memory buffer.

* [refactor] C splits into C++

Auxiliary function to convert a list of C strings to a vector of C++ strings.

* [common] GGUF reader from memory

Add new GGUF reader implementation that can read metadata from a memory buffer.

* [refactor][mbuffer] File load from variant

- Add code to be able to load a gguf file from a variant (memory or disk).
- Some structs simplify how to load a file and keep track of the pointers (which are now in the same struct).

* [refactor] Process file method

Move the loader code, that process a file after it has been loaded into memory and populate its own attributes, to a reusable method.

* [mbuffer] Expose single-buffer loading to Llama interface

Add new C++ function to Llama main header to load from a single memory buffer, and propagate changes to internal calls/constructors.

* [fbuffers] Future file buffer implementation

A file buffer that can be fulfilled using string keys. The extract method waits until the file is provided.

* [fbuffers] Incremental loading of future files

Handles the logic for incrementally loading files and tensors is model shards.

* [refactor] Create backend buffers

Refactor backend buffer creation (for model loading) into functions.

* [refactor] Load all data

- The function now takes size_data instead of the member attribute.
- Sanity checks of file pointer handles

These two changes will be useful when calling `load_all_data` multiple times during incremental shard load.

* [fbuffers] Incremental model load

Adapt the loader and model load to incrementally load files and upload tensors.

* [fbuffers] Expose async interface

Add functions to Llama.cpp public headers to asynchronously load shards.

* [refactor] Increase common loading granularity

Split some common loading functionallity. This will help in the memory loading tests.

* [aux] Common test

Add a submodule with re-usable code for tests.

* [aux] Memory example (embedding)

Adapt embedding example to showcase how to load from memory. Can be configured through environment variables.

* [aux] Memory example (simple)

Adapt simple example to showcase how to load from memory. Can be configured with environment variables.

Qwen3, for example, can be used with the simple example.

* [aux] Auto. memory loading tests

Add some automatic tests that load from memory (single buffer or multiple async splits)
gianni-cor pushed a commit to gianni-cor/qvac-ext-lib-llama.cpp that referenced this pull request Sep 18, 2025
…gml-org#16038)

Initalizing RESERVED_NAME in is_reserved_name() is not thread
safe and leads to corrupted memory when used from multiple threads
as can be seen in the asan trace below. This fixes the initialization
to make it thread-safe.

    #0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565
    tetherto#1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802
    tetherto#2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
    tetherto#3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
    tetherto#4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)> const&) chat.cpp:762
    tetherto#5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319
    tetherto#6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982
    tetherto#7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool) chat.cpp:1110
    tetherto#8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:1992
    tetherto#9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:2074
    tetherto#10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120)
    ...

==45482==Register values:
 x[0] = 0x00006020004147f8   x[1] = 0x00006080000013c8   x[2] = 0x0000000000000000   x[3] = 0x0000604006289738
 x[4] = 0x0000000000000002   x[5] = 0x0000000000000001   x[6] = 0x04034000004b4000   x[7] = 0x0000000000000001
 x[8] = 0xbebebebebebebebe   x[9] = 0x17d7d7d7d7d7d7d7  x[10] = 0x00000c04000828ff  x[11] = 0x0000000000000001
x[12] = 0x000000002018d383  x[13] = 0x0000000000000000  x[14] = 0xfa0000000000fafa  x[15] = 0x000010700001ffff
x[16] = 0x000000019dc012c0  x[17] = 0x00000001021284f8  x[18] = 0x0000000000000000  x[19] = 0x00000001700acdc0
x[20] = 0x0000000000000002  x[21] = 0x000000002018d384  x[22] = 0x16dd16fd2e731151  x[23] = 0x0000007000020000
x[24] = 0x0000000100c69c08  x[25] = 0x0000000100c69c20  x[26] = 0x00006080000013c7  x[27] = 0x0000000100c69c00
x[28] = 0x00000001700acd60     fp = 0x00000001700aceb0     lr = 0x0000000100abce30     sp = 0x00000001700acd60
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&)
Thread T5 created by T0 here:
    #0 0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4)
    tetherto#1 0x000100873910 in std::sys::pal::unix::thread::Thread::new::h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910)
    tetherto#2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c)
    tetherto#3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0)
    tetherto#4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758)
    tetherto#5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0)
    ...

==45482==ABORTING
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants