Name and Version
version: 5947 (2be60cbc)
built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
Ryzen Threadripper PRO 7975WX, 2x RTX 4090, 512 GiB DDR5 RAM
Models
https://huggingface.co/baidu/ERNIE-4.5-300B-A47B-PT, quantized using the imatrix from https://huggingface.co/mradermacher/ERNIE-4.5-300B-A47B-Base-PT-i1-GGUF/blob/main/imatrix.dat
Problem description & steps to reproduce
Try quantizing https://huggingface.co/baidu/ERNIE-4.5-300B-A47B-PT at IQ1_S or IQ1_M using https://huggingface.co/mradermacher/ERNIE-4.5-300B-A47B-Base-PT-i1-GGUF/blob/main/imatrix.dat as the imatrix, and llama.cpp crashes with GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_shift != 0) failed (IQ1_S) or the analogous GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed (IQ1_M, shown in the log below). The issue does not occur for higher-bit quants, with or without an imatrix.
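For context, the failing assert sits at the end of a best-candidate search inside the IQ1 quantizers. The following is a minimal hypothetical sketch of that search shape, not the actual ggml-quants.c code: assuming every importance weight in a block is zero (as can happen for MoE expert tensors whose experts were never activated while the imatrix was computed), the guarded update never runs, the indices stay at -1, and the assert aborts, which would match the backtrace through quantize_iq1_m in the log below.

```c
/* Hypothetical sketch only -- NOT the real ggml-quants.c implementation.
 * It reproduces the failure shape: with all-zero importance weights,
 * sumq2 never becomes positive, the best indices are never set, and
 * the trailing assert (mirroring the one in the log) aborts. */
#include <assert.h>
#include <stdio.h>

static void quantize_block_sketch(const float *weight, const float *x, int n) {
    float best_score = 0.0f;
    int   besti1 = -1, besti2 = -1, best_k = -1;

    for (int i1 = 0; i1 <= n; ++i1) {
        for (int i2 = i1; i2 <= n; ++i2) {
            for (int k = 0; k < 4; ++k) {          /* candidate sign patterns */
                float sumqx = 0.0f, sumq2 = 0.0f;
                for (int j = 0; j < n; ++j) {
                    /* split the block into -1 / 0 / +1 regions at (i1, i2) */
                    float q = (j < i1) ? -1.0f : (j < i2) ? 0.0f : 1.0f;
                    sumqx += weight[j] * x[j] * q;
                    sumq2 += weight[j] * q * q;    /* stays 0 if all weights are 0 */
                }
                if (sumq2 > 0 && sumqx * sumqx > best_score * sumq2) {
                    best_score = sumqx * sumqx / sumq2;
                    besti1 = i1; besti2 = i2; best_k = k;
                }
            }
        }
    }
    /* mirrors the assert that fails in the log */
    assert(besti1 >= 0 && besti2 >= 0 && best_k >= 0);
}

int main(void) {
    float w[8] = {0};                              /* all-zero importance weights */
    float x[8] = {1, -2, 3, -4, 5, -6, 7, -8};
    quantize_block_sketch(w, x, 8);                /* aborts at the assert */
    printf("not reached\n");
    return 0;
}
```

This is only one plausible trigger; the actual search in ggml-quants.c is more involved.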
First Bad Commit
Broken since Ernie4.5 MoE support was introduced by @pwilkin and @CISC in #14658 and #14746. I tested before and after #9400 was merged, with the same result.
Relevant log output
root@AI:/apool/llama.cpp/build/bin# ./llama-quantize --imatrix /mradermacher/tmp/quant/ERNIE-4.5-300B-A47B-PT-i1-GGUF/imatrix.dat /mradermacher/tmp/quant/ERNIE-4.5-300B-A47B-PT.gguf /mradermacher/root/ERNIE-4.5-300B-A47B-PT.i1-IQ1_M.gguf IQ1_M
gguf_init_from_file_impl: invalid magic characters: '????', expected 'GGUF'
load_imatrix: imatrix file '/mradermacher/tmp/quant/ERNIE-4.5-300B-A47B-PT-i1-GGUF/imatrix.dat' is using old format
load_legacy_imatrix: imatrix dataset='imatrix-training-full-3'
load_legacy_imatrix: loaded 429 importance matrix entries from /mradermacher/tmp/quant/ERNIE-4.5-300B-A47B-PT-i1-GGUF/imatrix.dat computed on 336 chunks
prepare_imatrix: have 429 importance matrix entries
main: build = 5947 (2be60cbc)
main: built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu
main: quantizing '/mradermacher/tmp/quant/ERNIE-4.5-300B-A47B-PT.gguf' to '/mradermacher/root/ERNIE-4.5-300B-A47B-PT.i1-IQ1_M.gguf' as IQ1_M
llama_model_loader: loaded meta data with 33 key-value pairs and 591 tensors from /mradermacher/tmp/quant/ERNIE-4.5-300B-A47B-PT.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = ernie4_5-moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = ERNIE 4.5 300B A47B PT
llama_model_loader: - kv 3: general.finetune str = PT
llama_model_loader: - kv 4: general.basename str = ERNIE-4.5
llama_model_loader: - kv 5: general.size_label str = 300B-A47B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.tags arr[str,2] = ["ERNIE4.5", "text-generation"]
llama_model_loader: - kv 8: general.languages arr[str,2] = ["en", "zh"]
llama_model_loader: - kv 9: ernie4_5-moe.block_count u32 = 54
llama_model_loader: - kv 10: ernie4_5-moe.context_length u32 = 131072
llama_model_loader: - kv 11: ernie4_5-moe.embedding_length u32 = 8192
llama_model_loader: - kv 12: ernie4_5-moe.feed_forward_length u32 = 28672
llama_model_loader: - kv 13: ernie4_5-moe.attention.head_count u32 = 64
llama_model_loader: - kv 14: ernie4_5-moe.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: ernie4_5-moe.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: ernie4_5-moe.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: general.file_type u32 = 1
llama_model_loader: - kv 18: ernie4_5-moe.expert_count u32 = 64
llama_model_loader: - kv 19: ernie4_5-moe.expert_used_count u32 = 8
llama_model_loader: - kv 20: ernie4_5-moe.interleave_moe_layer_step u32 = 1
llama_model_loader: - kv 21: ernie4_5-moe.leading_dense_block_count u32 = 3
llama_model_loader: - kv 22: ernie4_5-moe.expert_feed_forward_length u32 = 3584
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - kv 24: tokenizer.ggml.model str = llama
llama_model_loader: - kv 25: tokenizer.ggml.pre str = default
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,103424] = ["<unk>", "<s>", "</s>", "0", "1", "2...
llama_model_loader: - kv 27: tokenizer.ggml.scores arr[f32,103424] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,103424] = [2, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 32: tokenizer.chat_template str = {%- if not add_generation_prompt is d...
llama_model_loader: - type f32: 211 tensors
llama_model_loader: - type f16: 380 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
================================ Have weights data with 429 entries
[ 1/ 591] output.weight - [ 8192, 103424, 1, 1], type = f16,
====== llama_model_quantize_impl: did not find weights for output.weight
converting to q5_K .. size = 1616.00 MiB -> 555.50 MiB
[ 2/ 591] output_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 3/ 591] token_embd.weight - [ 8192, 103424, 1, 1], type = f16,
====== llama_model_quantize_impl: did not find weights for token_embd.weight
converting to q2_K .. size = 1616.00 MiB -> 265.12 MiB
[ 4/ 591] blk.0.attn_k.weight - [ 8192, 1024, 1, 1], type = f16, converting to iq1_m .. size = 16.00 MiB -> 1.75 MiB
[ 5/ 591] blk.0.attn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 6/ 591] blk.0.attn_output.weight - [ 8192, 8192, 1, 1], type = f16, converting to iq2_xxs .. size = 128.00 MiB -> 16.50 MiB
[ 7/ 591] blk.0.attn_q.weight - [ 8192, 8192, 1, 1], type = f16, converting to iq1_m .. size = 128.00 MiB -> 14.00 MiB
[ 8/ 591] blk.0.attn_v.weight - [ 8192, 1024, 1, 1], type = f16, converting to q4_K .. size = 16.00 MiB -> 4.50 MiB
[ 9/ 591] blk.0.ffn_down.weight - [28672, 8192, 1, 1], type = f16, converting to q2_K .. size = 448.00 MiB -> 73.50 MiB
[ 10/ 591] blk.0.ffn_gate.weight - [ 8192, 28672, 1, 1], type = f16, converting to iq1_m .. size = 448.00 MiB -> 49.00 MiB
[ 11/ 591] blk.0.ffn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 12/ 591] blk.0.ffn_up.weight - [ 8192, 28672, 1, 1], type = f16, converting to iq1_m .. size = 448.00 MiB -> 49.00 MiB
[ 13/ 591] blk.1.attn_k.weight - [ 8192, 1024, 1, 1], type = f16, converting to iq1_m .. size = 16.00 MiB -> 1.75 MiB
[ 14/ 591] blk.1.attn_norm.weight - [ 8192, 1, 1, 1], type = f32, size = 0.031 MB
[ 15/ 591] blk.1.attn_output.weight - [ 8192, 8192, 1, 1], type = f16, converting to iq2_xxs .. size = 128.00 MiB -> 16.50 MiB
[ 16/ 591] blk.1.attn_q.weight - [ 8192, 8192, 1, 1], type = f16, converting to iq1_m .. /apool/llama.cpp/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
/apool/llama.cpp/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
/apool/llama.cpp/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
/apool/llama.cpp/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
/apool/llama.cpp/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
/apool/llama.cpp/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
/apool/llama.cpp/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
/apool/llama.cpp/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
warning: process 1068134 is already traced by process 1069318
ptrace: Operation not permitted.
No stack.
The program is not being run.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x000075b2da97da90 in qsort@plt () from /apool/llama.cpp/build/bin/libggml-base.so
#0 0x000075b2da97da90 in qsort@plt () from /apool/llama.cpp/build/bin/libggml-base.so
#1 0x000075b2da9c878d in quantize_iq1_m () from /apool/llama.cpp/build/bin/libggml-base.so
#2 0x000075b2da98e87c in ggml_quantize_chunk () from /apool/llama.cpp/build/bin/libggml-base.so
#3 0x000075b2dab59e61 in llama_model_quantize_impl(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, llama_model_quantize_params const*) () from /apool/llama.cpp/build/bin/libllama.so
#4 0x000075b2dab5c369 in llama_model_quantize () from /apool/llama.cpp/build/bin/libllama.so
#5 0x00005575115389a2 in main ()
/apool/llama.cpp/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
/apool/llama.cpp/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
Aborted