
Quantize bug: Ernie4.5 MoE 300B low-bit quantization crashes #14788

@nicoboss

Description

Name and Version

version: 5947 (2be60cbc)
built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

Ryzen Threadripper PRO 7975WX, 2x RTX 4090, 512 GiB DDR5 RAM

Models

https://huggingface.co/baidu/ERNIE-4.5-300B-A47B-PT, quantized with the imatrix from https://huggingface.co/mradermacher/ERNIE-4.5-300B-A47B-Base-PT-i1-GGUF/blob/main/imatrix.dat

Problem description & steps to reproduce

Try quantizing https://huggingface.co/baidu/ERNIE-4.5-300B-A47B-PT at IQ1_M or IQ1_S, using https://huggingface.co/mradermacher/ERNIE-4.5-300B-A47B-Base-PT-i1-GGUF/blob/main/imatrix.dat as the imatrix, and llama.cpp crashes with a GGML_ASSERT failure in the IQ1 quantizer: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) for IQ1_M (see the log below), and the analogous GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_shift != 0) for IQ1_S. The issue does not occur for higher-bit quants, with or without an imatrix.
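For context, the failing check guards a best-candidate search inside the IQ1 quantizers: the search indices start at -1 sentinels and are only updated when a candidate beats the best score so far. The sketch below is a simplified, hypothetical illustration of that pattern, not the actual ggml-quants.c code; `weights` stands in for the per-element importance values derived from the imatrix. If every importance weight in a block is zero, each candidate score evaluates to 0.0f/0.0f = NaN, no comparison ever succeeds, and the sentinels survive to the assert.

```c
// Simplified, hypothetical illustration of the failure mode -- not the real
// quantize_iq1_m/quantize_iq1_s code. A best-split search scores candidates
// as sumqx*sumqx/sumw; with all-zero weights every score is NaN, every
// "score > best_score" comparison is false, and besti1/besti2 stay -1.
#include <assert.h>
#include <math.h>

static void pick_best_split(const float * x, const float * weights, int n) {
    float best_score = -INFINITY;
    int   besti1 = -1, besti2 = -1;
    for (int i1 = 0; i1 <= n; ++i1) {
        for (int i2 = i1; i2 <= n; ++i2) {
            // candidate quants: -1 below i1, 0 in [i1,i2), +1 from i2 on
            float sumqx = 0.0f, sumw = 0.0f;
            for (int j = 0;  j < i1; ++j) { sumqx -= weights[j]*x[j]; sumw += weights[j]; }
            for (int j = i2; j < n;  ++j) { sumqx += weights[j]*x[j]; sumw += weights[j]; }
            const float score = sumqx*sumqx/sumw;  // 0/0 -> NaN when sumw == 0
            if (score > best_score) {              // always false for NaN
                best_score = score; besti1 = i1; besti2 = i2;
            }
        }
    }
    assert(besti1 >= 0 && besti2 >= 0);            // fires, like the GGML_ASSERT
}
```

Whether zeroed (or NaN) importance weights are actually what happens here for blk.1.attn_q.weight is an open question, but it would be consistent with the crash appearing only for the low-bit IQ1 types, which are the only ones that run this search.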

First Bad Commit

Broken since Ernie4.5 MoE support was introduced by @pwilkin and @CISC in #14658 and #14746. I tested both before and after #9400 was merged, with the same result.

Relevant log output

root@AI:/apool/llama.cpp/build/bin# ./llama-quantize --imatrix /mradermacher/tmp/quant/ERNIE-4.5-300B-A47B-PT-i1-GGUF/imatrix.dat /mradermacher/tmp/quant/ERNIE-4.5-300B-A47B-PT.gguf /mradermacher/root/ERNIE-4.5-300B-A47B-PT.i1-IQ1_M.gguf IQ1_M
gguf_init_from_file_impl: invalid magic characters: '????', expected 'GGUF'
load_imatrix: imatrix file '/mradermacher/tmp/quant/ERNIE-4.5-300B-A47B-PT-i1-GGUF/imatrix.dat' is using old format
load_legacy_imatrix: imatrix dataset='imatrix-training-full-3'
load_legacy_imatrix: loaded 429 importance matrix entries from /mradermacher/tmp/quant/ERNIE-4.5-300B-A47B-PT-i1-GGUF/imatrix.dat computed on 336 chunks
prepare_imatrix: have 429 importance matrix entries
main: build = 5947 (2be60cbc)
main: built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu
main: quantizing '/mradermacher/tmp/quant/ERNIE-4.5-300B-A47B-PT.gguf' to '/mradermacher/root/ERNIE-4.5-300B-A47B-PT.i1-IQ1_M.gguf' as IQ1_M
llama_model_loader: loaded meta data with 33 key-value pairs and 591 tensors from /mradermacher/tmp/quant/ERNIE-4.5-300B-A47B-PT.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = ernie4_5-moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = ERNIE 4.5 300B A47B PT
llama_model_loader: - kv   3:                           general.finetune str              = PT
llama_model_loader: - kv   4:                           general.basename str              = ERNIE-4.5
llama_model_loader: - kv   5:                         general.size_label str              = 300B-A47B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                               general.tags arr[str,2]       = ["ERNIE4.5", "text-generation"]
llama_model_loader: - kv   8:                          general.languages arr[str,2]       = ["en", "zh"]
llama_model_loader: - kv   9:                   ernie4_5-moe.block_count u32              = 54
llama_model_loader: - kv  10:                ernie4_5-moe.context_length u32              = 131072
llama_model_loader: - kv  11:              ernie4_5-moe.embedding_length u32              = 8192
llama_model_loader: - kv  12:           ernie4_5-moe.feed_forward_length u32              = 28672
llama_model_loader: - kv  13:          ernie4_5-moe.attention.head_count u32              = 64
llama_model_loader: - kv  14:       ernie4_5-moe.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                ernie4_5-moe.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16: ernie4_5-moe.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 1
llama_model_loader: - kv  18:                  ernie4_5-moe.expert_count u32              = 64
llama_model_loader: - kv  19:             ernie4_5-moe.expert_used_count u32              = 8
llama_model_loader: - kv  20:     ernie4_5-moe.interleave_moe_layer_step u32              = 1
llama_model_loader: - kv  21:     ernie4_5-moe.leading_dense_block_count u32              = 3
llama_model_loader: - kv  22:    ernie4_5-moe.expert_feed_forward_length u32              = 3584
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,103424]  = ["<unk>", "<s>", "</s>", "0", "1", "2...
llama_model_loader: - kv  27:                      tokenizer.ggml.scores arr[f32,103424]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  28:                  tokenizer.ggml.token_type arr[i32,103424]  = [2, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  30:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if not add_generation_prompt is d...
llama_model_loader: - type  f32:  211 tensors
llama_model_loader: - type  f16:  380 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
================================ Have weights data with 429 entries
[   1/ 591]                        output.weight - [ 8192, 103424,     1,     1], type =    f16, 
====== llama_model_quantize_impl: did not find weights for output.weight
converting to q5_K .. size =  1616.00 MiB ->   555.50 MiB
[   2/ 591]                   output_norm.weight - [ 8192,     1,     1,     1], type =    f32, size =    0.031 MB
[   3/ 591]                    token_embd.weight - [ 8192, 103424,     1,     1], type =    f16, 
====== llama_model_quantize_impl: did not find weights for token_embd.weight
converting to q2_K .. size =  1616.00 MiB ->   265.12 MiB
[   4/ 591]                  blk.0.attn_k.weight - [ 8192,  1024,     1,     1], type =    f16, converting to iq1_m .. size =    16.00 MiB ->     1.75 MiB
[   5/ 591]               blk.0.attn_norm.weight - [ 8192,     1,     1,     1], type =    f32, size =    0.031 MB
[   6/ 591]             blk.0.attn_output.weight - [ 8192,  8192,     1,     1], type =    f16, converting to iq2_xxs .. size =   128.00 MiB ->    16.50 MiB
[   7/ 591]                  blk.0.attn_q.weight - [ 8192,  8192,     1,     1], type =    f16, converting to iq1_m .. size =   128.00 MiB ->    14.00 MiB
[   8/ 591]                  blk.0.attn_v.weight - [ 8192,  1024,     1,     1], type =    f16, converting to q4_K .. size =    16.00 MiB ->     4.50 MiB
[   9/ 591]                blk.0.ffn_down.weight - [28672,  8192,     1,     1], type =    f16, converting to q2_K .. size =   448.00 MiB ->    73.50 MiB
[  10/ 591]                blk.0.ffn_gate.weight - [ 8192, 28672,     1,     1], type =    f16, converting to iq1_m .. size =   448.00 MiB ->    49.00 MiB
[  11/ 591]                blk.0.ffn_norm.weight - [ 8192,     1,     1,     1], type =    f32, size =    0.031 MB
[  12/ 591]                  blk.0.ffn_up.weight - [ 8192, 28672,     1,     1], type =    f16, converting to iq1_m .. size =   448.00 MiB ->    49.00 MiB
[  13/ 591]                  blk.1.attn_k.weight - [ 8192,  1024,     1,     1], type =    f16, converting to iq1_m .. size =    16.00 MiB ->     1.75 MiB
[  14/ 591]               blk.1.attn_norm.weight - [ 8192,     1,     1,     1], type =    f32, size =    0.031 MB
[  15/ 591]             blk.1.attn_output.weight - [ 8192,  8192,     1,     1], type =    f16, converting to iq2_xxs .. size =   128.00 MiB ->    16.50 MiB
[  16/ 591]                  blk.1.attn_q.weight - [ 8192,  8192,     1,     1], type =    f16, converting to iq1_m .. /apool/llama.cpp/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
/apool/llama.cpp/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
/apool/llama.cpp/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
/apool/llama.cpp/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
/apool/llama.cpp/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
/apool/llama.cpp/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
/apool/llama.cpp/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
/apool/llama.cpp/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
warning: process 1068134 is already traced by process 1069318
ptrace: Operation not permitted.
No stack.
The program is not being run.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x000075b2da97da90 in qsort@plt () from /apool/llama.cpp/build/bin/libggml-base.so
#0  0x000075b2da97da90 in qsort@plt () from /apool/llama.cpp/build/bin/libggml-base.so
#1  0x000075b2da9c878d in quantize_iq1_m () from /apool/llama.cpp/build/bin/libggml-base.so
#2  0x000075b2da98e87c in ggml_quantize_chunk () from /apool/llama.cpp/build/bin/libggml-base.so
#3  0x000075b2dab59e61 in llama_model_quantize_impl(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, llama_model_quantize_params const*) () from /apool/llama.cpp/build/bin/libllama.so
#4  0x000075b2dab5c369 in llama_model_quantize () from /apool/llama.cpp/build/bin/libllama.so
#5  0x00005575115389a2 in main ()
/apool/llama.cpp/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
/apool/llama.cpp/ggml/src/ggml-quants.c:4445: GGML_ASSERT(besti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
Aborted
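If degenerate importance weights are indeed the trigger, the assert should reproduce without the 300B model. Below is a hedged standalone sketch against ggml's public API (ggml_quantize_chunk, ggml_row_size); the all-zero imatrix row is an assumption about the trigger, not something the log confirms:

```c
// repro_iq1m.c -- hypothetical minimal reproduction sketch. Quantizes one
// synthetic row to IQ1_M with an all-zero imatrix row; if the zero-weight
// hypothesis holds, this trips the same GGML_ASSERT in quantize_iq1_m.
// Build (paths are illustrative):
//   cc repro_iq1m.c -Iggml/include -Lbuild/bin -lggml-base -lm -o repro_iq1m
#include "ggml.h"
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int64_t n_per_row = 8192; // row size of the failing blk.1.attn_q.weight
    const int64_t nrows     = 1;

    float * src = malloc(sizeof(float) * n_per_row);
    for (int64_t i = 0; i < n_per_row; ++i) {
        src[i] = (float)(i % 7) - 3.0f; // arbitrary nonzero model data
    }

    // All-zero importance weights for this row.
    float * imatrix = calloc(n_per_row, sizeof(float));

    void * dst = malloc(ggml_row_size(GGML_TYPE_IQ1_M, n_per_row) * nrows);

    // Expected to abort inside quantize_iq1_m if the hypothesis is right.
    ggml_quantize_chunk(GGML_TYPE_IQ1_M, src, dst, /*start=*/0, nrows, n_per_row, imatrix);

    printf("no assert: quantization succeeded\n");
    free(src); free(imatrix); free(dst);
    return 0;
}
```

Dumping the imatrix entry for blk.1.attn_q.weight and checking it for zeros or NaNs would confirm or rule this out.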
