
Conversation

@ikawrakow ikawrakow commented Mar 6, 2025

For DeepSeekV3/R1 it is handy to be able to define custom rules for picking quantization types for the various tensors. This is useful in general, but particularly so for very large models, where one wants to squeeze the last bit of quantized-model quality out of the smallest possible model size.

This PR adds this ability. Using

./bin/llama-quantize --imatrix some_imatrix --custom-q "regex1=type1,regex2=type2,..." some_model some_output_file some_base_quant

one can pass custom rules to the quantization function. The rules are comma-separated (but one can also use multiple --custom-q arguments). The custom rules are processed in order and the first match is taken. So, for instance, if I use

--custom-q "\.ffn_down_exps\.weight=iq4_nl,\.ffn_.*_exps\.weight=iq1_s_r4"

the second rule also matches the ffn_down experts, but because a match was already found in the first rule, IQ4_NL gets used for blk.*.ffn_down_exps.weight, and IQ1_S_R4 gets used for the ffn_up and ffn_gate expert tensors.
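
Since rules can also be given as separate arguments, the same selection could equivalently be written as --custom-q "\.ffn_down_exps\.weight=iq4_nl" --custom-q "\.ffn_.*_exps\.weight=iq1_s_r4".

For illustration, here is a minimal sketch of this ordered, first-match rule selection. It is not the actual llama-quantize implementation: the names CustomQRule, parse_custom_q and match_custom_q are made up for this example, and the parser splits naively on commas.

```cpp
// Sketch only, not the actual implementation: parse "regex1=type1,regex2=type2,..."
// into ordered rules and return the type of the first rule matching a tensor name.
#include <cstdio>
#include <regex>
#include <string>
#include <vector>

struct CustomQRule {
    std::regex  pattern;    // compiled from the part before '='
    std::string type_name;  // e.g. "iq4_nl" or "iq1_s_r4"
};

// Split the --custom-q argument on commas, preserving rule order.
static std::vector<CustomQRule> parse_custom_q(const std::string & arg) {
    std::vector<CustomQRule> rules;
    size_t start = 0;
    while (start < arg.size()) {
        size_t end = arg.find(',', start);
        if (end == std::string::npos) end = arg.size();
        const std::string item = arg.substr(start, end - start);
        const size_t eq = item.find('=');
        if (eq != std::string::npos) {
            rules.push_back({std::regex(item.substr(0, eq)), item.substr(eq + 1)});
        }
        start = end + 1;
    }
    return rules;
}

// First match wins: rules are tried in the order they were given.
static std::string match_custom_q(const std::vector<CustomQRule> & rules, const std::string & tensor_name) {
    for (const auto & rule : rules) {
        if (std::regex_search(tensor_name, rule.pattern)) {
            return rule.type_name;
        }
    }
    return "";  // no custom rule applies
}

int main() {
    const auto rules = parse_custom_q("\\.ffn_down_exps\\.weight=iq4_nl,\\.ffn_.*_exps\\.weight=iq1_s_r4");
    std::printf("%s\n", match_custom_q(rules, "blk.3.ffn_down_exps.weight").c_str()); // iq4_nl (first rule wins)
    std::printf("%s\n", match_custom_q(rules, "blk.3.ffn_up_exps.weight").c_str());   // iq1_s_r4
}
```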

To summarize how the quantization type is determined (a sketch of this selection order follows the list):

  1. The type is set to the quantization type specified as the last argument on the command line.
  2. If a type was specified via one of --attn-q-type, --attn-k-type, --attn-v-type, --attn-qkv-type, --attn-output-type, --ffn-gate-type, --ffn-down-type, --ffn-up-type, and the tensor belongs to that class, the type specified that way gets used (for now).
  3. Else, the built-in rules get applied.
  4. If custom rules were provided and the tensor name matches one of their regular expressions, the type specified in the first matching rule becomes the selected quantization type for the tensor, irrespective of what might have happened in steps 1-3.
  5. If the tensor row size is not a multiple of the block size of the type selected in steps 1-4, the type is overridden with a built-in rule that maps quants with block sizes > 32 to one of the quants with block size 32.
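
The sketch below walks through this selection order. Again, it is not the actual llama-quantize code: pick_type, Choice, the block-size table, and the step-5 fallback to IQ4_NL are hypothetical stand-ins for the built-in logic.

```cpp
// Illustrative sketch of the selection order above; all names and helpers are
// hypothetical stand-ins, not the actual llama-quantize code.
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

// Hypothetical block sizes for a few types, just for this demo.
static int block_size(const std::string & type) {
    static const std::map<std::string, int> sizes = {
        {"iq1_s_r4", 32}, {"iq4_nl", 32}, {"q4_k", 256}, {"q6_k", 256},
    };
    const auto it = sizes.find(type);
    return it == sizes.end() ? 32 : it->second;
}

struct Choice {
    bool        set;   // was a type specified by this mechanism?
    std::string type;
};

static std::string pick_type(const std::string & base,      // step 1: last command-line argument
                             const Choice & class_override, // step 2: --attn-*-type / --ffn-*-type flags
                             const Choice & builtin,        // step 3: built-in rule for this tensor
                             const Choice & custom,         // step 4: first matching --custom-q rule
                             int64_t row_size) {
    std::string type = base;
    if (class_override.set) {
        type = class_override.type;   // step 2 takes precedence over the built-in rules
    } else if (builtin.set) {
        type = builtin.type;          // step 3: otherwise apply the built-in rules
    }
    if (custom.set) {
        type = custom.type;           // step 4: a --custom-q match overrides steps 1-3
    }
    // Step 5: row size incompatible with the chosen block size -> fall back to a
    // block-size-32 quant (iq4_nl here is only a placeholder for the built-in mapping).
    if (block_size(type) > 32 && row_size % block_size(type) != 0) {
        type = "iq4_nl";
    }
    return type;
}

int main() {
    const Choice none{false, ""}, custom{true, "q4_k"};
    // 7168 is a multiple of 256, so the custom choice stands.
    std::printf("%s\n", pick_type("iq1_s_r4", none, none, custom, 7168).c_str()); // q4_k
    // 7200 is not a multiple of 256, so step 5 falls back to a block-size-32 type.
    std::printf("%s\n", pick_type("iq1_s_r4", none, none, custom, 7200).c_str()); // iq4_nl
}
```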

@davidsyoung

This is awesome. It’ll come in really useful!
