New IQ2_KT, IQ3_KT and IQ4_KT, V2 #529
Conversation
tl;dr
Something seems off: perplexity on both of my new tests for this PR529 is much higher than a previous attempt around June 3rd with commits around PR484. I double-checked my logs and commands and confirmed I am using the same imatrix etc., so I am not sure what is going on. I've been compiling for CPU only, fwiw. Details below.

Experiment
Okay, doing a fresh test using this new PR529 on DeepSeek-R1-0528. I made two almost identical quants that differ only in the commit used to quantize/test/benchmark. Quantization was done roughly simultaneously, one on each socket of a dual-socket Intel Xeon 6980P.

Common Recipe
Test Cases
Perplexity

👈 Perplexity Command
#model=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-mix-IQ4_KT-og.gguf
model=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-mix-IQ4_KT-0xCBAC1FED.gguf
numactl -N 0 -m 0 \
./build/bin/llama-perplexity \
--model "$model" \
-ctk f16 \
-mla 3 -fa \
-amb 512 \
-fmoe \
-f wiki.test.raw \
--seed 1337 \
--no-mmap \
--threads 128 \
--numa numactl

llama-sweep-bench

👈 llama-sweep-bench logs
#!/usr/bin/env bash
#model=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-mix-IQ4_KT-0xCBAC1FED.gguf
model=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-mix-IQ4_KT-og.gguf
numactl -N 1 -m 1 \
./build/bin/llama-sweep-bench \
--model "$model" \
-c 8704 \
-ctk f16 \
-mla 3 -fa \
-fmoe \
--no-mmap \
--threads 80 \
--threads-batch 128 \
--numa numactl \
--warmup-batch

Test case models and the commits used to build/run them:
DeepSeek-R1-0528-mix-IQ4_KT-og: b1416bf
DeepSeek-R1-0528-mix-IQ4_KT-0xCBAC1FED: e5a0668
Conclusion
Huh, the perplexity of ~13.2 on both of these seems surprisingly "bad" relative to my earlier first test I had done with a smaller mix.
I did confirm that both of these new test case models give a reasonable answer for my usual test prompt. Not sure what to try next... Possibly try comparing perplexity of a smaller quant on this PR529 vs the older PR484 with the exact same recipe? Maybe the new recipe is actually worse despite being larger? Happy to provide more logs as requested. Thanks! |
Okay, did one more, faster experiment using the same recipe/imatrix for Qwen3-30B-A3B MoE. Something is off between this PR529 and main's implementation of a "pure" quant.
|
Okay, back to basics, as my sanity is wearing thin. I used the Threadripper Pro 24-core with RTX A6000 GPUs to test.

tl;dr
The CUDA implementation of this PR529 seems to give reasonable perplexity. However, compiling CPU-only gives much higher perplexity when testing the same quant.

Experiment
|
PPL = 922 means I have a bug in the CPU implementation. I haven't gotten around to checking yet. |
All good, no rush. Just wanted to re-create the issue on a "known working" system for my own peace of mind, hah. If it is useful for anyone else testing, I'll make this experimental Qwen3-30B-A3B-IQ4_KT-PR529-e5a06688.gguf available from my personal server for a few days. ~15GiB with sha256sum |
The CPU bug is fixed now. I get quite a bit lower PPL using
(didn't want to risk something going wrong in the output tensor or the token embeddings)
|
Aye, that did the trick for qwen3moe:
I'll come back around with some more results soon, thanks! |
But why is your PPL so much higher? |
This was my quant from yesterday "pure"
I'll use your command with my imatrix now and test again.
I'm assuming the higher-bpw output/token_embd accounts for most of the discrepancy.

UPDATE: Results with the IQ4_KT using q8_0 for embedding/output are still higher for me. The discrepancy could be because you use the unsloth imatrix dat. My imatrix dat is older, using only calibration_data_v5_rc.txt. My newer imatrix corpus adds extra data in an attempt to activate more experts, but I never went back and updated my Qwen3-30B-A3Bs with it. I believe both unsloth and bartowski used a bigger corpus for qwen3moe due to issues quantizing at lower BPW with their usual corpus text.
|
Must be the imatrix, then. I used the one from Unsloth, which produced the lowest PPL in my Qwen3 quantization experiments (#359). |
llama-perplexity -m Configurable-Llama-3.1-8B-Instruct_iMat-IQ3_KT_Nv2_embed_q6_0_output&attn_v_iq5ksr4_attn_k_iq4ksr4.gguf -f wiki.test.raw -ngl 150 -b 512 -mg 0 -ts 40,0,0 --no-mmap -fa -c 512
Final estimate: PPL = 8.1431 +/- 0.05213

IQ3_KT's PPL works for me on CUDA. It also infers on both CPU and CUDA.

llama-perplexity -m Configurable-Llama-3.1-8B-Instruct_iMat-IQ3_XXS_embed_q6_0_output&attn_v_iq5ksr4_attn_k_iq4ksr4.gguf -f wiki.test.raw -ngl 150 -b 512 -mg 0 -ts 40,0,0 --no-mmap -fa -c 512
Final estimate: PPL = 8.4642 +/- 0.05423

IQ3_XXS has some serious competition, quant-quality-wise.

Same recipe, but with IQ3_S tensors instead of IQ3_KT/IQ3_XXS:
Final estimate: PPL = 7.9331 +/- 0.05065

Note: this version of Llama 8B gives a PPL of 7.3287 +/- 0.04703 for Q8_0, so very close to the original. |
I saw this and started cooking asap, targeting ~3.5bpw for some recent requests on 🤗. Not releasing anything yet, just experimenting for funzies.
This is about the largest size quant that fits 256GB RAM and ~48+GB VRAM rigs. I'm offloading an additional 7 or 8 exps layers per GPU, as in the command below.

👈 2x GPU offload Perplexity Command
./build/bin/llama-perplexity \
--model "$model" \
-f wiki.test.raw \
--seed 1337 \
-ctk f16 \
-mla 3 -fa \
-fmoe \
-amb 512 \
-ngl 99 \
-ot "blk\.(3|4|5|6|7|8|9|10)\.ffn_.*=CUDA0" \
-ot "blk\.(11|12|13|14|15|16|17|18)\.ffn_.*=CUDA1" \
-ot exps=CPU \
--threads 24
👈 llama-sweep-bench-data and screenshot
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_CUDA_F16=ON
cmake --build ./build --config Release -j $(nproc)
model=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ3_KT.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
--ctx-size 20480 \
-ctk f16 \
-mla 3 -fa \
-fmoe \
-amb 512 \
-ngl 99 \
-ot "blk\.(3|4|5|6|7|8|9|10)\.ffn_.*=CUDA0" \
-ot "blk\.(11|12|13|14|15|16|17|18)\.ffn_.*=CUDA1" \
-ot exps=CPU \
-ub 2048 -b 2048 \
--warmup-batch \
--threads 24

Configurations tested:
16 exps offload, default batches
16 exps offload, 2048 batches
14 exps offload, 4096 batches
14 exps offload, 8192 batches
|
The new trellis generates int8_t values via sum_as_uint8_t[(ka * idx + kb) & 0x3f3f3f3f] - 126. CUDA dequantize works. The AVX2 case Ny > 32 works, and we get 273 t/s for L3-8B. PPL is on par with or even slightly lower than the original QTIP trellis.
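For concreteness, here is a minimal C++ sketch of that value generation. It is not the PR's actual code; ka and kb stand for the two trellis constants, whose concrete values are not spelled out in this excerpt.

#include <cstdint>

// Mask each byte of (ka*idx + kb) to [0, 63], sum the four bytes
// (range 0..252), then subtract 126 so the result lies in [-126, 126].
static inline int8_t trellis_int8(uint32_t idx, uint32_t ka, uint32_t kb) {
    uint32_t v = (ka * idx + kb) & 0x3f3f3f3fu;
    int sum = (v & 0xff) + ((v >> 8) & 0xff) + ((v >> 16) & 0xff) + ((v >> 24) & 0xff);
    return (int8_t)(sum - 126);
}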
We get 13.6 t/s vs 8.4 t/s with the f16 trellis and f32 arithmetic. Still somewhat slower than other quants, but no longer pathetic.
We get very respectable PP-512 = 120 t/s. TG-128 is pathetic at 5.3 t/s, so 20+% slower than the f16 variant.
We are now at 9.4 t/s, up from 6.6 t/s for the f16 trellis.
It seems Apple Silicon cannot quickly add 4 8-bit ints. Or I don't know how to do it - but I didn't find anything in the Metal Shading Language Specification. So, performance is quite a bit worse than the original trellis.
Time to merge this. |
This PR is the combination of #505 and #511, but rebased on current main, and using @louiehelm's alternative multiplier (see comments in #511).
I was curious to see if not having an extra addition per step when generating the trellis sequence would have a performance impact, so I made a proper change rather than just blindly replacing the two constants using sed. On CUDA the performance impact is negligible; on AVX2 we see a 1-2% improvement. With the latest commits I have also adapted IQ3_KT to the integer trellis.
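As a rough illustration of the difference being measured, here is a hedged sketch with placeholder names, not the code in this PR, assuming the change amounts to dropping the additive constant from the generator step:

#include <cstdint>

// Original-style step: one multiply and one add per generated value.
static inline uint32_t trellis_step_with_add(uint32_t x, uint32_t ka, uint32_t kb) {
    return ka * x + kb;
}

// Variant without the extra addition: one multiply per generated value.
static inline uint32_t trellis_step_mul_only(uint32_t x, uint32_t ka) {
    return ka * x;
}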