Description
What happened?
Although I obtained good sweep-bench results for the 235B UD-Q5_XL quant, shown below (and the Q4 quant was about 20% faster), in both cases an annoying stall happens every couple of lines of output. I tried dropping from 16 threads to 12, but the same thing happens. With mainline llama.cpp it is roughly 25% slower, but generation is smooth.
My system is a Threadripper PRO 3955WX with 16 cores, 256 GB DDR4-3200, and 2x RTX 3090.
Any ideas?
```
./build/bin/llama-sweep-bench \
  --model /home/ciprian/ai/models/Qwen3-235B-UD_Q5_XL/Qwen3-235B-A22B-UD-Q5_K_XL-00001-of-00004.gguf \
  --alias Qwen3-235B-A22B-UD-Q5_K_XL \
  -fa -fmoe -ctk q8_0 -ctv q8_0 -c 40960 \
  --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 --presence-penalty 0.5 \
  -ot "blk.[0-9].ffn_up_exps=CUDA0,blk.[0-9].ffn_gate_exps=CUDA0,blk.2[0-4].ffn_up_exps=CUDA0,blk.2[0-4].ffn_gate_exps=CUDA0,blk.1[0-9].ffn_up_exps=CUDA1,blk.1[0-9].ffn_gate_exps=CUDA1,blk.2[5-8].ffn_up_exps=CUDA1,blk.2[5-8].ffn_gate_exps=CUDA1,exps=CPU" \
  -ngl 99 --threads 16 --host 0.0.0.0 --port 5002 \
  --ubatch-size 4096 --batch-size 4096
```
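For clarity, here is a quick sanity check of what I expect those `-ot` overrides to capture. This is just a sketch: the loop synthesizes Qwen3-style tensor names (94 layers assumed for Qwen3-235B-A22B) and counts how many each device's regex matches:

```sh
# Hedged sketch: count which ffn_{up,gate}_exps tensors each -ot regex captures.
# Assumes expert tensor names of the form blk.<N>.ffn_*_exps.weight, 94 layers.
names=$(for i in $(seq 0 93); do
          for t in ffn_up_exps ffn_gate_exps ffn_down_exps; do
            echo "blk.$i.$t.weight"
          done
        done)
echo "$names" | grep -cE 'blk\.([0-9]|2[0-4])\.ffn_(up|gate)_exps'   # CUDA0: 30 tensors (layers 0-9, 20-24)
echo "$names" | grep -cE 'blk\.(1[0-9]|2[5-8])\.ffn_(up|gate)_exps'  # CUDA1: 28 tensors (layers 10-19, 25-28)
```

Everything else that matches `exps` (including all `ffn_down_exps`) falls through to the final `exps=CPU` rule, which is the intent.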
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 11.730 | 349.19 | 133.500 | 7.67 |
| 4096 | 1024 | 4096 | 12.079 | 339.11 | 136.944 | 7.48 |
| 4096 | 1024 | 8192 | 12.514 | 327.33 | 140.286 | 7.30 |
| 4096 | 1024 | 12288 | 13.038 | 314.17 | 144.478 | 7.09 |
| 4096 | 1024 | 16384 | 13.545 | 302.40 | 148.595 | 6.89 |
| 4096 | 1024 | 20480 | 13.943 | 293.76 | 151.881 | 6.74 |
| 4096 | 1024 | 24576 | 14.767 | 277.38 | 154.643 | 6.62 |
| 4096 | 1024 | 28672 | 15.621 | 262.21 | 158.355 | 6.47 |
| 4096 | 1024 | 32768 | 16.561 | 247.32 | 161.875 | 6.33 |
| 4096 | 1024 | 36864 | 17.658 | 231.97 | 166.160 | 6.16 |
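If it helps with diagnosis, this is how I plan to capture the stall pattern while the server is generating. These are standard tools, nothing ik_llama.cpp-specific, and the 1-second sampling interval is an arbitrary choice:

```sh
# Watch per-GPU and per-core utilization side by side during generation;
# the periodic stalls should show up as simultaneous dips.
nvidia-smi dmon -s u -d 1 &    # GPU sm/mem utilization every second
mpstat -P ALL 1                # per-core CPU utilization (sysstat package)
```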
Name and Version
```
llama-server \
  --model /home/ciprian/ai/models/Qwen3-235B-UD_Q5_XL/Qwen3-235B-A22B-UD-Q5_K_XL-00001-of-00004.gguf \
  --alias Qwen3-235B-A22B-UD-Q5_K_XL \
  -fa -fmoe -ctk q8_0 -ctv q8_0 -c 40960 \
  --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 --presence-penalty 0.5 \
  -ot "blk.[0-9].ffn_up_exps=CUDA0,blk.[0-9].ffn_gate_exps=CUDA0,blk.2[0-4].ffn_up_exps=CUDA0,blk.2[0-4].ffn_gate_exps=CUDA0,blk.1[0-9].ffn_up_exps=CUDA1,blk.1[0-9].ffn_gate_exps=CUDA1,blk.2[5-8].ffn_up_exps=CUDA1,blk.2[5-8].ffn_gate_exps=CUDA1,exps=CPU" \
  -ngl 99 --threads 16 --host 0.0.0.0 --port 5002 \
  --ubatch-size 4096 --batch-size 4096
```
What operating system are you seeing the problem on?
No response