
Bug: Streaming blocks for 5-8 s every couple of rows #464

@ciprianveg

Description


What happened?

Although I obtained good sweep-bench results for the 235B UD-Q5_K_XL quant, as shown below (and with the Q4 quant they were about 20% faster), in both cases this annoying blocking happens every couple of rows of output. I tried changing from 16 threads to 12, but the same thing happens. With mainline llama.cpp it is about 25% slower, but streaming is smooth.
My system is a Threadripper Pro 3955WX with 16 cores, 256 GB DDR4-3200, and 2x RTX 3090.
Any ideas?
```bash
./build/bin/llama-sweep-bench --model /home/ciprian/ai/models/Qwen3-235B-UD_Q5_XL/Qwen3-235B-A22B-UD-Q5_K_XL-00001-of-00004.gguf --alias Qwen3-235B-A22B-UD-Q5_K_XL -fa -fmoe -ctk q8_0 -ctv q8_0 -c 40960 --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 --presence-penalty 0.5 -ot "blk.[0-9].ffn_up_exps=CUDA0,blk.[0-9].ffn_gate_exps=CUDA0,blk.2[0-4].ffn_up_exps=CUDA0,blk.2[0-4].ffn_gate_exps=CUDA0,blk.1[0-9].ffn_up_exps=CUDA1,blk.1[0-9].ffn_gate_exps=CUDA1,blk.2[5-8].ffn_up_exps=CUDA1,blk.2[5-8].ffn_gate_exps=CUDA1,exps=CPU" -ngl 99 --threads 16 --host 0.0.0.0 --port 5002 --ubatch-size 4096 --batch-size 4096
```
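For readability, this is my reading of the `-ot` override list above (the same regexes, just split one rule per line; the final CPU rule catches every expert tensor not matched earlier, e.g. the ffn_down experts):

```bash
# Tensor-override layout used in the command above (unchanged, just wrapped):
#   layers 0-9 and 20-24: ffn_up/ffn_gate experts -> CUDA0
#   layers 10-19 and 25-28: ffn_up/ffn_gate experts -> CUDA1
#   all remaining expert tensors -> CPU
-ot "blk.[0-9].ffn_up_exps=CUDA0,blk.[0-9].ffn_gate_exps=CUDA0,\
blk.2[0-4].ffn_up_exps=CUDA0,blk.2[0-4].ffn_gate_exps=CUDA0,\
blk.1[0-9].ffn_up_exps=CUDA1,blk.1[0-9].ffn_gate_exps=CUDA1,\
blk.2[5-8].ffn_up_exps=CUDA1,blk.2[5-8].ffn_gate_exps=CUDA1,\
exps=CPU"
```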

|   PP |   TG |  N_KV |  T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|------|------|-------|---------|----------|---------|----------|
| 4096 | 1024 |     0 |  11.730 |   349.19 | 133.500 |     7.67 |
| 4096 | 1024 |  4096 |  12.079 |   339.11 | 136.944 |     7.48 |
| 4096 | 1024 |  8192 |  12.514 |   327.33 | 140.286 |     7.30 |
| 4096 | 1024 | 12288 |  13.038 |   314.17 | 144.478 |     7.09 |
| 4096 | 1024 | 16384 |  13.545 |   302.40 | 148.595 |     6.89 |
| 4096 | 1024 | 20480 |  13.943 |   293.76 | 151.881 |     6.74 |
| 4096 | 1024 | 24576 |  14.767 |   277.38 | 154.643 |     6.62 |
| 4096 | 1024 | 28672 |  15.621 |   262.21 | 158.355 |     6.47 |
| 4096 | 1024 | 32768 |  16.561 |   247.32 | 161.875 |     6.33 |
| 4096 | 1024 | 36864 |  17.658 |   231.97 | 166.160 |     6.16 |
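For reference, the S_PP and S_TG columns are just PP/T_PP and TG/T_TG; a quick check of the first row (assuming `bc` is available):

```bash
# Derive the reported throughput from the raw times in row 1 of the table:
echo "scale=2; 4096 / 11.730" | bc    # ~349.19 t/s prompt processing
echo "scale=2; 1024 / 133.500" | bc   # ~7.67 t/s token generation
```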

Name and Version

```bash
llama-server --model /home/ciprian/ai/models/Qwen3-235B-UD_Q5_XL/Qwen3-235B-A22B-UD-Q5_K_XL-00001-of-00004.gguf --alias Qwen3-235B-A22B-UD-Q5_K_XL -fa -fmoe -ctk q8_0 -ctv q8_0 -c 40960 --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 --presence-penalty 0.5 -ot "blk.[0-9].ffn_up_exps=CUDA0,blk.[0-9].ffn_gate_exps=CUDA0,blk.2[0-4].ffn_up_exps=CUDA0,blk.2[0-4].ffn_gate_exps=CUDA0,blk.1[0-9].ffn_up_exps=CUDA1,blk.1[0-9].ffn_gate_exps=CUDA1,blk.2[5-8].ffn_up_exps=CUDA1,blk.2[5-8].ffn_gate_exps=CUDA1,exps=CPU" -ngl 99 --threads 16 --host 0.0.0.0 --port 5002 --ubatch-size 4096 --batch-size 4096
```

What operating system are you seeing the problem on?

No response

Relevant log output
