
Bug: Streaming blocks for 5-8 s every couple of rows #464

@ciprianveg

Description


What happened?

Although I obtained good sweep-bench results for the 235B UD-Q5_K_XL quant, as shown below (and with the Q4 quant they were about 20% faster), in both cases this annoying blocking happens every couple of rows of output. I tried changing from 16 threads to 12, but the same thing happens. With mainline llama.cpp it is about 25% slower, but streaming is smooth.
My system is a Threadripper Pro 3955WX with 16 cores, 256 GB DDR4-3200, and 2x RTX 3090.
Any ideas?
```bash
./build/bin/llama-sweep-bench --model /home/ciprian/ai/models/Qwen3-235B-UD_Q5_XL/Qwen3-235B-A22B-UD-Q5_K_XL-00001-of-00004.gguf --alias Qwen3-235B-A22B-UD-Q5_K_XL -fa -fmoe -ctk q8_0 -ctv q8_0 -c 40960 --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 --presence-penalty 0.5 -ot "blk.[0-9].ffn_up_exps=CUDA0,blk.[0-9].ffn_gate_exps=CUDA0,blk.2[0-4].ffn_up_exps=CUDA0,blk.2[0-4].ffn_gate_exps=CUDA0,blk.1[0-9].ffn_up_exps=CUDA1,blk.1[0-9].ffn_gate_exps=CUDA1,blk.2[5-8].ffn_up_exps=CUDA1,blk.2[5-8].ffn_gate_exps=CUDA1,exps=CPU" -ngl 99 --threads 16 --host 0.0.0.0 --port 5002 --ubatch-size 4096 --batch-size 4096
```
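For readability, this is my reading of the `-ot` override list above (the same regexes, just split one rule per line; the final CPU rule catches every expert tensor not matched earlier, e.g. the ffn_down experts):

```bash
# Tensor-override layout used in the command above (unchanged, just wrapped):
#   layers 0-9 and 20-24: ffn_up/ffn_gate experts -> CUDA0
#   layers 10-19 and 25-28: ffn_up/ffn_gate experts -> CUDA1
#   all remaining expert tensors -> CPU
-ot "blk.[0-9].ffn_up_exps=CUDA0,blk.[0-9].ffn_gate_exps=CUDA0,\
blk.2[0-4].ffn_up_exps=CUDA0,blk.2[0-4].ffn_gate_exps=CUDA0,\
blk.1[0-9].ffn_up_exps=CUDA1,blk.1[0-9].ffn_gate_exps=CUDA1,\
blk.2[5-8].ffn_up_exps=CUDA1,blk.2[5-8].ffn_gate_exps=CUDA1,\
exps=CPU"
```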

|   PP |   TG |  N_KV |  T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|------|------|-------|---------|----------|---------|----------|
| 4096 | 1024 |     0 |  11.730 |   349.19 | 133.500 |     7.67 |
| 4096 | 1024 |  4096 |  12.079 |   339.11 | 136.944 |     7.48 |
| 4096 | 1024 |  8192 |  12.514 |   327.33 | 140.286 |     7.30 |
| 4096 | 1024 | 12288 |  13.038 |   314.17 | 144.478 |     7.09 |
| 4096 | 1024 | 16384 |  13.545 |   302.40 | 148.595 |     6.89 |
| 4096 | 1024 | 20480 |  13.943 |   293.76 | 151.881 |     6.74 |
| 4096 | 1024 | 24576 |  14.767 |   277.38 | 154.643 |     6.62 |
| 4096 | 1024 | 28672 |  15.621 |   262.21 | 158.355 |     6.47 |
| 4096 | 1024 | 32768 |  16.561 |   247.32 | 161.875 |     6.33 |
| 4096 | 1024 | 36864 |  17.658 |   231.97 | 166.160 |     6.16 |
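For reference, the S_PP and S_TG columns are just PP/T_PP and TG/T_TG; a quick check of the first row (assuming `bc` is available):

```bash
# Derive the reported throughput from the raw times in row 1 of the table:
echo "scale=2; 4096 / 11.730" | bc    # ~349.19 t/s prompt processing
echo "scale=2; 1024 / 133.500" | bc   # ~7.67 t/s token generation
```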

Name and Version

```bash
llama-server --model /home/ciprian/ai/models/Qwen3-235B-UD_Q5_XL/Qwen3-235B-A22B-UD-Q5_K_XL-00001-of-00004.gguf --alias Qwen3-235B-A22B-UD-Q5_K_XL -fa -fmoe -ctk q8_0 -ctv q8_0 -c 40960 --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 --presence-penalty 0.5 -ot "blk.[0-9].ffn_up_exps=CUDA0,blk.[0-9].ffn_gate_exps=CUDA0,blk.2[0-4].ffn_up_exps=CUDA0,blk.2[0-4].ffn_gate_exps=CUDA0,blk.1[0-9].ffn_up_exps=CUDA1,blk.1[0-9].ffn_gate_exps=CUDA1,blk.2[5-8].ffn_up_exps=CUDA1,blk.2[5-8].ffn_gate_exps=CUDA1,exps=CPU" -ngl 99 --threads 16 --host 0.0.0.0 --port 5002 --ubatch-size 4096 --batch-size 4096
```

What operating system are you seeing the problem on?

No response

Relevant log output
