vulkan: mul_mat_id coopmat2 optimizations #15546

jeffbolznv · 2025-08-24T18:12:54Z

Add a path for when the tile fits in BN/2, similar to what we have for mul_mat.
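As a rough illustration of the idea (not the shader or host code from this PR), the sketch below picks a narrower tile when all columns fit in half of `BN`. It assumes `BN` is the tile width along the N (column) dimension; the function names `pick_tile_n` and `utilization`, and the value `BN = 128`, are hypothetical and chosen only for the example.

```python
def pick_tile_n(n_cols: int, bn: int) -> int:
    """Return the tile width to use for n_cols columns.

    Assumption: a half-width tile (bn // 2) is used whenever all
    columns fit in it; otherwise the full-width tile is used.
    """
    if n_cols <= 0:
        raise ValueError("n_cols must be positive")
    half = bn // 2
    return half if n_cols <= half else bn


def utilization(n_cols: int, tile_n: int) -> float:
    """Fraction of computed tile columns that hold real data."""
    n_tiles = -(-n_cols // tile_n)  # ceiling division
    return n_cols / (n_tiles * tile_n)


if __name__ == "__main__":
    BN = 128  # hypothetical tile width, for illustration only
    for n in (16, 64, 65, 128):
        t = pick_tile_n(n, BN)
        print(n, t, f"{utilization(n, t):.2f}")
```

With few columns per expert, the narrower tile wastes less work on padding columns, which is the likely source of the gain for mul_mat_id workloads.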

Only call fetch_scales/store_scales once per QUANT_K block, and once at the beginning in case start_k is not aligned.
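The scale-fetching change can be pictured with a small Python model of the K loop. This is a sketch under stated assumptions, not the shader code: `block_k` (the step of the K loop) and the function name `scale_fetch_points` are hypothetical, and the real `fetch_scales`/`store_scales` calls are represented only by recording the `k` values at which they would run.

```python
def scale_fetch_points(start_k: int, end_k: int, block_k: int, quant_k: int) -> list:
    """Return the k offsets at which scales would be (re)fetched.

    Scales are fetched once for each QUANT_K block the loop touches.
    Tracking the block index also covers the first iteration when
    start_k does not sit on a QUANT_K boundary.
    """
    fetches = []
    last_block = None
    for k in range(start_k, end_k, block_k):
        block = k // quant_k  # index of the QUANT_K block containing k
        if block != last_block:
            fetches.append(k)  # stand-in for fetch_scales/store_scales
            last_block = block
    return fetches


if __name__ == "__main__":
    # Illustrative values: unaligned start, 22 loop iterations in total.
    points = scale_fetch_points(start_k=64, end_k=768, block_k=32, quant_k=256)
    print(points, "fetches:", len(points), "iterations:", (768 - 64) // 32)
```

In this example the loop runs 22 iterations but fetches scales only three times, instead of once per iteration.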

5090 before

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 0 -p 512 -r 100 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |      6293.79 ± 64.66 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           pp512 |      8554.05 ± 90.98 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |     7622.73 ± 169.20 |

5090 after

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 0 -p 512 -r 100 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |      6565.40 ± 66.68 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           pp512 |      9846.74 ± 89.70 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |     8160.11 ± 162.47 |

4070 before

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 0 -p 512 -r 100 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |      2250.45 ± 21.68 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           pp512 |      2721.79 ± 18.53 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |      2991.03 ± 29.67 |

4070 after

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 0 -p 512 -r 100 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |      2402.63 ± 20.97 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           pp512 |      3107.93 ± 14.71 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |      3199.04 ± 28.88 |
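
The relative gains implied by the four tables above can be computed with a short script. The throughput values are copied from the tables; the helper name `pct_change` is hypothetical, and the run-to-run spread (the ± values) is ignored.

```python
# pp512 throughput in t/s as (before, after), copied from the tables above.
results = {
    "RTX 5090": {
        "qwen3moe 30B.A3B Q2_K": (6293.79, 6565.40),
        "deepseek2 16B Q4_K": (8554.05, 9846.74),
        "gpt-oss 20B MXFP4": (7622.73, 8160.11),
    },
    "RTX 4070": {
        "qwen3moe 30B.A3B Q2_K": (2250.45, 2402.63),
        "deepseek2 16B Q4_K": (2721.79, 3107.93),
        "gpt-oss 20B MXFP4": (2991.03, 3199.04),
    },
}


def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    for gpu, models in results.items():
        for model, (before, after) in models.items():
            print(f"{gpu:9s} {model:24s} {pct_change(before, after):+.1f}%")
```

On these numbers the pp512 gains range from about 4% (qwen3moe on the 5090) to about 15% (deepseek2 on the 5090).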

0cc4m

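| gpu_info                | backends | model_type                     | model_size |  test | avg_ts(master) | avg_ts(pr) |      % |
| ----------------------- | -------- | ------------------------------ | ---------: | ----: | -------------: | ---------: | -----: |
| NVIDIA GeForce RTX 3090 | Vulkan   | gpt-oss 20B Q8_0               |  11.27 GiB | pp512 |        3578.28 |    3842.33 |  +7.4% |
| NVIDIA GeForce RTX 3090 | Vulkan   | gpt-oss 20B Q8_0               |  11.27 GiB | tg128 |         148.56 |     147.85 |  -0.5% |
| NVIDIA GeForce RTX 3090 | Vulkan   | qwen3moe 30B.A3B Q2_K - Medium |  10.48 GiB | pp512 |        2493.03 |    2805.21 | +12.5% |
| NVIDIA GeForce RTX 3090 | Vulkan   | qwen3moe 30B.A3B Q2_K - Medium |  10.48 GiB | tg128 |         141.80 |     140.94 |  -0.6% |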

Looks good.

jeffbolznv requested a review from 0cc4m as a code owner August 24, 2025 18:12

github-actions bot added the Vulkan and ggml labels Aug 24, 2025

vulkan: mul_mat_id coopmat2 optimizations

73e07ad

Add a path for when the tile fits in BN/2, similar to what we have for mul_mat. Only call fetch_scales/store_scales once per QUANT_K block, and once at the beginning in case start_k is not aligned.

jeffbolznv force-pushed the mmid_bnover2 branch from b753596 to 73e07ad August 29, 2025 18:30

Also add a path for BN/4 - worth a couple more percent

e5f97df

0cc4m approved these changes Aug 31, 2025

0cc4m merged commit c37052a into ggml-org:master Aug 31, 2025
48 checks passed

jeffbolznv added a commit to jeffbolznv/llama.cpp that referenced this pull request Aug 31, 2025

vulkan: add missing clamps in new mul_mat_id paths

c4ec430

This is a missing interaction between ggml-org#15546 and ggml-org#15652

jeffbolznv mentioned this pull request Aug 31, 2025

vulkan: add missing clamps in new mul_mat_id paths #15702

Merged

0cc4m pushed a commit that referenced this pull request Sep 1, 2025

vulkan: add missing clamps in new mul_mat_id paths (#15702)

35a42ed

This is a missing interaction between #15546 and #15652

walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request Sep 7, 2025

vulkan: add missing clamps in new mul_mat_id paths (ggml-org#15702)

c18fc47

This is a missing interaction between ggml-org#15546 and ggml-org#15652
