
Conversation

@xctan (Collaborator) commented Sep 1, 2025

This PR introduces performance optimizations for some RISC-V kernels and expands hardware support by enabling half-precision extensions.

Using the perf profiler, I identified significant performance bottlenecks caused by pipeline stalls. The following 128-bit RVV kernels have been rewritten to resolve these issues (see the sketch after the list):

  • ggml_vec_dot_q4_K_q8_K
  • ggml_vec_dot_q6_K_q8_K
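The underlying fix generalizes beyond these two kernels: keeping several independent accumulator registers in flight hides the latency of back-to-back vector FMAs. Below is a minimal sketch of that pattern for a plain f32 dot product, assuming the RVV v1.0 intrinsics from <riscv_vector.h>; it is not the PR's actual kernel code, and the function name is illustrative:

```c
// Sketch only: two independent FMA chains for an f32 dot product.
// Assumes the RVV v1.0 intrinsics (<riscv_vector.h>), not the PR's code.
#include <riscv_vector.h>
#include <stddef.h>

static float dot_f32_2chain(size_t n, const float *x, const float *y) {
    const size_t vl = __riscv_vsetvlmax_e32m1();
    vfloat32m1_t acc0 = __riscv_vfmv_v_f_f32m1(0.0f, vl);  // chain 0
    vfloat32m1_t acc1 = __riscv_vfmv_v_f_f32m1(0.0f, vl);  // chain 1

    size_t i = 0;
    for (; i + 2*vl <= n; i += 2*vl) {
        vfloat32m1_t x0 = __riscv_vle32_v_f32m1(x + i,      vl);
        vfloat32m1_t y0 = __riscv_vle32_v_f32m1(y + i,      vl);
        vfloat32m1_t x1 = __riscv_vle32_v_f32m1(x + i + vl, vl);
        vfloat32m1_t y1 = __riscv_vle32_v_f32m1(y + i + vl, vl);
        // the two vfmacc results are independent, so an in-order core
        // can issue the second one before the first has completed
        acc0 = __riscv_vfmacc_vv_f32m1(acc0, x0, y0, vl);
        acc1 = __riscv_vfmacc_vv_f32m1(acc1, x1, y1, vl);
    }
    while (i < n) {  // tail with a shorter vl
        size_t tvl = __riscv_vsetvl_e32m1(n - i);
        vfloat32m1_t xt = __riscv_vle32_v_f32m1(x + i, tvl);
        vfloat32m1_t yt = __riscv_vle32_v_f32m1(y + i, tvl);
        // tail-undisturbed (_tu) keeps the accumulator lanes past tvl intact
        acc0 = __riscv_vfmacc_vv_f32m1_tu(acc0, xt, yt, tvl);
        i += tvl;
    }
    // merge the chains and reduce to a scalar
    vfloat32m1_t sum  = __riscv_vfadd_vv_f32m1(acc0, acc1, vl);
    vfloat32m1_t zero = __riscv_vfmv_v_f_f32m1(0.0f, vl);
    return __riscv_vfmv_f_s_f32m1_f32(
        __riscv_vfredusum_vs_f32m1_f32m1(sum, zero, vl));
}
```

The quantized q4_K/q8_K and q6_K/q8_K kernels follow the same idea in spirit, with the dequantization arithmetic interleaved so the core always has independent instructions to issue while an FMA result is still in flight.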

To allow intermediate results to be computed in half-precision floats, this PR enables the zvfh extension and adds implementations for several performance-critical kernels.
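For context, zvfh adds full vector arithmetic on IEEE fp16 elements, whereas zvfhmin only provides conversions to and from fp32. A common pattern for such kernels, sketched here under the RVV v1.0 intrinsics rather than taken from the PR, is to load _Float16 data and accumulate through the widening FMA into f32:

```c
// Sketch: fp16 inputs, fp32 accumulation via the widening FMA (vfwmacc)
// that zvfh enables. Build with e.g. -march=rv64gcv_zvfh.
#include <riscv_vector.h>
#include <stddef.h>

static float dot_f16(size_t n, const _Float16 *x, const _Float16 *y) {
    const size_t vlmax = __riscv_vsetvlmax_e32m2();
    vfloat32m2_t acc = __riscv_vfmv_v_f_f32m2(0.0f, vlmax);

    for (size_t i = 0; i < n; ) {
        size_t vl = __riscv_vsetvl_e16m1(n - i);
        vfloat16m1_t vx = __riscv_vle16_v_f16m1(x + i, vl);
        vfloat16m1_t vy = __riscv_vle16_v_f16m1(y + i, vl);
        // widening multiply-accumulate: f16 * f16 -> f32 accumulator;
        // _tu leaves accumulator lanes past vl undisturbed on the tail
        acc = __riscv_vfwmacc_vv_f32m2_tu(acc, vx, vy, vl);
        i += vl;
    }
    // reduce the f32 accumulator to a scalar
    vfloat32m1_t zero = __riscv_vfmv_v_f_f32m1(0.0f, __riscv_vsetvlmax_e32m1());
    return __riscv_vfmv_f_s_f32m1_f32(
        __riscv_vfredusum_vs_f32m2_f32m1(acc, zero, vlmax));
}
```

Keeping the accumulator in f32 while the data stays in fp16 preserves accuracy over long dot products, which is why the widening form is preferable to accumulating in fp16 directly.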

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Sep 1, 2025
@xctan (Collaborator, Author) commented Sep 1, 2025

Performance

Benchmark model: unsloth/Qwen3-4B-Instruct-2507-GGUF (ModelScope, Hugging Face)

| model | size | params | backend | threads | test | t/s | % | branch |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q4_K - Medium | 2.32 GiB | 4.02 B | CPU | 64 | pp512 | 68.40 ± 0.74 | 183% | PR |
| qwen3 4B Q4_K - Medium | 2.32 GiB | 4.02 B | CPU | 64 | pp512 | 37.30 ± 0.35 | | master |
| qwen3 4B Q4_K - Medium | 2.32 GiB | 4.02 B | CPU | 64 | tg128 | 20.24 ± 2.45 | 177% | PR |
| qwen3 4B Q4_K - Medium | 2.32 GiB | 4.02 B | CPU | 64 | tg128 | 11.41 ± 0.77 | | master |
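Rows like these come from llama-bench; an invocation along the following lines should reproduce them (the exact .gguf filename is assumed; pp512 and tg128 correspond to -p 512 and -n 128):

```
llama-bench -m Qwen3-4B-Instruct-2507-Q4_K_M.gguf -t 64 -p 512 -n 128
```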

Validation

Test model: unsloth/Qwen3-0.6B-GGUF (ModelScope, Hugging Face)

```
llama-perplexity -m Qwen3-0.6B-Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw
```

| branch | perplexity |
| --- | --- |
| master | 22.8862 +/- 0.20017 |
| PR | 22.8811 +/- 0.20010 |

@xctan xctan merged commit 05c0380 into ggml-org:master Sep 3, 2025
48 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Sep 4, 2025
…upport

* origin/master: (72 commits)
metal : Add template specialization for mul_mm_id w/ ne20 == 10 (ggml-org#15799)
llama : set n_outputs to 1 to avoid 0 outputs mean-pooling (ggml-org#15791)
CANN: Refactor ND to NZ workspace to be per-device (ggml-org#15763)
server: add exceed_context_size_error type (ggml-org#15780)
Document the new max GPU layers default in help (ggml-org#15771)
ggml: add ops for WAN video model (cuda && cpu) (ggml-org#15669)
CANN: Fix precision issue on 310I DUO multi-devices (ggml-org#15784)
opencl: add hs=40 to FA (ggml-org#15758)
CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (ggml-org#15760)
vulkan: fix mmv subgroup16 selection (ggml-org#15775)
vulkan: don't use std::string in load_shaders, to improve compile time (ggml-org#15724)
vulkan : update ggml_vk_instance_validation_ext_available (ggml-org#15666)
ggml vulkan: add hardsigmoid and hardswish operations (ggml-org#15762)
CUDA: Optimize `rms_norm_f32` kernel and its fused variants, giving 1-6% perf E2E (ggml-org#15715)
model-conversion : fix pyright errors (ggml-org#15770)
sampling : optimize dist sampler (ggml-org#15704)
llama : fix incorrect model type for Gemma 270M (ggml-org#15764)
model-conversion : remove hardcoded /bin/bash shebangs [no ci] (ggml-org#15765)
CANN: Add RoPE contiguous check for 310I DUP device (ggml-org#15735)
ggml-cpu : optimize RVV kernels (ggml-org#15720)
...
walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request Sep 7, 2025
* ggml-cpu : optimize rvv ggml_vec_dot_f32

* ggml-cpu : optimize 128-bit rvv ggml_vec_dot_q4_K_q8_K

* ggml-cpu : fix riscv arch flags

* ggml-cpu : add more rvv ops

* ggml-cpu : optimize rvv ggml_vec_dot_q4_K_q8_K

* ggml-cpu : optimize rvv ggml_vec_dot_q6_K_q8_K

* ggml-cpu : minor rvv adjustments

* ggml-cpu : fix riscv include