WIP: llama: Vulkan: Fix Adreno Q8_0 issues. #11

infinitalo · 2025-08-29T13:37:13Z

This MR is a work-in-progress.

The current commits get Q8_0 inference working on Adreno 830 (Samsung S25), but finetuning still crashes.

We're currently working on a fix for LoRA finetuning on Adreno 830; in the meantime, you can use this branch for testing.

infinitalo · 2025-09-01T15:12:45Z

Steps to run the backend-ops test suite:

1. Set up your Android environment for testing llama.cpp. If you haven't built it already, you can use this comment as a reference: Add initial LoRA finetuning support; vulkan OUT_PROD; vulkan cross-entropy-backward #5 (comment)
2. Configure your build with: cmake -B build -DGGML_VULKAN=1 -DCMAKE_BUILD_TYPE=Debug -DBUILD_TESTING=ON
3. Build llama.cpp: cmake --build build --config Debug -j2
4. Run the backend-ops tests: ./build/bin/test-backend-ops
5. Optionally, run the tests for a specific operator with the -o option, for example: ./build/bin/test-backend-ops -o MUL_MAT
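For convenience, the configure, build, and test commands above can be collected into one small script. This is only a sketch: the variables BUILD_DIR, JOBS, OP and DRY_RUN and the helper functions run and steps are illustrative names introduced here, not part of llama.cpp. The script prints the commands by default; set DRY_RUN=0 to actually execute them.

```shell
#!/bin/sh
# Sketch of the configure/build/test steps. BUILD_DIR, JOBS, OP and DRY_RUN
# are hypothetical convenience variables, not llama.cpp options.
set -eu

BUILD_DIR="${BUILD_DIR:-build}"
JOBS="${JOBS:-2}"
OP="${OP:-}"             # e.g. OP=MUL_MAT to test a single operator
DRY_RUN="${DRY_RUN:-1}"  # 1 = only print the commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

steps() {
    # Configure with the Vulkan backend and the test targets enabled.
    run cmake -B "$BUILD_DIR" -DGGML_VULKAN=1 -DCMAKE_BUILD_TYPE=Debug -DBUILD_TESTING=ON
    # Build in Debug mode.
    run cmake --build "$BUILD_DIR" --config Debug -j"$JOBS"
    # Run the whole backend-ops suite, or one operator if OP is set.
    if [ -n "$OP" ]; then
        run "./$BUILD_DIR/bin/test-backend-ops" -o "$OP"
    else
        run "./$BUILD_DIR/bin/test-backend-ops"
    fi
}

steps
```

For example, saving this as a file of your choice and invoking it with DRY_RUN=0 OP=MUL_MAT would configure, build, and then run only the MUL_MAT tests.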

This PR has a commit disabling several tests for quantized datatypes that are not currently working properly on Adreno 830.

If you run the test suite as described above on this branch, it should report 2/2 backends passing at the end, with no failing tests on Adreno 830, as the attached log shows.

test_adreno_q8_inf2.txt
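If you save the test output to a file, the final summary can be checked with a small shell function. This is a rough sketch that assumes the summary line has the form "N/M backends pass..." (for example "2/2 backends passed"); the function name check_backends_passed is made up for this example, and the exact wording of the summary may differ between llama.cpp versions.

```shell
# check_backends_passed LOG
# Returns 0 if the last summary line in LOG reports N == M,
# 1 if some backends did not pass, and 2 if no summary line is found.
# Assumption: the summary looks like "N/M backends pass...".
check_backends_passed() {
    log="$1"
    # Keep only the last matching fragment, e.g. "2/2 backends pass".
    summary="$(grep -Eo '[0-9]+/[0-9]+ backends pass' "$log" 2>/dev/null | tail -n 1)"
    [ -n "$summary" ] || return 2
    counts="${summary%% *}"   # "N/M"
    passed="${counts%/*}"     # N
    total="${counts#*/}"      # M
    [ "$passed" -eq "$total" ]
}
```

Usage would look like: ./build/bin/test-backend-ops > test.log 2>&1; check_backends_passed test.log && echo "all backends passed".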

* oai moe * compat with new checkpoint * add attn sink impl * add rope scaling yarn * logits match with latest transformers code * wip chat template * rm trailing space * use ggml_scale_bias * rm redundant is_swa_all * convert interleaved gate_up * graph : fix activation function to match reference (#7) * vocab : handle o200k_harmony special tokens * ggml : add attention sinks support (#1) * llama : add attn sinks * ggml : add attn sinks * cuda : add attn sinks * vulkan : add support for sinks in softmax remove unnecessary return * ggml : add fused swiglu_oai op (#11) * ggml : add fused swiglu_oai op * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Georgi Gerganov <[email protected]> * update CUDA impl * cont : metal impl * add vulkan impl * test-backend-ops : more test cases, clean up * llama : remove unfused impl * remove extra lines --------- Co-authored-by: Georgi Gerganov <[email protected]> --------- Co-authored-by: slaren <[email protected]> * repack mxfp4 upon conversion * clean up a bit * enable thinking * add quick hack to render only some special tokens * fix bf16 conversion * remove vocab hack * webui ok * support chat parsing for gpt-oss * fix webui * direct mapping mxfp4, FINALLY * force using mxfp4 * properly use lazy tensor * ggml : add mxfp4 ggml : use e8m0 conversion instead of powf Co-authored-by: Diego Devesa <[email protected]> change kvalues_mxfp4 table to match e2m1 (#6) metal : remove quantization for now (not used) cuda : fix disabled CUDA graphs due to ffn moe bias vulkan : add support for mxfp4 cont : add cm2 dequant * ggml : add ggml_add_id (#13) * ggml : add ggml_add_id * add cuda impl * llama : add weight support check for add_id * perf opt * add vulkan impl * rename cuda files * add metal impl * allow in-place ggml_add_id * llama : keep biases on CPU with --cpu-moe * llama : fix compile error ggml-ci * cuda : add fallback for __nv_cvt_e8m0_to_bf16raw ggml-ci * cleanup ggml-ci * sycl : fix supports_op for MXFP4 ggml-ci * fix Unknown 
reasoning format * ggml-cpu : fix AVX build ggml-ci * fix hip build ggml-ci * cuda : add mxfp4 dequantization support for cuBLAS ggml-ci * ggml-cpu : fix mxfp4 fallback definitions for some architectures ggml-ci * cuda : fix version required for __nv_cvt_e8m0_to_bf16raw --------- Co-authored-by: Xuan Son Nguyen <[email protected]> Co-authored-by: slaren <[email protected]>

makaveli10 and others added 20 commits August 19, 2025 10:07

Add lora finetuning from adapter

f7b0025

Add: create new lora adapter for target modules to finetune if no lor…

116f3dd

…a is provided

Fix identical loss over epochs; fix garbage lora initization

9e6d8ce

Signed-off-by: vineet <[email protected]>

Remove lora training from finetune.cpp

8bb11c0

Signed-off-by: vineet <[email protected]>

Add adapter saving & other lora target modules

486ebc1

Signed-off-by: vineet <[email protected]>

Add finetune-lora for lora finetuning in examples

c23ada9

Signed-off-by: vineet <[email protected]>

Add dequantization to out_prod cuda kernel

3f295e1

Signed-off-by: vineet <[email protected]>

Update README with finetune-lora

0c1ffd1

Signed-off-by: vineet <[email protected]>

Vulkan: add support for fp32 OUT_PROD op

e9f5d88

CPU: add support for fp16_fp32 OUT_PROD op

fb0e501

Vulkan: add support for f16_f32 OUT_PROD op

2b0c835

Vulkan: Add Q4_0/Q8_0 OUT_PROD Vulkan support

0aef6c8

vulkan: Add initial cross entropy loss backward shader

25c5316

Signed-off-by: vineet <[email protected]>

vulkan: Fix cross-entropy-loss-back dispatch size and wg denominator

0721550

Signed-off-by: vineet <[email protected]>

vulkan: Change uint32 cast to int32 for outprod; allows android compi…

bc7dd9f

…lation Signed-off-by: vineet <[email protected]>

vulkan: Deallocate memory after destroying buffer

c36aeee

vulkan: Set specialization constants to { 0 } for out_prod

1709861

This fixes the vkDeviceLostError on Mali

vulkan: Set out_prod pipeline disable_robustness to true

b0c5b5b

Fix out_prod; vulkan ci issues

075d1cb

Add GEGLU backward (Vulkan) to enable Gemma training.

191dd7e

github-actions bot added Nvidia GPU Vulkan examples ggml testing labels Aug 29, 2025

Italo Nicola added 5 commits September 1, 2025 10:57

(wip) Vulkan: remove packed16 optimization for Q8_0 dequant4

3ccd40a

(wip) Vulkan: disable packed16 optimizations for Q8_0 src0

089377d

(wip) Vulkan: disable mulmat device->integer_dot_product optimization

d127984

This makes MUL_MAT tests pass for Q8_0 when n=9 failed.

(wip) Vulkan: remove [[unroll]] and dot() calls in mul_mat_vec shader

0547341

(wip) Vulkan: stop using data_b_v4 in mul_mat_vec shader for Q8_0

4bf9f07

Italo Nicola added 5 commits September 1, 2025 10:57

(wip) Vulkan: severely reduce threshold needed for submitting mul_mat

9c20a4c

(wip) Vulkan: Disable device->subgroup_size_control

642ea3b

(wip) Vulkan: disable device->integer_dot_product

405c90c

(wip) Vulkan: disable COOPMAT support

58d3c68

(wip) Tests: disable non-q8 quant tests

208747f

infinitalo force-pushed the italo/tether/adreno_q8_inference branch 2 times, most recently from cbea88f to 208747f Compare September 1, 2025 14:13
