Conversation

ikawrakow (Owner)

Somehow I was under the impression that the Q8_0 KV cache works for CPU-only inference with FlashMLA-2. It does work for prompt processing, but not for token generation (the two take different code paths). Clearly there are too many options, as I'm getting confused myself. Anyhow, this PR adds the missing Q8_0 -> Q8_0 contiguous transpose operation, so Q8_0 KV cache can now be used with FlashMLA-2 on the CPU as well.
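For reference, here is a minimal sketch in plain C of what a Q8_0 -> Q8_0 contiguous transpose has to do conceptually: dequantize the row-major source blocks, transpose the resulting float matrix, and requantize each row of the transposed matrix back into Q8_0 blocks. This is an illustration only, not the kernel added in this PR; the block layout is simplified (ggml stores the scale as fp16), and the names `block_q8_0_f32` and `transpose_q8_0` are made up for the example.

```c
// Conceptual sketch of a Q8_0 -> Q8_0 contiguous transpose.
// Not the actual ggml kernel: the real code works inside the dup/cont op and
// uses ggml's block_q8_0 with an fp16 scale. Note that requantizing the
// transposed rows introduces a small additional quantization error, since the
// elements of each new block come from 32 different source blocks.
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

#define QK8_0 32  // elements per Q8_0 block, as in ggml

typedef struct {
    float  d;          // block scale (fp32 here for simplicity)
    int8_t qs[QK8_0];  // quantized values
} block_q8_0_f32;

// Dequantize one block into QK8_0 floats.
static void dequant_block(const block_q8_0_f32 * b, float * out) {
    for (int i = 0; i < QK8_0; ++i) out[i] = b->d * b->qs[i];
}

// Quantize QK8_0 floats into one block.
static void quant_block(const float * in, block_q8_0_f32 * b) {
    float amax = 0.f;
    for (int i = 0; i < QK8_0; ++i) { float a = fabsf(in[i]); if (a > amax) amax = a; }
    const float d  = amax / 127.f;
    const float id = d > 0.f ? 1.f/d : 0.f;
    b->d = d;
    for (int i = 0; i < QK8_0; ++i) b->qs[i] = (int8_t)roundf(in[i] * id);
}

// Transpose an n_rows x n_cols Q8_0 matrix (row-major, n_rows and n_cols
// both multiples of QK8_0) into a contiguous n_cols x n_rows Q8_0 matrix.
static void transpose_q8_0(const block_q8_0_f32 * src, block_q8_0_f32 * dst,
                           int n_rows, int n_cols) {
    float * tmp = malloc((size_t)n_rows * n_cols * sizeof(float));
    float row[QK8_0];
    // 1. dequantize the whole source into a float matrix
    for (int r = 0; r < n_rows; ++r) {
        for (int cb = 0; cb < n_cols/QK8_0; ++cb) {
            dequant_block(&src[r*(n_cols/QK8_0) + cb], &tmp[r*n_cols + cb*QK8_0]);
        }
    }
    // 2. walk the transposed layout and requantize block by block
    for (int c = 0; c < n_cols; ++c) {
        for (int rb = 0; rb < n_rows/QK8_0; ++rb) {
            for (int i = 0; i < QK8_0; ++i) row[i] = tmp[(rb*QK8_0 + i)*n_cols + c];
            quant_block(row, &dst[c*(n_rows/QK8_0) + rb]);
        }
    }
    free(tmp);
}
```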

ikawrakow merged commit 8e549b4 into main on Mar 18, 2025
ikawrakow added a commit that referenced this pull request on Mar 19, 2025:

I broke it with PR #265. I was testing with a model where the wk_b and wk_v tensors were present, so they did not need to be computed, and I did not notice that the change I made to ggml_compute_forward_dup_q breaks that computation.

Co-authored-by: Iwan Kawrakow <[email protected]>