Conversation

ikawrakow (Owner)

Somehow I was under the impression that the Q8_0 KV cache works for CPU-only inference with FlashMLA-2. It does work for prompt processing, but not for token generation (the two take different code paths). Clearly there are too many options, as I'm getting confused myself. Anyhow, this PR adds the missing Q8_0 -> Q8_0 contiguous transpose operation, so Q8_0 KV cache can now be used with FlashMLA-2 on the CPU as well.
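For reference, here is a minimal sketch in plain C of what a Q8_0 -> Q8_0 contiguous transpose has to do conceptually: dequantize the row-major source blocks, transpose the resulting float matrix, and requantize each row of the transposed matrix back into Q8_0 blocks. This is an illustration only, not the kernel added in this PR; the block layout is simplified (ggml stores the scale as fp16), and the names `block_q8_0_f32` and `transpose_q8_0` are made up for the example.

```c
// Conceptual sketch of a Q8_0 -> Q8_0 contiguous transpose.
// Not the actual ggml kernel: the real code works inside the dup/cont op and
// uses ggml's block_q8_0 with an fp16 scale. Note that requantizing the
// transposed rows introduces a small additional quantization error, since the
// elements of each new block come from 32 different source blocks.
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

#define QK8_0 32  // elements per Q8_0 block, as in ggml

typedef struct {
    float  d;          // block scale (fp32 here for simplicity)
    int8_t qs[QK8_0];  // quantized values
} block_q8_0_f32;

// Dequantize one block into QK8_0 floats.
static void dequant_block(const block_q8_0_f32 * b, float * out) {
    for (int i = 0; i < QK8_0; ++i) out[i] = b->d * b->qs[i];
}

// Quantize QK8_0 floats into one block.
static void quant_block(const float * in, block_q8_0_f32 * b) {
    float amax = 0.f;
    for (int i = 0; i < QK8_0; ++i) { float a = fabsf(in[i]); if (a > amax) amax = a; }
    const float d  = amax / 127.f;
    const float id = d > 0.f ? 1.f/d : 0.f;
    b->d = d;
    for (int i = 0; i < QK8_0; ++i) b->qs[i] = (int8_t)roundf(in[i] * id);
}

// Transpose an n_rows x n_cols Q8_0 matrix (row-major, n_rows and n_cols
// both multiples of QK8_0) into a contiguous n_cols x n_rows Q8_0 matrix.
static void transpose_q8_0(const block_q8_0_f32 * src, block_q8_0_f32 * dst,
                           int n_rows, int n_cols) {
    float * tmp = malloc((size_t)n_rows * n_cols * sizeof(float));
    float row[QK8_0];
    // 1. dequantize the whole source into a float matrix
    for (int r = 0; r < n_rows; ++r) {
        for (int cb = 0; cb < n_cols/QK8_0; ++cb) {
            dequant_block(&src[r*(n_cols/QK8_0) + cb], &tmp[r*n_cols + cb*QK8_0]);
        }
    }
    // 2. walk the transposed layout and requantize block by block
    for (int c = 0; c < n_cols; ++c) {
        for (int rb = 0; rb < n_rows/QK8_0; ++rb) {
            for (int i = 0; i < QK8_0; ++i) row[i] = tmp[(rb*QK8_0 + i)*n_cols + c];
            quant_block(row, &dst[c*(n_rows/QK8_0) + rb]);
        }
    }
    free(tmp);
}
```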

ikawrakow merged commit 8e549b4 into main on Mar 18, 2025
ikawrakow added a commit that referenced this pull request on Mar 19, 2025:

I broke it with PR #265. I was testing with a model where the wk_b and wk_v tensors were present, so they did not need to be computed, and I did not notice that the change I made to ggml_compute_forward_dup_q breaks that computation.

Co-authored-by: Iwan Kawrakow <[email protected]>