Bug: Heavy throttling during token generation on Apple Silicon #10444

@Azirine

Description

What happened?

There is heavy throttling during token generation on Apple Silicon. The test machine is a 14" MacBook Pro with an M3 Max and 128 GB of memory. In my experience, throttling occurs more often with larger models (≥70B). Qwen2.5 72B Instruct Q4_0 GGUF is used in this case, although the throttling is not exclusive to this model.

The tests were performed in High Power Mode with the original 96W adapter plugged in, to ensure that the machine was not power limited. The maximum core temperature during throttling (the middle of the 4th run in this case) hovered between 60 and 70°C, so the throttling should not be due to thermal limits. I have experienced this issue for months across many different versions of llama.cpp, so it is not version specific.

Name and Version

version: 4104 (0fff7fd)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin24.1.0

What operating system are you seeing the problem on?

Mac

Relevant log output

Steps to reproduce:
./llama-bench -m qwen2.5-72b-instruct-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
sleep 10
./llama-bench -m qwen2.5-72b-instruct-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
sleep 10
... (repeated)
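
For anyone who wants to script the repetition rather than paste the commands by hand, here is a minimal Python sketch of the same loop. The flags are copied from the command above. The function names, the default of 6 runs (chosen to match the six result rows), and the injectable `runner`/`sleep` parameters are my own assumptions for illustration and are not part of llama.cpp.

```python
import subprocess
import time

# Flags copied verbatim from the reproduction command in this report.
BENCH_ARGS = ["-mmp", "0", "-fa", "1", "-p", "0", "-n", "32"]


def bench_command(model_path, binary="./llama-bench"):
    """Build the argv list for one llama-bench invocation (hypothetical helper)."""
    return [binary, "-m", model_path, *BENCH_ARGS]


def run_repeated(model_path, runs=6, pause_s=10,
                 runner=subprocess.run, sleep=time.sleep):
    """Run the benchmark `runs` times, pausing `pause_s` seconds between runs.

    `runner` and `sleep` are injectable so the loop can be dry-run
    without actually launching llama-bench.
    """
    for i in range(runs):
        runner(bench_command(model_path), check=True)
        if i < runs - 1:
            sleep(pause_s)
```

Calling `run_repeated("qwen2.5-72b-instruct-q4_0.gguf")` from the directory containing `llama-bench` should reproduce the sequence above, with each run printing its own results table.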

Results:
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.53 GiB |    72.96 B | Metal,BLAS |      12 |  1 |    0 |          tg32 |          8.68 ± 0.07 |
| qwen2 70B Q4_0                 |  38.53 GiB |    72.96 B | Metal,BLAS |      12 |  1 |    0 |          tg32 |          8.60 ± 0.15 |
| qwen2 70B Q4_0                 |  38.53 GiB |    72.96 B | Metal,BLAS |      12 |  1 |    0 |          tg32 |          8.51 ± 0.11 |
| qwen2 70B Q4_0                 |  38.53 GiB |    72.96 B | Metal,BLAS |      12 |  1 |    0 |          tg32 |          5.40 ± 1.10 |
| qwen2 70B Q4_0                 |  38.53 GiB |    72.96 B | Metal,BLAS |      12 |  1 |    0 |          tg32 |          2.67 ± 0.58 |
| qwen2 70B Q4_0                 |  38.53 GiB |    72.96 B | Metal,BLAS |      12 |  1 |    0 |          tg32 |          2.37 ± 0.58 |

build: 0fff7fd7 (4104)
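
To put a number on the drop, the short sketch below compares each run against the first one. The throughput values are the mean tg32 figures copied from the table above; the variable names are only illustrative.

```python
# Mean tg32 throughput (t/s) for the six runs, copied from the results table.
runs_tps = [8.68, 8.60, 8.51, 5.40, 2.67, 2.37]

baseline = runs_tps[0]
for i, tps in enumerate(runs_tps, start=1):
    ratio = tps / baseline
    print(f"run {i}: {tps:.2f} t/s ({ratio:.0%} of run 1)")

# Overall slowdown between the first and the last run.
slowdown = baseline / runs_tps[-1]
print(f"slowdown from run 1 to run {len(runs_tps)}: {slowdown:.2f}x")
```

By this measure the last run is roughly 3.7x slower than the first, and the drop begins in the 4th run, which is also the run with the largest reported deviation (± 1.10).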

Labels: bug-unconfirmed, medium severity, stale