Bug: Heavy throttling during token generation on Apple Silicon #10444

### What happened?

Token generation throttles heavily on Apple Silicon. The test machine is a 14" MacBook Pro (M3 Max, 128 GB memory). In my experience, throttling occurs more often with larger models (≥70B). This report uses [Qwen 72B Q4_0 GGUF](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct-GGUF), although the throttling is not exclusive to this model.

The tests were run in high-power mode with the original 96W adapter plugged in, so the machine should not be power limited. The maximum core temperature during throttling (midway through the 4th run in this case) hovered between 60 and 70°C, so the throttling should not be due to thermal limits. I have seen this issue for months across many different versions of llama.cpp, so it is not version specific.

### Name and Version

version: 4104 (0fff7fd7)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin24.1.0

### What operating system are you seeing the problem on?

Mac

### Relevant log output

```shell
Steps to reproduce:
./llama-bench -m qwen2.5-72b-instruct-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
sleep 10
./llama-bench -m qwen2.5-72b-instruct-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
sleep 10
... (repeated)

Results:
| model                          |       size |     params | backend    | threads | fa | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0                 |  38.53 GiB |    72.96 B | Metal,BLAS |      12 |  1 |    0 |          tg32 |          8.68 ± 0.07 |
| qwen2 70B Q4_0                 |  38.53 GiB |    72.96 B | Metal,BLAS |      12 |  1 |    0 |          tg32 |          8.60 ± 0.15 |
| qwen2 70B Q4_0                 |  38.53 GiB |    72.96 B | Metal,BLAS |      12 |  1 |    0 |          tg32 |          8.51 ± 0.11 |
| qwen2 70B Q4_0                 |  38.53 GiB |    72.96 B | Metal,BLAS |      12 |  1 |    0 |          tg32 |          5.40 ± 1.10 |
| qwen2 70B Q4_0                 |  38.53 GiB |    72.96 B | Metal,BLAS |      12 |  1 |    0 |          tg32 |          2.67 ± 0.58 |
| qwen2 70B Q4_0                 |  38.53 GiB |    72.96 B | Metal,BLAS |      12 |  1 |    0 |          tg32 |          2.37 ± 0.58 |

build: 0fff7fd7 (4104)
```
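To put a number on the slowdown, here is a small sketch that compares the first and last runs. The `rates` values are the mean t/s figures copied by hand from the six `tg32` rows above. The variable names are only illustrative and are not part of llama.cpp.

```shell
#!/bin/sh
# Mean t/s values copied from the six tg32 rows above, in run order.
rates="8.68 8.60 8.51 5.40 2.67 2.37"

# Compare the first and last runs and report the relative drop in throughput.
summary=$(echo "$rates" | awk '{
    first = $1
    last = $NF
    printf "first=%.2f last=%.2f drop=%.1f%%", first, last, (1 - last / first) * 100
}')

echo "$summary"
```

With the numbers above, throughput falls by roughly 73% between the first and sixth runs, and most of that drop happens between the third and fifth runs.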

