Description
What happened?
There is heavy throttling during token generation on Apple Silicon. The machine tested is a 14-inch MacBook Pro with an M3 Max and 128 GB of memory. In my experience, throttling occurs more often with larger models (≥70B). This report uses Qwen2.5 72B Instruct Q4_0 GGUF, although the throttling is not exclusive to this model.
The tests were run in High Power Mode with the original 96W adapter plugged in, to ensure that the machine was not power limited. The maximum core temperature during throttling (middle of the 4th run in this case) hovered between 60 and 70°C, so the throttling should not be due to thermal limits. I have seen this issue for months across many different versions of llama.cpp, so it is not version specific.
Name and Version
version: 4104 (0fff7fd)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin24.1.0
What operating system are you seeing the problem on?
Mac
Relevant log output
Steps to reproduce:
./llama-bench -m qwen2.5-72b-instruct-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
sleep 10
./llama-bench -m qwen2.5-72b-instruct-q4_0.gguf -mmp 0 -fa 1 -p 0 -n 32
sleep 10
... (repeated)
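For convenience, the repeated runs above can be wrapped in a small loop. This is only a sketch: the `repeat_bench` function, the `BENCH` and `MODEL` variables, and the default of six runs (chosen to match the six result rows reported here) are assumptions for illustration. The llama-bench flags are exactly the ones used in the steps above.

```shell
#!/bin/sh
# Hypothetical helper: run the same llama-bench command several times,
# pausing between runs. BENCH and MODEL are assumed names, not part of llama.cpp.
BENCH="${BENCH:-./llama-bench}"
MODEL="${MODEL:-qwen2.5-72b-instruct-q4_0.gguf}"

repeat_bench() {
    runs="${1:-6}"     # assumption: six runs, matching the six rows in the results
    pause="${2:-10}"   # seconds to sleep between consecutive runs
    i=1
    while [ "$i" -le "$runs" ]; do
        # Same flags as in the manual steps: no mmap, flash attention on,
        # no prompt processing, 32 generated tokens.
        "$BENCH" -m "$MODEL" -mmp 0 -fa 1 -p 0 -n 32
        if [ "$i" -lt "$runs" ]; then
            sleep "$pause"
        fi
        i=$((i + 1))
    done
}
```

Calling `repeat_bench 6 10` should be equivalent to typing the command six times with `sleep 10` in between.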
Results:
| model | size | params | backend | threads | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ---: | ------------: | -------------------: |
| qwen2 70B Q4_0 | 38.53 GiB | 72.96 B | Metal,BLAS | 12 | 1 | 0 | tg32 | 8.68 ± 0.07 |
| qwen2 70B Q4_0 | 38.53 GiB | 72.96 B | Metal,BLAS | 12 | 1 | 0 | tg32 | 8.60 ± 0.15 |
| qwen2 70B Q4_0 | 38.53 GiB | 72.96 B | Metal,BLAS | 12 | 1 | 0 | tg32 | 8.51 ± 0.11 |
| qwen2 70B Q4_0 | 38.53 GiB | 72.96 B | Metal,BLAS | 12 | 1 | 0 | tg32 | 5.40 ± 1.10 |
| qwen2 70B Q4_0 | 38.53 GiB | 72.96 B | Metal,BLAS | 12 | 1 | 0 | tg32 | 2.67 ± 0.58 |
| qwen2 70B Q4_0 | 38.53 GiB | 72.96 B | Metal,BLAS | 12 | 1 | 0 | tg32 | 2.37 ± 0.58 |
build: 0fff7fd7 (4104)
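To make the drop easier to read, the t/s column can be expressed relative to the first run. The snippet below is a rough sketch, not part of llama.cpp: the `throughput_relative_to_first` function is a hypothetical name, and it assumes the markdown table layout shown above, where t/s is the ninth column and each row contains `tg32`.

```shell
#!/bin/sh
# Hypothetical helper: read llama-bench markdown rows on stdin and print each
# run's mean t/s as a percentage of the first run.
throughput_relative_to_first() {
    LC_ALL=C awk -F'|' '
        /tg32/ {
            rate = $10 + 0          # "8.68 ± 0.07" converts to its leading number, the mean
            if (n == 0) base = rate # the first matching row is the baseline
            n++
            printf "run %d: %.2f t/s (%.0f%% of run 1)\n", n, rate, 100 * rate / base
        }'
}

# Rows copied from the results above.
throughput_relative_to_first <<'EOF'
| qwen2 70B Q4_0 | 38.53 GiB | 72.96 B | Metal,BLAS | 12 | 1 | 0 | tg32 | 8.68 ± 0.07 |
| qwen2 70B Q4_0 | 38.53 GiB | 72.96 B | Metal,BLAS | 12 | 1 | 0 | tg32 | 8.60 ± 0.15 |
| qwen2 70B Q4_0 | 38.53 GiB | 72.96 B | Metal,BLAS | 12 | 1 | 0 | tg32 | 8.51 ± 0.11 |
| qwen2 70B Q4_0 | 38.53 GiB | 72.96 B | Metal,BLAS | 12 | 1 | 0 | tg32 | 5.40 ± 1.10 |
| qwen2 70B Q4_0 | 38.53 GiB | 72.96 B | Metal,BLAS | 12 | 1 | 0 | tg32 | 2.67 ± 0.58 |
| qwen2 70B Q4_0 | 38.53 GiB | 72.96 B | Metal,BLAS | 12 | 1 | 0 | tg32 | 2.37 ± 0.58 |
EOF
```

With these six rows, the last run comes out at roughly 27% of the first (2.37 vs. 8.68 t/s).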