Qwen3-30B-A3B-Instruct-2507-Q4_K_M-GGUF woes 😵‍💫 #16253
-
I wanted to run https://huggingface.co/ggml-org/Qwen3-30B-A3B-Instruct-2507-Q8_0-GGUF on a MacBook Pro M3 with 36GB. Since the model was built with gguf-my-repo, I tried to create my own. But when I load it, I get this log.txt. What am I doing wrong? How can I fix it?
Replies: 1 comment 3 replies
-
You are running out of memory because the context of this model is 256k and requires ~25GB. You should be able to run with `-c 32768`, and probably higher depending on how much free memory you have.
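For reference, the ~25GB figure is roughly what an f16 KV cache works out to at the full 256k context. A back-of-envelope sketch (the layer/head counts below are assumed from the published Qwen3-30B-A3B config, not stated in this thread):

```python
def kv_cache_bytes(n_ctx, n_layers=48, n_kv_heads=4, head_dim=128, bytes_per_elem=2):
    """Estimate KV-cache size: K and V each store
    n_layers * n_kv_heads * head_dim values per token (f16 = 2 bytes)."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

full = kv_cache_bytes(262144)   # the model's full 256k context
small = kv_cache_bytes(32768)   # reduced context, as suggested above

print(f"256k context: {full / 2**30:.1f} GiB")   # 256k context: 24.0 GiB
print(f" 32k context: {small / 2**30:.1f} GiB")  #  32k context: 3.0 GiB
```

So dropping to `-c 32768` shrinks the cache by 8x, leaving plenty of headroom for the ~18GB of model weights on a 36GB machine.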