Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
I am trying to run TheBloke's llama-2-70b-chat.ggmlv3.q2_K.bin on my M1 Max MacBook Pro. I expect the model to load and generate text.
Current Behavior
When running the command below, I get the following error:
./main -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -t 9 -ngl 1 -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]"
main: build = 918 (7c529ce)
main: seed = 1690493628
llama.cpp: loading model from ./models/llama-2-70b-chat.ggmlv3.q2_K.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 4096
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 8
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 8
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 28672
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 10 (mostly Q2_K)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: mem required = 27827.36 MB (+ 160.00 MB per state)
llama_new_context_with_model: kv self size = 160.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/john/pythonEnvironments/llamacpp/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x156308740
ggml_metal_init: loaded kernel_add_row 0x156308de0
ggml_metal_init: loaded kernel_mul 0x156309280
ggml_metal_init: loaded kernel_mul_row 0x156309830
ggml_metal_init: loaded kernel_scale 0x156309cd0
ggml_metal_init: loaded kernel_silu 0x15630a170
ggml_metal_init: loaded kernel_relu 0x15630a610
ggml_metal_init: loaded kernel_gelu 0x15630aab0
ggml_metal_init: loaded kernel_soft_max 0x15630b0e0
ggml_metal_init: loaded kernel_diag_mask_inf 0x15630b6c0
ggml_metal_init: loaded kernel_get_rows_f16 0x15630bcc0
ggml_metal_init: loaded kernel_get_rows_q4_0 0x15630c430
ggml_metal_init: loaded kernel_get_rows_q4_1 0x15630ca30
ggml_metal_init: loaded kernel_get_rows_q2_K 0x15630d030
ggml_metal_init: loaded kernel_get_rows_q3_K 0x15630d630
ggml_metal_init: loaded kernel_get_rows_q4_K 0x15630dc30
ggml_metal_init: loaded kernel_get_rows_q5_K 0x15630e230
ggml_metal_init: loaded kernel_get_rows_q6_K 0x15630e830
ggml_metal_init: loaded kernel_rms_norm 0x15630ee70
ggml_metal_init: loaded kernel_norm 0x15630f610
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x15630fdf0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x156310430
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x156107190
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x156310930
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x156311090
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x1563116d0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x156311cf0
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x156312510
ggml_metal_init: loaded kernel_rope 0x1563129b0
ggml_metal_init: loaded kernel_alibi_f32 0x156313340
ggml_metal_init: loaded kernel_cpy_f32_f16 0x156313b50
ggml_metal_init: loaded kernel_cpy_f32_f32 0x156314360
ggml_metal_init: loaded kernel_cpy_f16_f16 0x156314a50
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 205.08 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 27262.61 MB, (27263.06 / 49152.00)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 24.17 MB, (27287.23 / 49152.00)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 162.00 MB, (27449.23 / 49152.00)
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 237.00 MB, (27686.23 / 49152.00)
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 304.00 MB, (27990.23 / 49152.00)
system_info: n_threads = 9 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
GGML_ASSERT: ggml-metal.m:721: ne02 == ne12
zsh: abort ./main -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -t 9 -ngl 1 -p
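For reference, the failing assertion can be located in the Metal backend source with a quick check (assuming the current directory is the llama.cpp checkout):
grep -n "ne02 == ne12" ggml-metal.m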
The same error happens when I try the other quantized models, such as llama-2-70b-chat.ggmlv3.q4_K_M.bin and llama-2-70b-chat.ggmlv3.q4_K_S.bin.
I get a similar error when using the server:
./server -ngl 1 -t 9 -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -c 4096
{"timestamp":1690494784,"level":"INFO","function":"main","line":1124,"message":"build info","build":918,"commit":"7c529ce"}
{"timestamp":1690494784,"level":"INFO","function":"main","line":1129,"message":"system info","n_threads":9,"total_threads":10,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | "}
llama.cpp: loading model from ./models/llama-2-70b-chat.ggmlv3.q2_K.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 4096
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 8
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 8
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 28672
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 10 (mostly Q2_K)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: mem required = 28339.36 MB (+ 1280.00 MB per state)
llama_new_context_with_model: kv self size = 1280.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/john/pythonEnvironments/llamacpp/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x152805ac0
ggml_metal_init: loaded kernel_add_row 0x1356044b0
ggml_metal_init: loaded kernel_mul 0x135604aa0
ggml_metal_init: loaded kernel_mul_row 0x135605050
ggml_metal_init: loaded kernel_scale 0x1356054f0
ggml_metal_init: loaded kernel_silu 0x135605990
ggml_metal_init: loaded kernel_relu 0x135605e30
ggml_metal_init: loaded kernel_gelu 0x1356062d0
ggml_metal_init: loaded kernel_soft_max 0x135606900
ggml_metal_init: loaded kernel_diag_mask_inf 0x135606ee0
ggml_metal_init: loaded kernel_get_rows_f16 0x1356074e0
ggml_metal_init: loaded kernel_get_rows_q4_0 0x135607c50
ggml_metal_init: loaded kernel_get_rows_q4_1 0x135608250
ggml_metal_init: loaded kernel_get_rows_q2_K 0x135608850
ggml_metal_init: loaded kernel_get_rows_q3_K 0x135608e50
ggml_metal_init: loaded kernel_get_rows_q4_K 0x135609450
ggml_metal_init: loaded kernel_get_rows_q5_K 0x135609a50
ggml_metal_init: loaded kernel_get_rows_q6_K 0x13560a050
ggml_metal_init: loaded kernel_rms_norm 0x13560a690
ggml_metal_init: loaded kernel_norm 0x13560ae30
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x13560b610
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x13560bc50
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x13560c290
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x152a09450
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x152a09a90
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x152a0a0d0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x152a0a6f0
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x152a0af10
ggml_metal_init: loaded kernel_rope 0x152a0b3b0
ggml_metal_init: loaded kernel_alibi_f32 0x152a0bd40
ggml_metal_init: loaded kernel_cpy_f32_f16 0x152a0c550
ggml_metal_init: loaded kernel_cpy_f32_f32 0x152a0cd60
ggml_metal_init: loaded kernel_cpy_f16_f16 0x152806040
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 205.08 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 27262.61 MB, (27263.06 / 49152.00)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 24.17 MB, (27287.23 / 49152.00)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1282.00 MB, (28569.23 / 49152.00)
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 749.00 MB, (29318.23 / 49152.00)
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 304.00 MB, (29622.23 / 49152.00)
llama server listening at http://127.0.0.1:8080
{"timestamp":1690494785,"level":"INFO","function":"main","line":1344,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
{"timestamp":1690494790,"level":"INFO","function":"log_server_request","line":1097,"message":"request","remote_addr":"127.0.0.1","remote_port":60293,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1690494791,"level":"INFO","function":"log_server_request","line":1097,"message":"request","remote_addr":"127.0.0.1","remote_port":60293,"status":200,"method":"GET","path":"/index.js","params":{}}
{"timestamp":1690494791,"level":"INFO","function":"log_server_request","line":1097,"message":"request","remote_addr":"127.0.0.1","remote_port":60294,"status":200,"method":"GET","path":"/completion.js","params":{}}
{"timestamp":1690494791,"level":"INFO","function":"log_server_request","line":1097,"message":"request","remote_addr":"127.0.0.1","remote_port":60293,"status":404,"method":"GET","path":"/favicon.ico","params":{}}
{"timestamp":1690494792,"level":"INFO","function":"log_server_request","line":1097,"message":"request","remote_addr":"127.0.0.1","remote_port":60293,"status":404,"method":"GET","path":"/favicon.ico","params":{}}
GGML_ASSERT: ggml-metal.m:721: ne02 == ne12
zsh: abort ./server -ngl 1 -t 9 -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -c
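The server starts fine and serves the web UI; the assert fires as soon as a completion is requested. For reference, a request like the following (a sketch; the endpoint and field names follow the server README) triggers the same abort:
curl -s -X POST http://127.0.0.1:8080/completion -H "Content-Type: application/json" -d '{"prompt": "Write a story about llamas", "n_predict": 64}'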
The 70B models only work when Metal is not used, i.e. when omitting -ngl 1.
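For example, the same command with the Metal offload flag removed completes without the assert, just slowly:
./main -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -t 9 -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]"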
The error only happens with the 70B models; the smaller 13B Llama 2 chat models work as expected.
Environment and Context
Running on my M1 Max MacBook Pro:
Model Name: MacBook Pro
Model Identifier: MacBookPro18,2
Model Number: MK233LL/A
Chip: Apple M1 Max
Total Number of Cores: 10 (8 performance and 2 efficiency)
Memory: 64 GB
System Firmware Version: 8419.60.44
OS Loader Version: 8419.60.44
Activation Lock Status: Enabled
llama.cpp was built with LLAMA_METAL=1 make.
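For reference, the build was done roughly as follows (a sketch of the standard Metal build steps from the README):
make clean
LLAMA_METAL=1 make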
uname -a
Darwin Johns-MacBook-Pro-2.local 22.2.0 Darwin Kernel Version 22.2.0: Fri Nov 11 02:03:51 PST 2022; root:xnu-8792.61.2~4/RELEASE_ARM64_T6000 arm64
python3 --version
Python 3.9.15
make --version
GNU Make 3.81
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
This program built for i386-apple-darwin11.3.0
g++ --version
Apple clang version 14.0.0 (clang-1400.0.29.202)
Target: arm64-apple-darwin22.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
git log | head -1
commit 7c529cede6e84054e77a3eceab31c53de7b2f55b