Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
I am trying to run TheBloke's llama-2-70b-chat.ggmlv3.q2_K.bin on my M1 Max MacBook Pro. I expect the model to load and generate text.
Current Behavior
When running the command below, I get the following error:
./main -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -t 9 -ngl 1 -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]"
main: build = 918 (7c529ce)
main: seed = 1690493628
llama.cpp: loading model from ./models/llama-2-70b-chat.ggmlv3.q2_K.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 4096
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 8
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 8
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 28672
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 10 (mostly Q2_K)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: mem required = 27827.36 MB (+ 160.00 MB per state)
llama_new_context_with_model: kv self size = 160.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/john/pythonEnvironments/llamacpp/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x156308740
ggml_metal_init: loaded kernel_add_row 0x156308de0
ggml_metal_init: loaded kernel_mul 0x156309280
ggml_metal_init: loaded kernel_mul_row 0x156309830
ggml_metal_init: loaded kernel_scale 0x156309cd0
ggml_metal_init: loaded kernel_silu 0x15630a170
ggml_metal_init: loaded kernel_relu 0x15630a610
ggml_metal_init: loaded kernel_gelu 0x15630aab0
ggml_metal_init: loaded kernel_soft_max 0x15630b0e0
ggml_metal_init: loaded kernel_diag_mask_inf 0x15630b6c0
ggml_metal_init: loaded kernel_get_rows_f16 0x15630bcc0
ggml_metal_init: loaded kernel_get_rows_q4_0 0x15630c430
ggml_metal_init: loaded kernel_get_rows_q4_1 0x15630ca30
ggml_metal_init: loaded kernel_get_rows_q2_K 0x15630d030
ggml_metal_init: loaded kernel_get_rows_q3_K 0x15630d630
ggml_metal_init: loaded kernel_get_rows_q4_K 0x15630dc30
ggml_metal_init: loaded kernel_get_rows_q5_K 0x15630e230
ggml_metal_init: loaded kernel_get_rows_q6_K 0x15630e830
ggml_metal_init: loaded kernel_rms_norm 0x15630ee70
ggml_metal_init: loaded kernel_norm 0x15630f610
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x15630fdf0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x156310430
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x156107190
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x156310930
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x156311090
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x1563116d0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x156311cf0
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x156312510
ggml_metal_init: loaded kernel_rope 0x1563129b0
ggml_metal_init: loaded kernel_alibi_f32 0x156313340
ggml_metal_init: loaded kernel_cpy_f32_f16 0x156313b50
ggml_metal_init: loaded kernel_cpy_f32_f32 0x156314360
ggml_metal_init: loaded kernel_cpy_f16_f16 0x156314a50
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 205.08 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 27262.61 MB, (27263.06 / 49152.00)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 24.17 MB, (27287.23 / 49152.00)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 162.00 MB, (27449.23 / 49152.00)
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 237.00 MB, (27686.23 / 49152.00)
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 304.00 MB, (27990.23 / 49152.00)
system_info: n_threads = 9 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
GGML_ASSERT: ggml-metal.m:721: ne02 == ne12
zsh: abort ./main -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -t 9 -ngl 1 -p
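For reference, the failing assertion can be located in the Metal backend source with a quick check (assuming the current directory is the llama.cpp checkout):
grep -n "ne02 == ne12" ggml-metal.m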
The same error happens when I try the other quantized models, such as llama-2-70b-chat.ggmlv3.q4_K_M.bin and llama-2-70b-chat.ggmlv3.q4_K_S.bin.
I get a similar error when using the server:
./server -ngl 1 -t 9 -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -c 4096
{"timestamp":1690494784,"level":"INFO","function":"main","line":1124,"message":"build info","build":918,"commit":"7c529ce"}
{"timestamp":1690494784,"level":"INFO","function":"main","line":1129,"message":"system info","n_threads":9,"total_threads":10,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | "}
llama.cpp: loading model from ./models/llama-2-70b-chat.ggmlv3.q2_K.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 4096
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 8
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 8
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 28672
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 10 (mostly Q2_K)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: mem required = 28339.36 MB (+ 1280.00 MB per state)
llama_new_context_with_model: kv self size = 1280.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/john/pythonEnvironments/llamacpp/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x152805ac0
ggml_metal_init: loaded kernel_add_row 0x1356044b0
ggml_metal_init: loaded kernel_mul 0x135604aa0
ggml_metal_init: loaded kernel_mul_row 0x135605050
ggml_metal_init: loaded kernel_scale 0x1356054f0
ggml_metal_init: loaded kernel_silu 0x135605990
ggml_metal_init: loaded kernel_relu 0x135605e30
ggml_metal_init: loaded kernel_gelu 0x1356062d0
ggml_metal_init: loaded kernel_soft_max 0x135606900
ggml_metal_init: loaded kernel_diag_mask_inf 0x135606ee0
ggml_metal_init: loaded kernel_get_rows_f16 0x1356074e0
ggml_metal_init: loaded kernel_get_rows_q4_0 0x135607c50
ggml_metal_init: loaded kernel_get_rows_q4_1 0x135608250
ggml_metal_init: loaded kernel_get_rows_q2_K 0x135608850
ggml_metal_init: loaded kernel_get_rows_q3_K 0x135608e50
ggml_metal_init: loaded kernel_get_rows_q4_K 0x135609450
ggml_metal_init: loaded kernel_get_rows_q5_K 0x135609a50
ggml_metal_init: loaded kernel_get_rows_q6_K 0x13560a050
ggml_metal_init: loaded kernel_rms_norm 0x13560a690
ggml_metal_init: loaded kernel_norm 0x13560ae30
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x13560b610
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x13560bc50
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x13560c290
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x152a09450
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x152a09a90
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x152a0a0d0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x152a0a6f0
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x152a0af10
ggml_metal_init: loaded kernel_rope 0x152a0b3b0
ggml_metal_init: loaded kernel_alibi_f32 0x152a0bd40
ggml_metal_init: loaded kernel_cpy_f32_f16 0x152a0c550
ggml_metal_init: loaded kernel_cpy_f32_f32 0x152a0cd60
ggml_metal_init: loaded kernel_cpy_f16_f16 0x152806040
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 205.08 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 27262.61 MB, (27263.06 / 49152.00)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 24.17 MB, (27287.23 / 49152.00)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1282.00 MB, (28569.23 / 49152.00)
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 749.00 MB, (29318.23 / 49152.00)
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 304.00 MB, (29622.23 / 49152.00)
llama server listening at http://127.0.0.1:8080
{"timestamp":1690494785,"level":"INFO","function":"main","line":1344,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
{"timestamp":1690494790,"level":"INFO","function":"log_server_request","line":1097,"message":"request","remote_addr":"127.0.0.1","remote_port":60293,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1690494791,"level":"INFO","function":"log_server_request","line":1097,"message":"request","remote_addr":"127.0.0.1","remote_port":60293,"status":200,"method":"GET","path":"/index.js","params":{}}
{"timestamp":1690494791,"level":"INFO","function":"log_server_request","line":1097,"message":"request","remote_addr":"127.0.0.1","remote_port":60294,"status":200,"method":"GET","path":"/completion.js","params":{}}
{"timestamp":1690494791,"level":"INFO","function":"log_server_request","line":1097,"message":"request","remote_addr":"127.0.0.1","remote_port":60293,"status":404,"method":"GET","path":"/favicon.ico","params":{}}
{"timestamp":1690494792,"level":"INFO","function":"log_server_request","line":1097,"message":"request","remote_addr":"127.0.0.1","remote_port":60293,"status":404,"method":"GET","path":"/favicon.ico","params":{}}
GGML_ASSERT: ggml-metal.m:721: ne02 == ne12
zsh: abort ./server -ngl 1 -t 9 -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -c
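The server starts fine and serves the web UI; the assert fires as soon as a completion is requested. For reference, a request like the following (a sketch; the endpoint and field names follow the server README) triggers the same abort:
curl -s -X POST http://127.0.0.1:8080/completion -H "Content-Type: application/json" -d '{"prompt": "Write a story about llamas", "n_predict": 64}'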
The 70B models only work when Metal is not used, i.e. when omitting -ngl 1.
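For example, the same command with the Metal offload flag removed completes without the assert, just slowly:
./main -m ./models/llama-2-70b-chat.ggmlv3.q2_K.bin -gqa 8 -t 9 -p "[INST] <<SYS>>You are a helpful assistant<</SYS>>Write a story about llamas[/INST]"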
The error only happens with the 70B models; the smaller 13B Llama 2 chat models work as expected.
Environment and Context
Running on my M1 Max MacBook Pro:
Model Name: MacBook Pro
Model Identifier: MacBookPro18,2
Model Number: MK233LL/A
Chip: Apple M1 Max
Total Number of Cores: 10 (8 performance and 2 efficiency)
Memory: 64 GB
System Firmware Version: 8419.60.44
OS Loader Version: 8419.60.44
Activation Lock Status: Enabled
llama.cpp was built with LLAMA_METAL=1 make.
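For reference, the build was done roughly as follows (a sketch of the standard Metal build steps from the README):
make clean
LLAMA_METAL=1 make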
uname -a
Darwin Johns-MacBook-Pro-2.local 22.2.0 Darwin Kernel Version 22.2.0: Fri Nov 11 02:03:51 PST 2022; root:xnu-8792.61.2~4/RELEASE_ARM64_T6000 arm64
python3 --version
Python 3.9.15
make --version
GNU Make 3.81
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
This program built for i386-apple-darwin11.3.0
g++ --version
Apple clang version 14.0.0 (clang-1400.0.29.202)
Target: arm64-apple-darwin22.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
git log | head -1
commit 7c529cede6e84054e77a3eceab31c53de7b2f55b