
Commit 2afd8d1

[Feat] Integrate GEMM kernels for better prefill performance
1 parent 2ba59db commit 2afd8d1

File tree

3 files changed: +17 -4 lines changed


3rdparty/llama.cpp

README.md

Lines changed: 16 additions & 2 deletions
@@ -36,6 +36,20 @@ In addition to providing a significant speedup, T-MAC can also match the same pe

> The throughputs of T-MAC are obtained without fast-aggregation. Users can toggle on fast-aggregation through `-fa` to achieve an additional speedup of 10%~20%.

+### Prefill Speedup
+
+> TODO: add more results
+
+We have compared the prefill throughput (input_len=256) of Llama-2-7b (W2) on Surface Laptop 7 against two baselines:
+
+- llama.cpp: llama.cpp's optimized dequant-based low-bit kernels
+- llama.cpp (OpenBLAS): llama.cpp's OpenBLAS backend
+
+| Model           | NUM_THREADS | Batch Size | T-MAC (tokens/sec) | llama.cpp (OpenBLAS) (tokens/sec) | llama.cpp (tokens/sec) |
+|-----------------|-------------|------------|:-------------------|:----------------------------------|:-----------------------|
+| llama-2-7b (W2) | 4           | 256        | 50.1               | 21.5                              | 12.0                   |
+| llama-2-7b (W2) | 8           | 256        | 94.4               | 37.7                              | 21.3                   |
+
## Kernel-level Speedup

Our GEMM kernels demonstrate superior performance over SOTA low-bit GEMM on CPU. The following figure shows the speedup compared to llama.cpp for llama-7b kernels during token generation (NUM_THREADS=1):
@@ -44,7 +58,7 @@ Our GEMM kernels demonstrate superior performance over SOTA low-bit GEMM on CPU.
> llama.cpp doesn't provide a 1-bit kernel implementation, but we can deduce it from the 2-bit one, as 1-bit won't bring additional speedup according to the 2/3/4-bit results.

-Although we haven't integrated multi-batch (N>1) GEMM into llama.cpp, T-MAC can achieve significant speedup due to reduced computational cost, which ensures superior performance on prompt evaluation and multi-batch token generation. The following figure shows the speedup compared to llama.cpp using the OpenBLAS backend (NUM_THREADS=1):
+T-MAC can achieve significant speedup for multi-batch (N>1) GEMM due to reduced computational cost, which ensures superior performance on prompt evaluation and multi-batch token generation. The following figure shows the speedup compared to llama.cpp using the OpenBLAS backend (NUM_THREADS=1):

![](assets/gemm.png)

@@ -305,7 +319,7 @@ Check logs/2024-07-15-17-10-11.log for inference output
We will soon:

- [x] Add `I4` format to simplify the deployment of 4-bit models.
-- [ ] Embed T-MAC GEMM kernels into llama.cpp to accelerate prefill/prompt.
+- [x] Embed T-MAC GEMM kernels into llama.cpp to accelerate prefill/prompt.
- [ ] Optimize for ARMv9 CPU with SME2 through LUTI4

## Techniques
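
At 4 threads, the new prefill path is roughly 4.2x faster than llama.cpp's dequant-based kernels and 2.3x faster than the OpenBLAS backend (about 4.4x and 2.5x at 8 threads). For context, such a comparison can in principle be driven by invoking llama.cpp's `main` binary the same way `tools/run_pipeline.py` constructs its command (see its diff below). A minimal sketch, assuming a locally built binary and a converted 2-bit model; both paths are placeholders, not files from this commit:

```python
import subprocess

LLAMA_MAIN = "3rdparty/llama.cpp/build/bin/main"  # assumed build output path
MODEL_PATH = "models/llama-2-7b-w2.gguf"          # hypothetical 2-bit model file

# Stand-in prompt; the table above uses input_len=256.
prompt = " ".join(["hello"] * 256)

for num_threads in (4, 8):  # the NUM_THREADS column above
    command = [
        LLAMA_MAIN,
        '-m', MODEL_PATH,
        '-n', '128',             # tokens to generate after the prompt
        '-t', str(num_threads),
        '-p', prompt,
        '-ngl', '0',             # CPU-only, matching tools/run_pipeline.py
        '-c', '2048',
        # append '-fa' to toggle fast-aggregation, per the note above
    ]
    # llama.cpp's timing summary reports "prompt eval time" in tokens per
    # second, which corresponds to the prefill throughput measured above.
    subprocess.run(command, check=True)
```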

tools/run_pipeline.py

Lines changed: 0 additions & 1 deletion
@@ -155,7 +155,6 @@ def run_inference():
         '-n', '128',
         '-t', f'{FLAGS.num_threads}',
         '-p', prompt,
-        '-b', '1',
         '-ngl', '0',
         '-c', '2048'
     ]
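
The removed `'-b', '1'` previously pinned llama.cpp's batch size to 1, so prompt tokens were evaluated one at a time (GEMV). Dropping it lets llama.cpp fall back to its default batch size, which, per this commit's intent, routes prefill through the newly integrated multi-batch GEMM kernels. A sketch of the before/after command lists, with placeholder `main_path` and `prompt`:

```python
# Illustration only: the flags mirror run_inference() above; the paths are
# placeholders, not values from this repository.
main_path = "3rdparty/llama.cpp/build/bin/main"
prompt = "an example prompt"

# Before this commit: '-b', '1' forced per-token prompt evaluation.
command_before = [main_path, '-n', '128', '-t', '4', '-p', prompt,
                  '-b', '1', '-ngl', '0', '-c', '2048']

# After this commit: no '-b' flag, so the default batch size applies and
# the prompt can be processed in batches large enough to use GEMM.
command_after = [main_path, '-n', '128', '-t', '4', '-p', prompt,
                 '-ngl', '0', '-c', '2048']
```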
