> The throughputs of T-MAC are obtained without fast-aggregation. Users can enable fast-aggregation with `-fa` for an additional speedup of 10%~20%.
### Prefill Speedup
> TODO: add more results
We have compared the prefill throughput (input_len=256) for Llama-2-7b (W2) on Surface Laptop 7 with two baselines:
Our GEMM kernels demonstrate superior performance over SOTA low-bit GEMM on CPU. The following figure shows the speedup compared to llama.cpp for llama-7b kernels during token generation (NUM_THREADS=1):
> llama.cpp doesn't provide a 1-bit kernel implementation, but we can deduce its performance from the 2-bit kernel, as 1-bit won't bring additional speedup according to the 2/3/4-bit results.
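The table-lookup idea behind these low-bit GEMM kernels can be sketched in plain Python. This is an illustrative toy under assumed conventions (group size 4, unsigned bit-planes), not T-MAC's optimized kernel; `lut_gemv_1bit` and `lut_gemv_2bit` are hypothetical names:

```python
import numpy as np

def lut_gemv_1bit(Wbits, x, g=4):
    """y = Wbits @ x with Wbits entries in {0, 1}, via per-group lookup tables.

    Wbits: (M, K) 0/1 matrix; x: (K,) activations; K divisible by g.
    """
    M, K = Wbits.shape
    ngroups = K // g
    # Precompute, for each group of g activations, all 2^g possible partial sums.
    luts = np.zeros((ngroups, 2 ** g))
    for gi in range(ngroups):
        xs = x[gi * g:(gi + 1) * g]
        for idx in range(2 ** g):
            bits = [(idx >> b) & 1 for b in range(g)]
            luts[gi, idx] = sum(b * v for b, v in zip(bits, xs))
    # Each weight row becomes ngroups table indices; the GEMV is gathers + adds,
    # with no multiplications in the inner loop.
    y = np.zeros(M)
    for m in range(M):
        for gi in range(ngroups):
            wbits = Wbits[m, gi * g:(gi + 1) * g]
            idx = sum(int(b) << i for i, b in enumerate(wbits))
            y[m] += luts[gi, idx]
    return y

def lut_gemv_2bit(W2, x, g=4):
    """2-bit weights in {0..3}: split into two bit-planes and reuse the 1-bit path."""
    lo = (W2 & 1).astype(np.int64)
    hi = ((W2 >> 1) & 1).astype(np.int64)
    return lut_gemv_1bit(lo, x, g) + 2.0 * lut_gemv_1bit(hi, x, g)
```

The bit-plane decomposition also shows why the 1-bit case can be deduced from the 2-bit one: a 1-bit kernel is a single plane of the same lookup procedure.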
T-MAC can achieve significant speedup for multi-batch (N>1) GEMM due to reduced computational cost, which ensures superior performance on prompt evaluation and multi-batch token generation. The following figures show the speedup compared to llama.cpp using the OpenBLAS backend (NUM_THREADS=1):
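The source of that reduced computational cost can be seen with a back-of-envelope operation count: table lookups replace per-element multiply-adds, so the dominant term becomes M·(K/g)·N gathers rather than M·K·N multiplies. This is an illustrative cost model under assumed conventions (group size `g=4`, one add per lookup), not T-MAC's own accounting; `op_counts` is a hypothetical helper:

```python
def op_counts(M, K, N, g=4):
    """Rough op counts for an M x K weight matrix times K x N activations.

    Returns (classic multiply-adds, LUT-based table builds + gathers).
    """
    gemm_madds = M * K * N               # one multiply-add per (m, k, n)
    lut_build = (K // g) * (2 ** g) * N  # enumerate 2^g partial sums per group, per column
    lut_gather = M * (K // g) * N        # one lookup + add per weight group, per column
    return gemm_madds, lut_build + lut_gather
```

For example, with M = K = 4096 and N = 8, the LUT path needs roughly 4x fewer operations than the dense multiply-add count, and the table-build term is amortized across all M weight rows.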
We will soon:
- [x] Add `I4` format to simplify the deployment of 4-bit models.
- [x] Embed T-MAC GEMM kernels into llama.cpp to accelerate prefill/prompt.
- [ ] Optimize for ARMv9 CPU with SME2 through LUTI4.