<p>T-MAC achieves a significant speedup with a single thread and needs far fewer CPU cores to reach the same throughput</p>
</h3>
> The throughputs of T-MAC are obtained without fast-aggregation. Users can toggle on fast-aggregation through `-fa` to achieve an additional speedup of 10%~20%.
The figure above shows that when the model size is increased to 7B-4bit, the multi-threading throughput of llama.cpp on Surface Laptop 7 becomes highly unstable due to thermal throttling under the *Better Performance* mode. This instability is not observed with T-MAC, as table lookup (LUT) is more energy-efficient than multiply-add operations. To establish a more solid baseline, we re-profile the performance under the *Best Performance* mode:
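To make the LUT idea concrete, here is a minimal, illustrative Python sketch of table-lookup matrix-vector multiplication for 1-bit (±1) weights. The group size `g` and the naive table construction are simplifications for exposition, not T-MAC's actual kernel implementation:

```python
import numpy as np

def lut_matvec_1bit(weight_bits, activations, g=4):
    """Compute y = W @ a for 1-bit weights (+1 if bit set, else -1)
    using table lookups instead of per-element multiply-adds."""
    M, K = weight_bits.shape
    assert K % g == 0
    num_groups = K // g

    # Precompute, for each group of g activations, the partial sum for
    # every possible 2^g sign pattern. This cost is paid once per
    # activation vector and amortized over all M weight rows.
    lut = np.zeros((num_groups, 2 ** g))
    for grp in range(num_groups):
        a = activations[grp * g:(grp + 1) * g]
        for pattern in range(2 ** g):
            signs = np.array([1.0 if (pattern >> i) & 1 else -1.0
                              for i in range(g)])
            lut[grp, pattern] = signs @ a

    # Each weight group now costs one table lookup + one add,
    # instead of g multiply-adds.
    out = np.zeros(M)
    for m in range(M):
        for grp in range(num_groups):
            bits = weight_bits[m, grp * g:(grp + 1) * g]
            idx = sum(int(b) << i for i, b in enumerate(bits))
            out[m] += lut[grp, idx]
    return out
```

The result matches a direct computation with ±1 weights (`(2 * W - 1) @ a`); the energy advantage comes from replacing multiplications with lookups and additions.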
<h3 align="center">
<img src="assets/e2e_threads_surface_max.png">
<p>The throughputs of T-MAC and llama.cpp both increase when CPU frequency is maximized</p>
</h3>
> However, in real-world situations, CPUs on edge devices can't maintain maximum frequency consistently. The performance of llama.cpp will degrade accordingly, as indicated by the results under the *Better Performance* mode.
### Prefill Speedup
Our GEMM kernels demonstrate superior performance over SOTA low-bit GEMM on CPU.
> llama.cpp doesn't provide a 1-bit kernel implementation, but we can deduce its performance from the 2-bit kernel, as 1-bit won't bring additional speedup according to the 2/3/4-bit results.
>
> Surface stands for Surface Book 3 in this section.
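A back-of-the-envelope operation count suggests why table lookup reduces computational cost relative to direct multiply-accumulate. The group size `g=4` and matrix shape below are chosen for illustration and are a simplification of T-MAC's actual kernels:

```python
def op_counts(M: int, K: int, g: int) -> tuple[int, int]:
    """Rough operation counts for an M x K low-bit GEMV (illustrative only)."""
    direct = M * K                   # one multiply-add per weight element
    lut_build = (2 ** g) * (K // g)  # naive table construction (additions)
    lut_use = M * (K // g)           # one lookup + add per weight group
    return direct, lut_build + lut_use

direct, lut = op_counts(M=4096, K=4096, g=4)
print(direct, lut, direct / lut)  # the LUT path needs roughly 4x fewer ops
```

The table-construction cost is paid once per activation vector and amortized over all rows, which is also why the advantage grows for multi-batch GEMM.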
T-MAC can achieve significant speedup for multi-batch (N>1) GEMM due to reduced computational cost, which ensures superior performance on prompt evaluation and multi-batch token generation. The following figures show the speedup compared to llama.cpp using the OpenBLAS backend (NUM_THREADS=1):
When deploying the llama-2-7b-4bit model on it, the NPU can only generate 10.4 tokens/sec.
| NPE (NPU) | llama-2-7b (W4) | - | 10.4 |
> For fair comparison, we have aligned our settings with those of the NPU, including an input length of 1024 and an output length of 1024. Although Qualcomm deploys a model of 3.6GB, we deploy a slightly larger model of 3.7GB, because our token embedding remains unquantized.
>
> By maximizing CPU frequency, T-MAC (CPU) can achieve even better results. Refer to the discussion in [End-2-End speedup](#end-2-end-speedup).