
Commit 6a09462

Update README.md
1 parent a86858c commit 6a09462

File tree

1 file changed: +19 -0 lines changed

README.md

Lines changed: 19 additions & 0 deletions
@@ -7,6 +7,8 @@

## News

- 07/27/2024 ✨: We've noted that T-MAC is even faster than the NPU in token generation speed on the latest Snapdragon X Elite chipset! Check [Compared to NPU](#compared-to-npu) for more details.

- 07/23/2024 🚀🚀: We've enabled the execution of any 2-bit quantized Llama model in GPTQ format via T-MAC! Test it using the pretrained models released by [EfficientQAT](https://github.com/OpenGVLab/EfficientQAT).

- 07/22/2024 🚀🚀: We've added native deployment support for Windows on ARM. T-MAC demonstrates a substantial 5x speedup on the Surface Laptop 7.

@@ -57,6 +59,23 @@ By replacing heavy fused-multiply-add instructions with table lookup instruction

> Data sampled with [powermetrics](https://www.unix.com/man-page/osx/1/powermetrics/).

### Compared to NPU

On the latest Snapdragon X Elite chipset, the CPU running T-MAC achieves better token-generation performance than the NPU running the Qualcomm Snapdragon Neural Processing Engine (NPE).

When deploying the llama-2-7b-4bit model on it, the NPU can only generate 10.4 tokens/sec (according to the data released [here](https://aihub.qualcomm.com/models/llama_v2_7b_chat_quantized)), while the CPU using T-MAC reaches 12.6 tokens/sec with two cores, and even up to 22 tokens/sec. Considering that T-MAC's computing performance improves linearly as the number of bits decreases (a scaling not observed on GPUs and NPUs based on dequantization), T-MAC can even match the NPU with a single-core CPU at 2 bits.

| Framework   | Model           | NUM_THREADS | Throughput (tokens/sec) |
|-------------|-----------------|-------------|:------------------------|
| T-MAC (CPU) | llama-2-7b (W4) | 2           | <b>12.6</b>             |
| T-MAC (CPU) | llama-2-7b (W4) | 4           | <b>18.7</b>             |
| T-MAC (CPU) | llama-2-7b (W2) | 1           | 9.3                     |
| T-MAC (CPU) | llama-2-7b (W2) | 4           | <b>28.4</b>             |
|             |                 |             |                         |
| NPE (NPU)   | llama-2-7b (W4) | -           | 10.4                    |

> For a fair comparison, we have aligned our settings with those of the NPU, including an input length of 1024 and an output length of 1024. Although Qualcomm deploys a 3.6 GB model, we deploy a slightly larger model of 3.7 GB, because our token embeddings remain unquantized.

### Compared to CUDA GPU

T-MAC achieves 2-bit mpGEMM performance comparable to that of the CUDA GPU on the Jetson AGX Orin. While the CUDA GPU outperforms the CPU in executing kernels other than mpGEMM, making the end-to-end performance of T-MAC (CPU) slightly slower, T-MAC delivers considerable savings in power and energy consumption.
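The ratios behind the NPU comparison above can be spot-checked directly from the table. The short Python snippet below is a minimal sketch that only re-does that arithmetic; the variable names are illustrative and are not part of T-MAC's codebase or API.

```python
# Spot-check of the NPU comparison, using only the throughput numbers
# from the table above (all values in tokens/sec).

w4_2t = 12.6   # T-MAC (CPU), llama-2-7b W4, 2 threads
w4_4t = 18.7   # T-MAC (CPU), llama-2-7b W4, 4 threads
w2_1t = 9.3    # T-MAC (CPU), llama-2-7b W2, 1 thread
w2_4t = 28.4   # T-MAC (CPU), llama-2-7b W2, 4 threads
npu_w4 = 10.4  # NPE (NPU), llama-2-7b W4

# Two CPU cores at 4 bits already exceed the NPU's 4-bit throughput.
print(f"W4, 2 threads vs. NPU (W4): {w4_2t / npu_w4:.2f}x")   # ~1.21x

# Halving the weight bits at the same thread count: the mpGEMM kernel scales
# with the bit width, but non-mpGEMM stages dilute the end-to-end gain.
print(f"W4 -> W2 speedup, 4 threads: {w2_4t / w4_4t:.2f}x")   # ~1.52x

# A single CPU core at 2 bits is within roughly 11% of the NPU at 4 bits.
print(f"W2, 1 thread vs. NPU (W4): {w2_1t / npu_w4:.2f}x")    # ~0.89x
```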
