Closed
Labels
feature request, unstale
Description
🚀 The feature, motivation and pitch
A Triton implementation that supports MoE layers quantized with GPTQ or AWQ was added in #12185.
It outperforms the current Marlin MoE kernel when there are many small experts, which is why I made it the default for AWQ and GPTQMarlin configs when num_experts > 32 (#13236).
We should also extend the use of this kernel to compressed-tensors models that use mixed precision.
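For illustration, the selection rule described above could be sketched roughly as follows. This is only a minimal sketch: `MoEKernel`, `select_moe_kernel`, and `TRITON_MOE_MIN_EXPERTS` are hypothetical names and not vLLM identifiers. Only the `num_experts > 32` threshold comes from this issue.

```python
from enum import Enum

# Threshold from the issue text: the Triton kernel is preferred when
# num_experts > 32. The constant name itself is hypothetical.
TRITON_MOE_MIN_EXPERTS = 32


class MoEKernel(Enum):
    """Hypothetical labels for the two quantized MoE kernel choices."""
    MARLIN = "marlin"
    TRITON = "triton"


def select_moe_kernel(num_experts: int) -> MoEKernel:
    """Pick a quantized MoE kernel based on the number of experts.

    Hypothetical helper, not part of vLLM: many small experts favor the
    Triton kernel, otherwise fall back to Marlin.
    """
    if num_experts > TRITON_MOE_MIN_EXPERTS:
        return MoEKernel.TRITON
    return MoEKernel.MARLIN
```

The request here would amount to applying the same kind of rule when the MoE layer comes from a mixed-precision compressed-tensors config, not only from AWQ or GPTQMarlin.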
Alternatives
No response
Additional context
No response