[TPU] Add support for online w8a8 quantization #22425
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Code Review
This pull request adds support for w8a8 quantization on TPUs by introducing a 'dynamic' activation scheme. The changes correctly plumb this new option down to a new torch_xla operator. My review focuses on improving the error handling for cases where an incompatible version of torch_xla is installed, to provide a better user experience.
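For context on the scheme itself: 'dynamic' means activation scales are computed at runtime (per token) rather than calibrated offline. Below is a minimal conceptual sketch of a dynamic w8a8 matmul, not the torch_xla kernel this PR wires up; the function and variable names are illustrative.

import torch

def dynamic_w8a8_matmul(x: torch.Tensor, w_int8: torch.Tensor,
                        w_scale: torch.Tensor) -> torch.Tensor:
    """Conceptual dynamic w8a8 matmul (illustrative, not the fused kernel).

    x:       [tokens, in_features] float activations
    w_int8:  [out_features, in_features] offline-quantized int8 weights
    w_scale: [out_features] per-channel weight scales
    """
    # 'Dynamic' activation scheme: per-token scale computed at runtime.
    x_scale = (x.abs().amax(dim=-1, keepdim=True) / 127.0).clamp(min=1e-8)
    x_int8 = torch.clamp(torch.round(x / x_scale), -128, 127).to(torch.int8)
    # Integer matmul with int32 accumulation (a real kernel such as
    # torch_xla's quantized_matmul_int8 would fuse these steps on TPU).
    acc = torch.matmul(x_int8.to(torch.int32), w_int8.to(torch.int32).t())
    # Dequantize with both the per-token and per-channel scales.
    return acc.to(x.dtype) * x_scale * w_scale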
The current try...except block only catches an ImportError if torch_xla or the custom_kernel module is missing. However, it doesn't handle the case where torch_xla is installed but is an older version that doesn't have the quantized_matmul_int8 operator. In that scenario, an AttributeError will be raised later when the operator is called, which can be confusing for the user.

To provide a better user experience and a more informative error message, it's best to check for the operator's existence within the try block and catch both ImportError and AttributeError. This ensures that users with an incompatible torch_xla version get a clear message on how to resolve the issue.
try:
import torch_xla.experimental.custom_kernel # noqa: F401
# Eagerly check for the op to provide a better error message.
_ = torch.ops.xla.quantized_matmul_int8
except (ImportError, AttributeError) as err:
raise ImportError(
"torch_xla is not installed or is too old to support w8a8 "
"quantization. Please install/update torch_xla by following "
"the instructions at "
"https://docs.vllm.ai/en/latest/getting_started/tpu-installation.html " # noqa: E501
"to run vLLM on TPU.") from err
Thanks for the contribution, @kyuyeunk. Could you also add a test for this feature?
Force-pushed 053a8d5 to 5232090
Done. @yaochengji, can you take a look again?
Force-pushed 76bf91b to d4a94ac
Thanks for adding the test!
Could you move it to tests/v1/tpu, and then add the test to CI in .buildkite/scripts/hardware_ci/run-tpu-v1-test-part2.sh?
Also, could we check the response, at least to make sure it doesn't generate garbage?
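A sketch of what such a response check could look like (hedged: the test path, model, prompt, and the hf_overrides plumbing used to enable w8a8 are all assumptions, not necessarily what this PR ends up doing):

# Hypothetical tests/v1/tpu/test_w8a8_quant.py
from vllm import LLM, SamplingParams

def test_online_w8a8_output_is_not_garbage():
    # Assumed config plumbing; the PR description documents the
    # {'activation_scheme': 'dynamic'} option.
    llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct",  # placeholder model
              hf_overrides={"quantization_config": {
                  "quant_method": "tpu_int8",
                  "activation_scheme": "dynamic",
              }},
              max_model_len=256)
    out = llm.generate(["The capital of France is"],
                       SamplingParams(temperature=0.0, max_tokens=8))
    # Greedy decoding on a well-known prompt should mention "Paris";
    # a crude but effective "not garbage" check.
    assert "Paris" in out[0].outputs[0].text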
Force-pushed d4a94ac to bdcf43c
Done. And verified that it produces a correct response.
Force-pushed bdcf43c to ce9f7cc
@kyuyeunk Could you add the response check in the test code?
Force-pushed 47fe959 to ea92156
Do you have a suggestion for how to achieve this? I'm not sure what the best automated approach is to check whether an output is garbage or not.
Expand existing online w8a16 quantization to support w8a8
- Using `{'activation_scheme': 'dynamic'}` utilizes dynamic activation quantization
- Using `{'activation_scheme': 'none'}` utilizes weight-only quantization
Signed-off-by: Kyuyeun Kim <[email protected]>
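For illustration, enabling the two schemes from the Python API might look roughly like this (a sketch only; the hf_overrides plumbing and the tpu_int8 quant_method name are assumptions, and the model is a placeholder):

from vllm import LLM

# w8a8: int8 weights plus dynamic (runtime, per-token) activation quantization.
llm_w8a8 = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    hf_overrides={"quantization_config": {   # assumed config plumbing
        "quant_method": "tpu_int8",
        "activation_scheme": "dynamic",
    }},
)

# w8a16: weight-only quantization; activations stay in the original dtype.
llm_w8a16 = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    hf_overrides={"quantization_config": {
        "quant_method": "tpu_int8",
        "activation_scheme": "none",
    }},
)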
Force-pushed ea92156 to 9cb83fc
Done!
LGTM, thanks!