Contributor

@kyuyeunk commented Aug 7, 2025

Expand existing online w8a16 quantization to support w8a8

  • Using `{'activation_scheme': 'dynamic'}` enables dynamic activation quantization
  • Using `{'activation_scheme': 'none'}` enables weight-only quantization
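For illustration, here is a minimal NumPy sketch of the difference between the two schemes. This is not vLLM's TPU kernel (the real path uses a torch_xla operator); the shapes and helper are made up to show the semantics: weight-only (w8a16) dequantizes int8 weights and multiplies in float, while dynamic w8a8 also quantizes activations per token at runtime and multiplies in integer arithmetic.

```python
import numpy as np

def quantize_int8(x, axis):
    # Symmetric int8 quantization along the given axis.
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)).astype(np.float32)   # activations (tokens x hidden)
w = rng.standard_normal((16, 8)).astype(np.float32)   # weights (hidden x out)

# Weights are quantized once, per output channel.
w_q, w_scale = quantize_int8(w, axis=0)

# w8a16 (weight-only): dequantize the weights, multiply in float.
y_w8a16 = x @ (w_q.astype(np.float32) * w_scale)

# w8a8 (dynamic): quantize activations per token at runtime,
# multiply in int32, then rescale back to float.
x_q, x_scale = quantize_int8(x, axis=1)
y_w8a8 = (x_q.astype(np.int32) @ w_q.astype(np.int32)).astype(np.float32) \
         * x_scale * w_scale

err = np.abs(y_w8a8 - x @ w).max()
```

The dynamic scheme trades a small runtime cost (computing per-token activation scales) for integer matmuls, which is what makes the w8a8 path attractive on TPU.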


github-actions bot commented Aug 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify bot added the tpu (Related to Google TPUs) label Aug 7, 2025
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request adds support for w8a8 quantization on TPUs by introducing a 'dynamic' activation scheme. The changes correctly plumb this new option down to a new torch_xla operator. My review focuses on improving the error handling for cases where an incompatible version of torch_xla is installed, to provide a better user experience.

Comment on lines 112 to 118
Contributor


Severity: high

The current try...except block only catches an ImportError if torch_xla or the custom_kernel module is missing. However, it doesn't handle the case where torch_xla is installed but is an older version that doesn't have the quantized_matmul_int8 operator. In that scenario, an AttributeError will be raised later when the operator is called, which can be confusing for the user.

To provide a better user experience and a more informative error message, it's best to check for the operator's existence within the try block and catch both ImportError and AttributeError. This ensures that users with an incompatible torch_xla version get a clear message on how to resolve the issue.

        try:
            import torch_xla.experimental.custom_kernel  # noqa: F401
            # Eagerly check for the op to provide a better error message.
            _ = torch.ops.xla.quantized_matmul_int8
        except (ImportError, AttributeError) as err:
            raise ImportError(
                "torch_xla is not installed or is too old to support w8a8 "
                "quantization. Please install/update torch_xla by following "
                "the instructions at "
                "https://docs.vllm.ai/en/latest/getting_started/tpu-installation.html "  # noqa: E501
                "to run vLLM on TPU.") from err

Collaborator

@yaochengji left a comment


Thanks for the contribution, @kyuyeunk .

Could you also add a test for this feature?

@kyuyeunk force-pushed the online_w8a8_tpu branch 2 times, most recently from 053a8d5 to 5232090 on August 7, 2025 at 22:32
Contributor Author

@kyuyeunk commented Aug 7, 2025

> Thanks for the contribution, @kyuyeunk .
>
> Could you also add a test for this feature?

Done. @yaochengji, can you take a look again?

@kyuyeunk force-pushed the online_w8a8_tpu branch 2 times, most recently from 76bf91b to d4a94ac on August 8, 2025 at 16:30
Collaborator

@yaochengji left a comment


Thanks for adding the test!

Could you move it to tests/v1/tpu?
And then add the test to CI in .buildkite/scripts/hardware_ci/run-tpu-v1-test-part2.sh?

Could we also check the response, at least to make sure it doesn't generate garbage?

Contributor Author

@kyuyeunk commented Aug 8, 2025

> Thanks for adding the test!
>
> Could move it to tests/v1/tpu? And then add the test in CI .buildkite/scripts/hardware_ci/run-tpu-v1-test-part2.sh?
>
> Also could we also check the response, at least to make sure it doesn't generate garbage?

Done. And verified that it produces correct response.

  • prompt: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\n'
  • response: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nThe core of the engine is a distributed memory system'
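As a hypothetical illustration only (not the check that was actually added in this PR), an automated "is this garbage?" test could combine a prompt-echo check with simple text heuristics on the continuation:

```python
def looks_sane(prompt: str, response: str) -> bool:
    """Heuristic check that a generated continuation is not garbage."""
    # The engine echoes the prompt, so the response must extend it.
    if not response.startswith(prompt):
        return False
    continuation = response[len(prompt):].strip()
    # A healthy continuation has several words and is mostly printable text.
    words = continuation.split()
    printable = sum(c.isprintable() for c in continuation)
    return len(words) >= 3 and printable >= 0.9 * len(continuation)

prompt = ("vLLM is a high-throughput and memory-efficient inference and "
          "serving engine for LLMs.\n")
response = prompt + "The core of the engine is a distributed memory system"
```

Heuristics like this catch gross quantization failures (repeated bytes, mojibake) without requiring an exact reference output; a stricter alternative is comparing logprobs against an unquantized run.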

Collaborator

@yaochengji

> prompt: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\n'
> response: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nThe core of the engine is a distributed memory system'

@kyuyeunk Could you add the response check in the test code?

@kyuyeunk force-pushed the online_w8a8_tpu branch 2 times, most recently from 47fe959 to ea92156 on August 8, 2025 at 23:37
Contributor Author

@kyuyeunk commented Aug 9, 2025

> prompt: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\n'
> response: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nThe core of the engine is a distributed memory system'
>
> @kyuyeunk Could you add the response check in the test code?

Do you have a suggestion for how to achieve this? I'm not sure what the best automated approach is to check whether an output is garbage.

Expand existing online w8a16 quantization to support w8a8

- Using `{'activation_scheme': 'dynamic'}` utilizes dynamic activation quantization
- Using `{'activation_scheme': 'none'}` utilizes weight only quantization

Signed-off-by: Kyuyeun Kim <[email protected]>
Contributor Author

@kyuyeunk commented Aug 9, 2025

> prompt: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\n'
> response: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nThe core of the engine is a distributed memory system'
>
> @kyuyeunk Could you add the response check in the test code?
>
> Do you have a suggested to achieve this? Not sure what is the best automated approach to check if an output is garbage or not.

Done!

Collaborator

@yaochengji left a comment


LGTM, thanks!

@yaochengji enabled auto-merge (squash) on August 9, 2025 at 01:05
@github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 9, 2025
@vllm-bot merged commit 9a0c5de into vllm-project:main Aug 9, 2025
49 of 56 checks passed
@kyuyeunk deleted the online_w8a8_tpu branch on August 9, 2025 at 06:16
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Aug 19, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
Labels
ci/build, ready (ONLY add when PR is ready to merge/full CI is needed), tpu (Related to Google TPUs), v1