Contributor

@kyuyeunk commented Aug 7, 2025

Expand existing online w8a16 quantization to support w8a8

  • Using `{'activation_scheme': 'dynamic'}` enables dynamic activation quantization
  • Using `{'activation_scheme': 'none'}` enables weight-only quantization
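For illustration, here is a minimal NumPy sketch of the difference between the two schemes. This is not vLLM's TPU kernel (the real path uses a torch_xla operator); the shapes and helper are made up to show the semantics: weight-only (w8a16) dequantizes int8 weights and multiplies in float, while dynamic w8a8 also quantizes activations per token at runtime and multiplies in integer arithmetic.

```python
import numpy as np

def quantize_int8(x, axis):
    # Symmetric int8 quantization along the given axis.
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)).astype(np.float32)   # activations (tokens x hidden)
w = rng.standard_normal((16, 8)).astype(np.float32)   # weights (hidden x out)

# Weights are quantized once, per output channel.
w_q, w_scale = quantize_int8(w, axis=0)

# w8a16 (weight-only): dequantize the weights, multiply in float.
y_w8a16 = x @ (w_q.astype(np.float32) * w_scale)

# w8a8 (dynamic): quantize activations per token at runtime,
# multiply in int32, then rescale back to float.
x_q, x_scale = quantize_int8(x, axis=1)
y_w8a8 = (x_q.astype(np.int32) @ w_q.astype(np.int32)).astype(np.float32) \
         * x_scale * w_scale

err = np.abs(y_w8a8 - x @ w).max()
```

The dynamic scheme trades a small runtime cost (computing per-token activation scales) for integer matmuls, which is what makes the w8a8 path attractive on TPU.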


github-actions bot commented Aug 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify bot added the tpu (Related to Google TPUs) label Aug 7, 2025
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request adds support for w8a8 quantization on TPUs by introducing a 'dynamic' activation scheme. The changes correctly plumb this new option down to a new torch_xla operator. My review focuses on improving the error handling for cases where an incompatible version of torch_xla is installed, to provide a better user experience.

Comment on lines 112 to 118
Contributor


Severity: high

The current try...except block only catches an ImportError if torch_xla or the custom_kernel module is missing. However, it doesn't handle the case where torch_xla is installed but is an older version that doesn't have the quantized_matmul_int8 operator. In that scenario, an AttributeError will be raised later when the operator is called, which can be confusing for the user.

To provide a better user experience and a more informative error message, it's best to check for the operator's existence within the try block and catch both ImportError and AttributeError. This ensures that users with an incompatible torch_xla version get a clear message on how to resolve the issue.

        try:
            import torch_xla.experimental.custom_kernel  # noqa: F401
            # Eagerly check for the op to provide a better error message.
            _ = torch.ops.xla.quantized_matmul_int8
        except (ImportError, AttributeError) as err:
            raise ImportError(
                "torch_xla is not installed or is too old to support w8a8 "
                "quantization. Please install/update torch_xla by following "
                "the instructions at "
                "https://docs.vllm.ai/en/latest/getting_started/tpu-installation.html "  # noqa: E501
                "to run vLLM on TPU.") from err

Collaborator

@yaochengji left a comment


Thanks for the contribution, @kyuyeunk .

Could you also add a test for this feature?

@kyuyeunk force-pushed the online_w8a8_tpu branch 2 times, most recently from 053a8d5 to 5232090 on August 7, 2025 at 22:32
Contributor Author

@kyuyeunk commented Aug 7, 2025

> Thanks for the contribution, @kyuyeunk .
>
> Could you also add a test for this feature?

Done. @yaochengji, can you take a look again?

@kyuyeunk force-pushed the online_w8a8_tpu branch 2 times, most recently from 76bf91b to d4a94ac on August 8, 2025 at 16:30
Collaborator

@yaochengji left a comment


Thanks for adding the test!

Could you move it to tests/v1/tpu?
And then add the test to CI in .buildkite/scripts/hardware_ci/run-tpu-v1-test-part2.sh?

Could we also check the response, at least to make sure it doesn't generate garbage?

Contributor Author

@kyuyeunk commented Aug 8, 2025

> Thanks for adding the test!
>
> Could move it to tests/v1/tpu? And then add the test in CI .buildkite/scripts/hardware_ci/run-tpu-v1-test-part2.sh?
>
> Also could we also check the response, at least to make sure it doesn't generate garbage?

Done. And verified that it produces correct response.

  • prompt: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\n'
  • response: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nThe core of the engine is a distributed memory system'
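As a hypothetical illustration only (not the check that was actually added in this PR), an automated "is this garbage?" test could combine a prompt-echo check with simple text heuristics on the continuation:

```python
def looks_sane(prompt: str, response: str) -> bool:
    """Heuristic check that a generated continuation is not garbage."""
    # The engine echoes the prompt, so the response must extend it.
    if not response.startswith(prompt):
        return False
    continuation = response[len(prompt):].strip()
    # A healthy continuation has several words and is mostly printable text.
    words = continuation.split()
    printable = sum(c.isprintable() for c in continuation)
    return len(words) >= 3 and printable >= 0.9 * len(continuation)

prompt = ("vLLM is a high-throughput and memory-efficient inference and "
          "serving engine for LLMs.\n")
response = prompt + "The core of the engine is a distributed memory system"
```

Heuristics like this catch gross quantization failures (repeated bytes, mojibake) without requiring an exact reference output; a stricter alternative is comparing logprobs against an unquantized run.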

Collaborator

@yaochengji

> prompt: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\n'
> response: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nThe core of the engine is a distributed memory system'

@kyuyeunk Could you add the response check in the test code?

@kyuyeunk force-pushed the online_w8a8_tpu branch 2 times, most recently from 47fe959 to ea92156 on August 8, 2025 at 23:37
Contributor Author

@kyuyeunk commented Aug 9, 2025

> prompt: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\n'
> response: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nThe core of the engine is a distributed memory system'
>
> @kyuyeunk Could you add the response check in the test code?

Do you have a suggestion for how to achieve this? I'm not sure what the best automated approach is to check whether an output is garbage.

Expand existing online w8a16 quantization to support w8a8

- Using `{'activation_scheme': 'dynamic'}` utilizes dynamic activation quantization
- Using `{'activation_scheme': 'none'}` utilizes weight only quantization

Signed-off-by: Kyuyeun Kim <[email protected]>
Contributor Author

@kyuyeunk commented Aug 9, 2025

> prompt: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\n'
> response: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nThe core of the engine is a distributed memory system'
>
> @kyuyeunk Could you add the response check in the test code?
>
> Do you have a suggested to achieve this? Not sure what is the best automated approach to check if an output is garbage or not.

Done!

Collaborator

@yaochengji left a comment


LGTM, thanks!

@yaochengji enabled auto-merge (squash) on August 9, 2025 at 01:05
@github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 9, 2025
@vllm-bot merged commit 9a0c5de into vllm-project:main Aug 9, 2025
49 of 56 checks passed
@kyuyeunk deleted the online_w8a8_tpu branch on August 9, 2025 at 06:16
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Aug 19, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
Labels
ci/build, ready (ONLY add when PR is ready to merge/full CI is needed), tpu (Related to Google TPUs), v1