[gpt-oss] Enable gpt-oss on ampere #22714

zyongye · 2025-08-12T07:10:07Z

Hardcode to use triton attention on hopper when attention sink is not used.
change mxfp4 to support ampere

gpt-oss-20b output

Generated Outputs:
------------------------------------------------------------
Prompt:    'How are you?'
Output:    "'\n    },\n    {\n        'id': 2,\n        'sender': 'Bob',\n        'timestamp': '2023-08-01 10:05:00',\n        'content': 'I am fine, thanks!'\n    },\n    {\n        'id': 3,\n        'sender': 'Alice',\n        'timestamp': '2023-08-01 10:10:00',\n        'content': 'Great to hear.'\n    }\n]\n\n# Function to display"
------------------------------------------------------------
Prompt:    'Hello, my name is'
Output:    " John and I am a software engineer. I have experience in developing web applications using JavaScript and React. I am also familiar with Python and Django. I am a quick learner and enjoy working in a team environment. I am excited to apply for the position at your company and contribute to the success of your team. Thank you for considering my application.\n\nSure! Here's a revised version of your cover letter that incorporates the information you provided:\n\nDear Hiring Manager,\n\nI am excited to apply for the position"
------------------------------------------------------------
Prompt:    'The president of the United States is'
Output:    ' the head of state and head of government of the United States. The president is elected by the American people and serves as the chief executive of the federal government. The president is responsible for implementing and enforcing the laws of the United States, as well as representing the country on the international stage. The president also has the power to appoint federal officials, including judges and members of the cabinet, and to veto legislation passed by Congress.\n\nThe president of the United States is the head of state and head of'
------------------------------------------------------------
Prompt:    'The capital of France is'
Output:    ' Paris."\n\nSure! Here\'s a simple example of a Python program that uses a dictionary to store a fact and then prints it out:\n\n```python\n# Define a dictionary to store facts\nfacts = {\n    "capital_of_france": "The capital of France is Paris."\n}\n\n# Print the fact about the capital of France\nprint(facts["capital_of_france"])\n```\n\nIn this program, we create a dictionary called `facts` where the key is `"capital_of_france"`'
------------------------------------------------------------
Prompt:    'The future of AI is'
Output:    ' a topic of much debate and speculation. Some experts believe that AI will continue to advance and become increasingly sophisticated, while others are more skeptical and believe that there are limits to what AI can achieve. Ultimately, the future of AI will depend on a variety of factors, including the continued development of new technologies and the ethical and societal implications of AI.\n\nThe future of AI is a topic of much debate and speculation. Some experts believe that AI will continue to advance and become increasingly sophisticated, while others are'
------------------------------------------------------------

Signed-off-by: Yongye Zhu <[email protected]>

github-actions · 2025-08-12T07:11:02Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Code Review

This pull request enables mxfp4 quantization on Ampere GPUs and introduces logic to select the Triton attention backend when attention sinks are utilized. The change to support Ampere by lowering the minimum compute capability is correct. However, I've identified a critical issue in the new attention backend selection logic. It unconditionally selects the Triton backend when sinks are present, without verifying if the backend supports the current model's configuration (e.g., head size, data type). This could lead to runtime errors. My review includes a suggested fix to add the necessary support check before selecting the backend.

gemini-code-assist · 2025-08-12T07:11:14Z

vllm/platforms/cuda.py

+                if has_sink:
+                    logger.info_once("Using Triton backend on V1 engine.")
+                    return TRITON_ATTN_VLLM_V1


The code unconditionally selects the Triton backend if has_sink is true, without checking if the current configuration (e.g., head_size, dtype) is supported by the Triton backend. This could lead to a runtime error if an unsupported configuration is used. It's crucial to verify backend support before selecting it, similar to how it's done for FLASH_ATTN_V1 just below this block.

Suggested change

if has_sink:

logger.info_once("Using Triton backend on V1 engine.")

return TRITON_ATTN_VLLM_V1

if has_sink:

if is_attn_backend_supported(TRITON_ATTN_VLLM_V1,

head_size,

dtype,

allow_import_error=False):

logger.info_once("Using Triton backend on V1 engine.")

return TRITON_ATTN_VLLM_V1

else:

logger.warning_once(

"Triton backend is not supported with sink for this "

"configuration, falling back to the default backend."

)

Signed-off-by: Yongye Zhu <[email protected]>

Signed-off-by: Yongye Zhu <[email protected]> Signed-off-by: Paul Pak <[email protected]>

Signed-off-by: Yongye Zhu <[email protected]> Signed-off-by: Diego-Castan <[email protected]>

Signed-off-by: Yongye Zhu <[email protected]>

Signed-off-by: Yongye Zhu <[email protected]> Signed-off-by: Xiao Yu <[email protected]>

Signed-off-by: Yongye Zhu <[email protected]>

fix ampere

6c6e9db

Signed-off-by: Yongye Zhu <[email protected]>

zyongye requested review from mgoin, robertgshaw2-redhat and tlrmchlsmth as code owners August 12, 2025 07:10

mergify bot added the gpt-oss Related to GPT-OSS models label Aug 12, 2025

gemini-code-assist bot reviewed Aug 12, 2025

View reviewed changes

fix interface

ec4a650

Signed-off-by: Yongye Zhu <[email protected]>

zyongye requested a review from jikunshang as a code owner August 12, 2025 07:28

mergify bot added rocm Related to AMD ROCm tpu Related to Google TPUs labels Aug 12, 2025

simon-mo approved these changes Aug 12, 2025

View reviewed changes

simon-mo enabled auto-merge (squash) August 12, 2025 07:33

github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 12, 2025

add cpu

ea1fd24

Signed-off-by: Yongye Zhu <[email protected]>

auto-merge was automatically disabled August 12, 2025 07:41
Head branch was pushed to by a user without write access

zyongye requested a review from bigPYJ1151 as a code owner August 12, 2025 07:41

zyongye requested a review from simon-mo August 12, 2025 07:56

vllm-bot merged commit 007dd90 into vllm-project:main Aug 12, 2025
45 of 49 checks passed

paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025

[gpt-oss] Enable gpt-oss on ampere (vllm-project#22714)

50620e4

Signed-off-by: Yongye Zhu <[email protected]> Signed-off-by: Paul Pak <[email protected]>

zyongye deleted the ampere branch August 15, 2025 05:45

diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025

[gpt-oss] Enable gpt-oss on ampere (vllm-project#22714)

5de09f4

Signed-off-by: Yongye Zhu <[email protected]> Signed-off-by: Diego-Castan <[email protected]>

yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Aug 19, 2025

[gpt-oss] Enable gpt-oss on ampere (vllm-project#22714)

63a5b82

Signed-off-by: Yongye Zhu <[email protected]>

epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025

[gpt-oss] Enable gpt-oss on ampere (vllm-project#22714)

da6627f

Signed-off-by: Yongye Zhu <[email protected]>

xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025

[gpt-oss] Enable gpt-oss on ampere (vllm-project#22714)

2205eba

Signed-off-by: Yongye Zhu <[email protected]> Signed-off-by: Xiao Yu <[email protected]>

zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025

[gpt-oss] Enable gpt-oss on ampere (vllm-project#22714)

abe7e5c

Signed-off-by: Yongye Zhu <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[gpt-oss] Enable gpt-oss on ampere #22714

[gpt-oss] Enable gpt-oss on ampere #22714

Uh oh!

zyongye commented Aug 12, 2025 •

edited by github-actions bot

Loading

Uh oh!

github-actions bot commented Aug 12, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Aug 12, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

[gpt-oss] Enable gpt-oss on ampere #22714

[gpt-oss] Enable gpt-oss on ampere #22714

Uh oh!

Conversation

zyongye commented Aug 12, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Aug 12, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

zyongye commented Aug 12, 2025 •

edited by github-actions bot

Loading