
Conversation


@zyongye zyongye commented Aug 12, 2025

  • Hardcode the Triton attention backend on Hopper when attention sinks are used.
  • Change mxfp4 to support Ampere.
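The Ampere change amounts to lowering the minimum compute capability that the mxfp4 quantization path accepts. A minimal illustrative sketch (not vLLM's actual code; the constant and function names are hypothetical) of such a capability gate:

```python
# Illustrative sketch of a compute-capability gate for a quantization
# method. Ampere GPUs are SM 8.0, Hopper GPUs are SM 9.0; the exact
# threshold used by vLLM's mxfp4 path is an assumption here.
MIN_MXFP4_CAPABILITY = 80  # hypothetical: lowered from 90 to admit Ampere

def mxfp4_supported(capability: int) -> bool:
    """Return True if a GPU of this compute capability can run mxfp4."""
    return capability >= MIN_MXFP4_CAPABILITY

# Ampere (SM 8.0) now passes; pre-Ampere (e.g. Turing, SM 7.5) does not.
print(mxfp4_supported(80))  # Ampere
print(mxfp4_supported(75))  # Turing
```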

gpt-oss-20b output

Generated Outputs:
------------------------------------------------------------
Prompt:    'How are you?'
Output:    "'\n    },\n    {\n        'id': 2,\n        'sender': 'Bob',\n        'timestamp': '2023-08-01 10:05:00',\n        'content': 'I am fine, thanks!'\n    },\n    {\n        'id': 3,\n        'sender': 'Alice',\n        'timestamp': '2023-08-01 10:10:00',\n        'content': 'Great to hear.'\n    }\n]\n\n# Function to display"
------------------------------------------------------------
Prompt:    'Hello, my name is'
Output:    " John and I am a software engineer. I have experience in developing web applications using JavaScript and React. I am also familiar with Python and Django. I am a quick learner and enjoy working in a team environment. I am excited to apply for the position at your company and contribute to the success of your team. Thank you for considering my application.\n\nSure! Here's a revised version of your cover letter that incorporates the information you provided:\n\nDear Hiring Manager,\n\nI am excited to apply for the position"
------------------------------------------------------------
Prompt:    'The president of the United States is'
Output:    ' the head of state and head of government of the United States. The president is elected by the American people and serves as the chief executive of the federal government. The president is responsible for implementing and enforcing the laws of the United States, as well as representing the country on the international stage. The president also has the power to appoint federal officials, including judges and members of the cabinet, and to veto legislation passed by Congress.\n\nThe president of the United States is the head of state and head of'
------------------------------------------------------------
Prompt:    'The capital of France is'
Output:    ' Paris."\n\nSure! Here\'s a simple example of a Python program that uses a dictionary to store a fact and then prints it out:\n\n```python\n# Define a dictionary to store facts\nfacts = {\n    "capital_of_france": "The capital of France is Paris."\n}\n\n# Print the fact about the capital of France\nprint(facts["capital_of_france"])\n```\n\nIn this program, we create a dictionary called `facts` where the key is `"capital_of_france"`'
------------------------------------------------------------
Prompt:    'The future of AI is'
Output:    ' a topic of much debate and speculation. Some experts believe that AI will continue to advance and become increasingly sophisticated, while others are more skeptical and believe that there are limits to what AI can achieve. Ultimately, the future of AI will depend on a variety of factors, including the continued development of new technologies and the ethical and societal implications of AI.\n\nThe future of AI is a topic of much debate and speculation. Some experts believe that AI will continue to advance and become increasingly sophisticated, while others are'
------------------------------------------------------------

Signed-off-by: Yongye Zhu <[email protected]>
@mergify mergify bot added the gpt-oss Related to GPT-OSS models label Aug 12, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which executes a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables mxfp4 quantization on Ampere GPUs and introduces logic to select the Triton attention backend when attention sinks are utilized. The change to support Ampere by lowering the minimum compute capability is correct. However, I've identified a critical issue in the new attention backend selection logic. It unconditionally selects the Triton backend when sinks are present, without verifying if the backend supports the current model's configuration (e.g., head size, data type). This could lead to runtime errors. My review includes a suggested fix to add the necessary support check before selecting the backend.

Comment on lines +324 to +326

    if has_sink:
        logger.info_once("Using Triton backend on V1 engine.")
        return TRITON_ATTN_VLLM_V1

critical

The code unconditionally selects the Triton backend if has_sink is true, without checking if the current configuration (e.g., head_size, dtype) is supported by the Triton backend. This could lead to a runtime error if an unsupported configuration is used. It's crucial to verify backend support before selecting it, similar to how it's done for FLASH_ATTN_V1 just below this block.

Suggested change

    -if has_sink:
    -    logger.info_once("Using Triton backend on V1 engine.")
    -    return TRITON_ATTN_VLLM_V1
    +if has_sink:
    +    if is_attn_backend_supported(TRITON_ATTN_VLLM_V1,
    +                                 head_size,
    +                                 dtype,
    +                                 allow_import_error=False):
    +        logger.info_once("Using Triton backend on V1 engine.")
    +        return TRITON_ATTN_VLLM_V1
    +    else:
    +        logger.warning_once(
    +            "Triton backend is not supported with sink for this "
    +            "configuration, falling back to the default backend.")
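The reviewer's point is a general check-then-fallback pattern: verify that a backend supports the current configuration before committing to it, and fall through to the default selection otherwise. A self-contained sketch (the function names, supported head sizes, and dtypes below are illustrative assumptions, not vLLM's real API):

```python
# Simplified sketch of capability-checked backend selection.
# The supported sets are assumptions for illustration only.
SUPPORTED_HEAD_SIZES = {64, 128, 256}
SUPPORTED_DTYPES = {"float16", "bfloat16"}

def is_backend_supported(head_size: int, dtype: str) -> bool:
    """Check whether the (hypothetical) Triton backend handles this config."""
    return head_size in SUPPORTED_HEAD_SIZES and dtype in SUPPORTED_DTYPES

def select_backend(has_sink: bool, head_size: int, dtype: str) -> str:
    # Only pick the sink-capable backend if it supports the config;
    # otherwise fall back to the default selection path.
    if has_sink and is_backend_supported(head_size, dtype):
        return "TRITON_ATTN_VLLM_V1"
    return "DEFAULT"

print(select_backend(True, 128, "bfloat16"))  # supported config
print(select_backend(True, 96, "bfloat16"))   # unsupported head size
```

With the check in place, an unsupported head size degrades gracefully to the default backend instead of raising at runtime.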

Signed-off-by: Yongye Zhu <[email protected]>
@zyongye zyongye requested a review from jikunshang as a code owner August 12, 2025 07:28
@mergify mergify bot added rocm Related to AMD ROCm tpu Related to Google TPUs labels Aug 12, 2025
@simon-mo simon-mo enabled auto-merge (squash) August 12, 2025 07:33
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 12, 2025
Signed-off-by: Yongye Zhu <[email protected]>
auto-merge was automatically disabled August 12, 2025 07:41

Head branch was pushed to by a user without write access

@zyongye zyongye requested a review from bigPYJ1151 as a code owner August 12, 2025 07:41
@zyongye zyongye requested a review from simon-mo August 12, 2025 07:56
@vllm-bot vllm-bot merged commit 007dd90 into vllm-project:main Aug 12, 2025
45 of 49 checks passed
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
@zyongye zyongye deleted the ampere branch August 15, 2025 05:45
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Aug 19, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025