Conversation

@chaojun-zhang (Contributor) commented on Jul 28, 2025

Test Plan

XPU_USE_TRITION_KERNEL=0 VLLM_USE_V1=1 CCL_ZE_IPC_EXCHANGE=drmfd \
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 VLLM_WORKER_MULTIPROC_METHOD=spawn \
python3 benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-2-7b-hf \
  --backend vllm \
  --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1000 \
  --max-loras 1 \
  --max-lora-rank 8 \
  --enable-lora \
  --lora-path "yard1/llama-2-7b-sql-lora-test" \
  --enforce_eager

  1. XPU kernel (XPU_USE_TRITION_KERNEL=0)

Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 1147.75it/s]
Processed prompts: 74%|████████████████████████████████████████████████████████████████████████████████████████▋ | 745/1000 [02:13<00:36, 6.92it/s, est. speed input: 1484.93 toks/s, output: 987.98 toks/s]WARNING 07-27 20:30:06 [_logger.py:68] Encountered invalid prefix detokenization error for request 623, resetting decode stream.
Processed prompts: 86%|█████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 861/1000 [02:27<00:13, 10.22it/s, est. speed input: 1523.34 toks/s, output: 1096.47 toks/s]WARNING 07-27 20:30:20 [_logger.py:68] Encountered invalid prefix detokenization error for request 623, resetting decode stream.
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [03:08<00:00, 5.31it/s, est. speed input: 1306.31 toks/s, output: 1249.40 toks/s]
Throughput: 5.29 requests/s, 2543.90 total tokens/s, 1243.62 output tokens/s
Total num prompt tokens: 245995
Total num output tokens: 235278

  2. Triton kernel (XPU_USE_TRITION_KERNEL=1)

Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 1356.26it/s]
Processed prompts: 38%|█████████████████████████████████████████████▋ | 384/1000 [01:20<02:13, 4.62it/s, est. speed input: 1260.77 toks/s, output: 625.78 toks/s]WARNING 07-27 20:33:58 [_logger.py:68] Encountered invalid prefix detokenization error for request 183, resetting decode stream.
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [03:50<00:00, 4.34it/s, est. speed input: 1067.74 toks/s, output: 1021.23 toks/s]
Throughput: 4.33 requests/s, 2082.29 total tokens/s, 1017.96 output tokens/s
Total num prompt tokens: 245995
Total num output tokens: 235278

The XPU kernel achieved a 1.22x throughput improvement over the Triton kernel (2543.90 vs. 2082.29 total tokens/s; 2543.90 / 2082.29 ≈ 1.22).

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a new Punica wrapper for XPU using IPEX kernels, which shows significant performance improvement. The implementation looks solid, but I've found a few issues that need to be addressed. There's a critical typo in an environment variable name that would prevent it from working as intended. Additionally, there are a couple of high-severity issues related to deprecated API usage and a return type violation that should be fixed for code quality and future compatibility.


critical

There's a typo in the environment variable name. It should be XPU_USE_TRITON_KERNEL instead of XPU_USE_TRITION_KERNEL. This typo will prevent users from correctly enabling the Triton kernel path.

Suggested change
xpu_use_triton_kernel = os.getenv("XPU_USE_TRITION_KERNEL", "0") == "1"
xpu_use_triton_kernel = os.getenv("XPU_USE_TRITON_KERNEL", "0") == "1"


high

The msg attribute of ImportError is deprecated since Python 3.10 and will be removed in Python 3.12. Use e.args[0] to access the error message for better future compatibility.

Suggested change
logger.warning("Import error msg: %s", e.msg)
logger.warning("Import error msg: %s", e.args[0])


high

This function is type-hinted to return None, but it returns a value. The operations on y are performed in-place, so this return statement is unnecessary and contradicts the function's signature. Remove this line.
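
To illustrate the flagged pattern, here is a minimal hypothetical sketch (the function and tensor names are invented for illustration, not the PR's actual code): the tensor is mutated in place, so the trailing return is redundant and contradicts the -> None annotation.

import torch

def scale_inplace(y: torch.Tensor, factor: float) -> None:
    # Mutates the caller's tensor in place; nothing needs to be returned.
    y.mul_(factor)
    # return y  # <- the flagged pattern: contradicts "-> None"; delete this line

y = torch.ones(4)
scale_inplace(y, 2.0)
print(y)  # tensor([2., 2., 2., 2.]) -- the update is visible without a return value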


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; they only run the fastcheck CI, which covers a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

jeejeelee self-assigned this on Jul 28, 2025
chaojun-zhang marked this pull request as ready for review on Jul 28, 2025 05:35
chaojun-zhang changed the title from "[XPU] xpu punica wrapper with ipex kernel" to "[XPU] IPEX-optimized Punica Wrapper on XPU" on Jul 28, 2025
Contributor

Maybe we keep this env in our internal code, but for upstream let's just use the IPEX kernel path.

Contributor Author

Currently, my internal tests show that the Triton kernel outperforms the XPU kernel in torch.compile mode. Could we keep this option so the Triton path can still be used with torch.compile?

Collaborator

I'd prefer naming it XPU_PUNICA_USE_TRITON.
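
For reference, a one-line sketch of reading the renamed flag (a hypothetical naming per this suggestion; per the reply below, the env var was ultimately dropped from the upstream PR):

import os

# Hypothetical rename; unset or "0" keeps the default IPEX kernel path.
xpu_punica_use_triton = os.getenv("XPU_PUNICA_USE_TRITON", "0") == "1"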

Contributor Author

Removed this env now; it stays in our internal code only.

@jeejeelee (Collaborator) left a comment

LGTM, please address the pre-commit failure

chaojun-zhang force-pushed the punica_wrapper_upstream branch from c7eba43 to 76ff0c2 on July 28, 2025 11:06
@xuechendi (Contributor)

@DarkLight1337 @mgoin, may we get a quick review on this PR?

DarkLight1337 enabled auto-merge (squash) on July 28, 2025 14:30
github-actions bot added the ready label on Jul 28, 2025
DarkLight1337 merged commit ec261b0 into vllm-project:main on Jul 28, 2025
77 checks passed
liuyumoye, HsChen-sys, x22x22, Pradyun92, npanpaliya, jinzhen-lin, paulpak58, diegocastanibm, epwalsh, and zhewenl later referenced this pull request in commits to downstream repositories.
Labels: ready
6 participants