[XPU] IPEX-optimized Punica Wrapper on XPU #21703
Conversation
Code Review
This pull request introduces a new Punica wrapper for XPU using IPEX kernels, which shows a significant performance improvement. The implementation looks solid, but I've found a few issues that need to be addressed. There's a critical typo in an environment variable name that would prevent it from working as intended. Additionally, there are a couple of high-severity issues related to deprecated API usage and a return type violation that should be fixed for code quality and future compatibility.
vllm/platforms/xpu.py
There's a typo in the environment variable name. It should be XPU_USE_TRITON_KERNEL instead of XPU_USE_TRITION_KERNEL. This typo will prevent users from correctly enabling the Triton kernel path.
xpu_use_triton_kernel = os.getenv("XPU_USE_TRITION_KERNEL", "0") == "1" | |
xpu_use_triton_kernel = os.getenv("XPU_USE_TRITON_KERNEL", "0") == "1" |
vllm/platforms/xpu.py
Maybe we keep this env in our internal code, but for upstream let's just use the IPEX kernel path.
Currently, my internal tests show that the Triton kernel performs better than the XPU kernel in torch.compile mode. Could we keep this option so the Triton path can still be used with torch.compile?
I'd prefer naming it XPU_PUNICA_USE_TRITON.
Removed this env for now; we'll keep it only in our internal code.
LGTM, please address the pre-commit failure
@DarkLight1337 @mgoin, may we get a quick review on this PR?
Test Plan
XPU_USE_TRITION_KERNEL=0 VLLM_USE_V1=1 CCL_ZE_IPC_EXCHANGE=drmfd VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 VLLM_WORKER_MULTIPROC_METHOD=spawn python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --max-loras 1 --max-lora-rank 8 --enable-lora --lora-path "yard1/llama-2-7b-sql-lora-test" --enforce_eager
XPU (IPEX) kernel run:
Processed prompts: 100% | 1000/1000 [03:08<00:00, 5.31it/s, est. speed input: 1306.31 toks/s, output: 1249.40 toks/s]
Throughput: 5.29 requests/s, 2543.90 total tokens/s, 1243.62 output tokens/s
Total num prompt tokens: 245995
Total num output tokens: 235278

Triton kernel run:
Processed prompts: 100% | 1000/1000 [03:50<00:00, 4.34it/s, est. speed input: 1067.74 toks/s, output: 1021.23 toks/s]
Throughput: 4.33 requests/s, 2082.29 total tokens/s, 1017.96 output tokens/s
Total num prompt tokens: 245995
Total num output tokens: 235278

(Both runs logged a few "Encountered invalid prefix detokenization error ... resetting decode stream" warnings.)
The IPEX-based XPU kernel achieved a 1.22x throughput improvement over the Triton kernel (from 2082.29 to 2543.90 total tokens/s).
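For context on what the accelerated kernels compute, here is a plain-PyTorch reference of the Punica bgmv_shrink semantics: gather a per-token LoRA A matrix and project the hidden state down to the LoRA rank. The argument names are illustrative; this is not the IPEX call added in vllm/lora/ops/ipex_ops/lora_ops.py.

```python
# Reference-only sketch of the LoRA "shrink" op that the optimized kernels implement.
# Argument names are illustrative, not the exact signature used in this PR.
import torch


def bgmv_shrink_ref(
    inputs: torch.Tensor,          # [num_tokens, hidden_size]
    lora_a_weights: torch.Tensor,  # [num_loras, lora_rank, hidden_size]
    output_tensor: torch.Tensor,   # [num_tokens, lora_rank], accumulated in place
    lora_indices: torch.Tensor,    # [num_tokens], LoRA id per token, -1 = no LoRA
    scaling: float = 1.0,
) -> None:
    for i in range(inputs.size(0)):
        idx = int(lora_indices[i])
        if idx < 0:
            continue  # token has no active LoRA adapter
        # Gather this token's LoRA A matrix and project down to the LoRA rank.
        output_tensor[i] += scaling * (lora_a_weights[idx] @ inputs[i])
```

A per-token loop like this is only a semantic reference; both the Triton and IPEX paths implement it as a single batched kernel, and the throughput numbers above compare those two implementations.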