[XPU] IPEX-optimized Punica Wrapper on XPU #21703
Conversation
Code Review
This pull request introduces a new Punica wrapper for XPU using IPEX kernels, which shows a significant performance improvement. The implementation looks solid, but I've found a few issues that need to be addressed. There's a critical typo in an environment variable name that would prevent it from working as intended. Additionally, there are a couple of high-severity issues related to deprecated API usage and a return type violation that should be fixed for code quality and future compatibility.
vllm/platforms/xpu.py
There's a typo in the environment variable name. It should be XPU_USE_TRITON_KERNEL instead of XPU_USE_TRITION_KERNEL. This typo will prevent users from correctly enabling the Triton kernel path.
xpu_use_triton_kernel = os.getenv("XPU_USE_TRITION_KERNEL", "0") == "1" | |
xpu_use_triton_kernel = os.getenv("XPU_USE_TRITON_KERNEL", "0") == "1" |
vllm/platforms/xpu.py
Maybe we keep this env in our internal code, but for upstream let's just use the IPEX kernel path.
Currently, my internal tests show that the Triton kernel performs better than the XPU kernel in torch.compile mode. Could we keep this option so the Triton path can still be used with torch.compile?
I'd prefer naming it XPU_PUNICA_USE_TRITON.
Removed this env for now; we'll keep it only in our internal code.
LGTM, please address the pre-commit failure
@DarkLight1337 @mgoin, may we get a quick review on this PR?
Test Plan
XPU_USE_TRITION_KERNEL=0 VLLM_USE_V1=1 CCL_ZE_IPC_EXCHANGE=drmfd VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 VLLM_WORKER_MULTIPROC_METHOD=spawn python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --max-loras 1 --max-lora-rank 8 --enable-lora --lora-path "yard1/llama-2-7b-sql-lora-test" --enforce_eager
XPU (IPEX) kernel run:
Processed prompts: 100% | 1000/1000 [03:08<00:00, 5.31it/s, est. speed input: 1306.31 toks/s, output: 1249.40 toks/s]
Throughput: 5.29 requests/s, 2543.90 total tokens/s, 1243.62 output tokens/s
Total num prompt tokens: 245995
Total num output tokens: 235278

Triton kernel run:
Processed prompts: 100% | 1000/1000 [03:50<00:00, 4.34it/s, est. speed input: 1067.74 toks/s, output: 1021.23 toks/s]
Throughput: 4.33 requests/s, 2082.29 total tokens/s, 1017.96 output tokens/s
Total num prompt tokens: 245995
Total num output tokens: 235278

(Both runs logged a few "Encountered invalid prefix detokenization error ... resetting decode stream" warnings.)
The IPEX-based XPU kernel achieved a 1.22x throughput improvement over the Triton kernel (from 2082.29 to 2543.90 total tokens/s).
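For context on what the accelerated kernels compute, here is a plain-PyTorch reference of the Punica bgmv_shrink semantics: gather a per-token LoRA A matrix and project the hidden state down to the LoRA rank. The argument names are illustrative; this is not the IPEX call added in vllm/lora/ops/ipex_ops/lora_ops.py.

```python
# Reference-only sketch of the LoRA "shrink" op that the optimized kernels implement.
# Argument names are illustrative, not the exact signature used in this PR.
import torch


def bgmv_shrink_ref(
    inputs: torch.Tensor,          # [num_tokens, hidden_size]
    lora_a_weights: torch.Tensor,  # [num_loras, lora_rank, hidden_size]
    output_tensor: torch.Tensor,   # [num_tokens, lora_rank], accumulated in place
    lora_indices: torch.Tensor,    # [num_tokens], LoRA id per token, -1 = no LoRA
    scaling: float = 1.0,
) -> None:
    for i in range(inputs.size(0)):
        idx = int(lora_indices[i])
        if idx < 0:
            continue  # token has no active LoRA adapter
        # Gather this token's LoRA A matrix and project down to the LoRA rank.
        output_tensor[i] += scaling * (lora_a_weights[idx] @ inputs[i])
```

A per-token loop like this is only a semantic reference; both the Triton and IPEX paths implement it as a single batched kernel, and the throughput numbers above compare those two implementations.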